Reading PDF File using Python Web Scraping

Most of us are familiar with PDFs. In fact, PDF is one of the most important and widely used digital document formats. PDF stands for Portable Document Format and uses the .pdf extension. It is used to present and exchange documents reliably, independent of software, hardware, or operating system. This tutorial shows how to read a PDF file with Python.

Invented by Adobe, PDF is now an open standard maintained by the International Organization for Standardization (ISO). PDFs can contain links and buttons, form fields, audio, video, and business logic.

In one of our previous tutorials, we learned how to download a PDF file using the requests library. To use the data in a PDF file in a meaningful way, such as text analysis, summarization, or sentiment analysis, you first need to be able to read that data using Python or another programming language.

In this tutorial we will learn how to read data from a PDF file. To do that, we will use a library called PyPDF2, which was created specifically to work with PDF files. You can read more about this library at https://pypi.org/project/PyPDF2/.

It is capable of:

  • reading documents
  • splitting documents page by page
  • merging documents page by page
  • cropping pages
  • merging multiple pages into a single page
  • encrypting and decrypting PDF files
  • and more!
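
As an illustration of the "splitting documents page by page" item, here is a minimal sketch. It assumes PyPDF2 3.x, where the writer class is named PdfWriter and offers add_page() and write(). The function name split_pages and its make_writer parameter are hypothetical names chosen for this example; they are not part of the library.

```python
import io


def split_pages(reader, make_writer=None):
    """Return a list of bytes objects, one single-page PDF per input page.

    `reader` is expected to behave like PyPDF2.PdfReader (it needs `.pages`).
    `make_writer` is a hypothetical hook for supplying the writer class.
    """
    if make_writer is None:
        # Assumption: PyPDF2 3.x is installed and exposes PdfWriter
        from PyPDF2 import PdfWriter
        make_writer = PdfWriter

    single_page_pdfs = []
    for page in reader.pages:
        writer = make_writer()
        writer.add_page(page)   # copy this one page into a fresh document
        buffer = io.BytesIO()
        writer.write(buffer)    # serialize the one-page PDF into memory
        single_page_pdfs.append(buffer.getvalue())
    return single_page_pdfs
```

With PyPDF2 installed, calling split_pages(PdfReader('python.pdf')) should return one bytes object per page, and each of them could then be saved to its own .pdf file.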

We will read the same PDF that we downloaded in our “Downloading PDF” tutorial. The complete code for reading the file is below. Watch the video for further details.

First, install the library from the command line:

pip install PyPDF2

Then run the following script:

# importing the reader class
# (PyPDF2 3.0 removed PdfFileReader, getPage and extractText)
from PyPDF2 import PdfReader

# opening the PDF in binary mode; the with block closes the file automatically
with open('python.pdf', 'rb') as pdf_file:
    # creating a PDF reader object
    pdf_reader = PdfReader(pdf_file)

    # getting the first page (pages are indexed from 0)
    page = pdf_reader.pages[0]

    # extracting text from the page
    print(page.extract_text())
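
The script above reads only the first page. A minimal sketch of reading every page could look like the following. It assumes PyPDF2 3.x naming (PdfReader, .pages, extract_text()). The helper names extract_pages_text and read_pdf_text are hypothetical and used only for illustration.

```python
def extract_pages_text(reader):
    """Return a list with the extracted text of every page in `reader`.

    `reader` is expected to behave like PyPDF2.PdfReader (it needs `.pages`).
    """
    texts = []
    for page in reader.pages:
        # Fall back to an empty string if a page yields no text
        # (for example, a scanned page that contains only an image)
        texts.append(page.extract_text() or "")
    return texts


def read_pdf_text(path):
    """Open the PDF at `path` and return its text, one string per page."""
    # Assumption: PyPDF2 3.x is installed and exposes PdfReader
    from PyPDF2 import PdfReader

    with open(path, "rb") as pdf_file:
        reader = PdfReader(pdf_file)
        # Extract inside the with block, while the file is still open
        return extract_pages_text(reader)
```

For example, "\n".join(read_pdf_text('python.pdf')) would give the text of the whole document as a single string, ready for further processing.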

Use this script to read a PDF file and extract data from it. We have expertise in PDF data extraction, so if you have any questions, feel free to use our services.