Working with PDF Files in Python
Since NLP deals with text, PDF files come up often. We need a way to read and write PDF files in Python. Several Python libraries work with PDF files, such as:
pdfrw, Slate, PDFQuery, PDFMiner, and PyPDF2. To read more about them, please visit PDF Python Libraries.
With the pdfrw library, for example, we can read and write PDFs using Python 2.5, and images can also be copied from one PDF file to another.
PyPDF2, however, is a commonly used library. It deals only with text in UTF-8 encoding and can be installed with the following command:
pip install PyPDF2
To see the python code, please visit my GitHub Link.
After installing it, we import the library:
import PyPDF2
Reading a PDF:
I’m going to read a PDF file about Covid-19 prevention. It consists of two pages.
Reading a PDF is similar to reading a text file, but instead of the text reading mode ‘r’ we use the binary reading mode ‘rb’, because a PDF is not a plain text file. To read it, we first open the file, then create a reader object and pass the opened file to it.
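The open-then-wrap step described above might look like the following minimal sketch. It assumes an older PyPDF2 release (before 3.0), where the reader class is called `PdfFileReader`; newer releases name it `PdfReader`. The helper names `read_pdf_header` and `open_pdf_reader` and the file name `prevention.pdf` are illustrative assumptions, not taken from the original code.

```python
def read_pdf_header(path, size=8):
    """Read the first bytes of a file in binary mode ('rb')."""
    with open(path, "rb") as my_file:  # 'rb' returns raw bytes, no text decoding
        return my_file.read(size)


def open_pdf_reader(path):
    """Open a PDF in binary mode and wrap it in a PyPDF2 reader.

    Assumes PyPDF2 < 3.0. The caller must close the returned file object.
    """
    import PyPDF2  # third-party; imported lazily so this module loads without it

    my_file = open(path, "rb")
    pdf_reader = PyPDF2.PdfFileReader(my_file)
    return my_file, pdf_reader


# Usage (assumes a file named 'prevention.pdf' sits next to the script):
# my_file, pdf_reader = open_pdf_reader("prevention.pdf")
# ... read pages here ...
# my_file.close()
```

A PDF file starts with the bytes `%PDF-`, so `read_pdf_header` doubles as a quick check that the file really is a PDF before handing it to the reader.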
my_file is now an open file object. We’ll run some reading and writing exercises, and then close the file to free its resources.
We can perform operations such as finding the number of pages as follows:
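As a small sketch, assuming a reader object created with PyPDF2 before version 3.0 (where the page count is exposed as `numPages`; newer releases use `len(reader.pages)` instead), the page count can be read like this. The helper name `count_pages` is hypothetical.

```python
def count_pages(pdf_reader):
    """Return the number of pages reported by a PyPDF2 reader.

    Assumes PyPDF2 < 3.0, where the count is the `numPages` attribute.
    """
    return pdf_reader.numPages


# Usage with a reader created from the opened file:
# print(count_pages(pdf_reader))
```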
The pdf_reader reports two pages, as expected.
Extracting Text:
Now, to extract text from our PDF, we use the ‘extractText’ method, but we have to select a page first. PyPDF2 numbers pages from zero, so page #1 has index 0. Let’s read the text of page #1 as follows:
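A minimal sketch of this step, assuming the pre-3.0 PyPDF2 names `getPage` and `extractText` used in this article (newer releases use `reader.pages[i]` and `extract_text`). The helper `extract_page_text` is a hypothetical wrapper that takes a 1-based page number and converts it to PyPDF2's zero-based index.

```python
def extract_page_text(pdf_reader, page_number):
    """Return the text of a page, where page_number starts at 1.

    Assumes PyPDF2 < 3.0 method names (getPage, extractText, numPages).
    """
    if not 1 <= page_number <= pdf_reader.numPages:
        raise ValueError("page_number is out of range")
    page = pdf_reader.getPage(page_number - 1)  # PyPDF2 counts pages from 0
    return page.extractText()


# Usage with an existing reader:
# page_one_text = extract_page_text(pdf_reader, 1)
# page_two_text = extract_page_text(pdf_reader, 2)
```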
Reading the text of the second page works the same way, this time selecting page #2 (index 1).
In the extracted text, ‘\n’ marks the start of a new line. To show each sentence on its own line, we simply apply the ‘print’ function:
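The difference between the raw string and its printed form can be seen with a short sketch. The sample text below is made up for illustration and is not the actual content of the PDF.

```python
# Made-up sample standing in for what extractText() might return
page_text = "Wash your hands often.\nKeep a safe distance.\n"

print(repr(page_text))  # raw view: the '\n' escapes are visible
print(page_text)        # printed view: each sentence on its own line

# splitlines() gives the same sentences as a list, without the '\n'
lines = page_text.splitlines()
```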
Now, to collect all the text from a file with more than one page, we use a ‘for’ loop: we open the file, read it, extract the text of each page, and print the text of all pages:
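The loop could be sketched as follows, again assuming the pre-3.0 PyPDF2 names (`PdfFileReader`, `numPages`, `getPage`, `extractText`). The function names `collect_page_texts` and `extract_all_text` and the file name in the usage comment are assumptions made for this example.

```python
def collect_page_texts(pdf_reader):
    """Return a list with the extracted text of every page, in order."""
    pdf_text = []
    for page_index in range(pdf_reader.numPages):
        page = pdf_reader.getPage(page_index)
        pdf_text.append(page.extractText())
    return pdf_text


def extract_all_text(path):
    """Open a PDF, extract the text of all pages, and close the file."""
    import PyPDF2  # third-party; assumes PyPDF2 < 3.0

    with open(path, "rb") as my_file:  # the file is closed when the block ends
        pdf_reader = PyPDF2.PdfFileReader(my_file)
        return collect_page_texts(pdf_reader)


# Usage (assumes 'prevention.pdf' exists):
# for page_text in extract_all_text("prevention.pdf"):
#     print(page_text)
```

Keeping the loop in its own function means the text is collected while the file is still open, and the `with` block takes care of closing it afterwards.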
The output is the text of all pages, here two pages.
Writing to PDFs
Unfortunately, PyPDF2 cannot write arbitrary text directly into a PDF, because of the different types of text that PDFs contain. However, pages from other PDFs can be appended. To append pages, we need to create our own PDF document.
So we first create a writer object with ‘PdfFileWriter()’, which lets us append PDF pages obtained from a ‘PdfFileReader()’.
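A minimal sketch of copying a page from a reader into a writer, assuming the pre-3.0 PyPDF2 names `PdfFileWriter`, `getPage` and `addPage` (newer releases use `PdfWriter` and `add_page`). The helper `add_page_to_writer` and the variable `pdf_writer` are hypothetical names chosen for this example.

```python
def add_page_to_writer(pdf_reader, pdf_writer, page_index):
    """Take one page from the reader and append it to the writer.

    page_index is zero-based, so index 1 is the second page.
    """
    page = pdf_reader.getPage(page_index)
    pdf_writer.addPage(page)
    return page


# Usage (assumes PyPDF2 < 3.0 and a file named 'prevention.pdf'):
# import PyPDF2
# my_file = open("prevention.pdf", "rb")
# pdf_reader = PyPDF2.PdfFileReader(my_file)
# pdf_writer = PyPDF2.PdfFileWriter()
# second_page = add_page_to_writer(pdf_reader, pdf_writer, 1)
```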
In this step, ‘second_page’ was read by ‘PdfFileReader()’ from our original file ‘prevention’ and added to the ‘PdfFileWriter()’ object.
Now I will create my own PDF document and write to it by appending ‘second_page’. To do this, I open a PDF file with the name I want in ‘wb’ (write binary) mode. Remember that this is not a text file but a PDF, so both reading and writing use binary modes.
After finishing, it is a good idea to close the file, so that opening it from elsewhere on the computer or from a USB drive does not cause problems because the file is still open in Python.
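Writing the collected pages to a new file and closing it can be sketched as below. It assumes a writer object that exposes `write(stream)`, as PyPDF2's writer does. The helper name `save_pdf` and the output name `my_doc.pdf` are assumptions made for this example.

```python
def save_pdf(pdf_writer, output_path):
    """Write the writer's pages to output_path in binary mode ('wb')."""
    with open(output_path, "wb") as pdf_output:
        pdf_writer.write(pdf_output)
    # Leaving the 'with' block closes the file, so other programs can open it


# Usage, with a writer that already holds 'second_page':
# save_pdf(pdf_writer, "my_doc.pdf")
```

Using a `with` block here is a design choice: it closes the file even if writing raises an error, which avoids the "file still open" problem described above.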
Now the PDF document I created has only the one page we added above. To make sure it was copied, we read my_doc, extract its text, and check its number of pages, as we did for the ‘prevention’ PDF file.
So we did write to a PDF document, by appending a page from another PDF. Note that even though PyPDF2 was able to copy the text, it could not extract the images.
Sources:
NLP - Natural Language Processing with Python, Udemy, Working With PDFs