Extracting tables from a PDF document for curation

Parag Kar
2 min read · Oct 30, 2022

Many tools can extract tables from a PDF document, some free and some paid. But what if you could build your own, one that does the job reliably every time you need it and at no cost? This note shares a simple piece of code that you can use for exactly that.

What do you need?

You need Jupyter Notebook installed on your system. To install it, open a fresh terminal on Mac/Windows and type the following command.

pip install jupyter notebook

Once Jupyter is installed, open Jupyter Notebook by typing the following command.

jupyter notebook

Then open a fresh notebook to hold the code. The code needs Camelot and PyPDF2 to work. Camelot can be installed on your system by following its official installation instructions, and PyPDF2 with the simple command below.

pip install PyPDF2
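
Before moving on, it can help to confirm from inside the notebook that both packages are importable. The sketch below is one way to do that, assuming the usual import names `camelot` and `PyPDF2`; the helper `missing_packages` is a hypothetical name used only for illustration.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that cannot be imported in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed import names: Camelot is imported as `camelot`, PyPDF2 as `PyPDF2`.
missing = missing_packages(["camelot", "PyPDF2"])
if missing:
    print("Still to install:", ", ".join(missing))
else:
    print("Camelot and PyPDF2 are ready to use.")
```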

Then copy-paste the following code.

import camelot as cam
from PyPDF2 import PdfReader  # called PdfFileReader in older PyPDF2 releases

infile = "Data.pdf"
outfile = "Data.csv"

# to clean the output file
with open(outfile, "w") as f:
    f.truncate()

# to count the number of pages of the PDF document
reader = PdfReader(infile)
number_of_pages = len(reader.pages)
print(number_of_pages)

# to iterate over each page of the PDF document and read its contents
for j in range(0, number_of_pages):
    pNo = str(j + 1)  # Camelot expects 1-based page numbers as a string
    table = cam.read_pdf(infile, pages=pNo)
    # to append all the tables in a single file
    with open(outfile, "a") as f:
        for i in range(0, 1):  # only the first table on each page is written
            table[i].df.to_csv(f)
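
The loop above reads the first table on every page, which raises an `IndexError` if Camelot finds no table on a page (a cover page, for example). A minimal sketch of a guard follows; `append_first_table` is a hypothetical helper, not part of Camelot, and it only assumes that each table exposes its data as `.df`, as in the code above.

```python
def append_first_table(tables, handle):
    """Write the first table in `tables` to the open file `handle` as CSV.

    Returns True if a table was written and False if the page had none.
    `tables` is whatever cam.read_pdf() returned for a single page.
    """
    if len(tables) == 0:
        return False  # nothing detected on this page, so skip it
    tables[0].df.to_csv(handle)
    return True
```

Inside the page loop, the inner `for` could then be swapped for something like `if not append_first_table(table, f): print("No table found on page", pNo)`.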

You are all set to go. Note that this works seamlessly only if the tables are spread over different pages and share the same structure, i.e. the same number of columns, with each column representing the same dimension. Otherwise, you need to split the output into a separate file for each table. Also, make sure your PDF data file is in the same default directory as the notebook; otherwise you have to specify its full path for this to work. Hope you find this useful. Thanks.
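
When a page holds several tables, or the tables differ in structure, one option is to write each table to its own CSV file instead of appending everything to one. The sketch below is only an illustration: it assumes Camelot's `pages="all"` option and that each table exposes `.df`, and the function names, the `tables_out` folder, and the file-naming scheme are all hypothetical. It also checks the PDF path up front, which covers the case where the file is not in the notebook's working directory.

```python
from pathlib import Path


def csv_name(pdf_path, table_number):
    """Build an output name such as Data_table_3.csv from the PDF's file name."""
    return f"{Path(pdf_path).stem}_table_{table_number}.csv"


def export_tables_separately(pdf_path, out_dir="tables_out"):
    """Write every table Camelot finds in `pdf_path` to its own CSV file."""
    pdf_path = Path(pdf_path)
    if not pdf_path.is_file():
        raise FileNotFoundError(f"PDF not found: {pdf_path.resolve()}")

    import camelot as cam  # imported here so the path check runs even without Camelot

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # pages="all" asks Camelot to scan the whole document in one call.
    tables = cam.read_pdf(str(pdf_path), pages="all")
    written = []
    for number, table in enumerate(tables, start=1):
        target = out_dir / csv_name(pdf_path, number)
        table.df.to_csv(target)
        written.append(target)
    return written
```

Calling `export_tables_separately("Data.pdf")` would place files named `Data_table_1.csv`, `Data_table_2.csv`, and so on in the `tables_out` folder, one per detected table.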

Written by Parag Kar

EX Vice President, Government Affairs, India and South Asia at QUALCOMM
