Most data available in the public domain is unstructured and unsuitable for direct processing. Curating such data into a usable form is therefore an essential step before any visualization work. In this note, I will discuss how to process tables in which category labels appear as rows sitting on top of the data they describe.
Example
Let's examine the data shown in the table below.
You can clearly see that if you convert this table from PDF to Excel, you have to deal with the category labels marked above. Unless these labels are extracted and reproduced as columns running parallel to the data, the table is of little use. How do we do that? Let's find out.
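To make the goal concrete, here is a minimal pandas sketch on toy, made-up data (not the actual table) showing the shape of the transformation we are after: category rows are copied into a new column, forward-filled onto their items, and then dropped.

import pandas as pd

# Toy, made-up data: category labels sit on rows above their items
df = pd.DataFrame({"Item": ["A. Tax Revenue", "i) Sales Tax", "ii) Excise",
                            "B. Non-Tax Revenue", "i) Interest"],
                   "Value": [None, 100, 50, None, 20]})

# Copy the label rows into a new column, forward-fill them onto the
# item rows, then drop the label rows themselves
is_cat = df["Item"].str.match(r"^[A-Z]\.")
df["Category"] = df["Item"].where(is_cat).ffill()
df = df[~is_cat]
print(df)

The real table has several nested levels of labels rather than one, which is what the code in Step 2 handles.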
Step 1 — PDF to Excel Conversion
If you try converting this table to Excel using even the best of the available tools, you will run into a problem that is surprisingly hard to resolve. This problem is described below.
Can you see the problem? All the items show up aggregated in a single cell. Unless they are separated out and mapped to individual rows, this table is of no use. Fortunately, there is a workaround: copy the table into a Word file and then paste it back into a new sheet in Excel. You will find that the items magically separate out into individual rows. The output is shown below.
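As an aside, if you would rather skip the Word round trip, a scripted alternative may work, assuming the converter separates the items inside the merged cell with newline characters (this depends entirely on your converter, so treat this as a sketch, not a guarantee):

import pandas as pd

# Sketch only: read the converted sheet, split the merged cell on
# newlines, and give each item its own row
df = pd.read_excel("States2.xlsx", header=None)  # file name taken from Step 2
df[0] = df[0].astype(str).str.split("\n")        # assumes '\n' separates the items
df = df.explode(0, ignore_index=True)            # one item per row

Whether this works depends on how the converter laid out the other columns, so inspect the result before moving on.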
Step 2 — Extracting Category Rows into Parallel Columns
Till now it was all easy, but now comes the hard part, for which you have to write some code.
import os
import re

import pandas as pd

os.chdir("file path")  # path for States2.xlsx
df = pd.read_excel("States2.xlsx", header=None)
# drop fully empty rows and renumber so positions line up with df.loc below
df = df.dropna(axis="rows", how="all").reset_index(drop=True)

# valid markers for each category level
list1 = list(range(1, 15))
list2 = ["A", "B", "C", "D"]
list3 = ["i)", "ii)", "iii)", "iv)", "v)", "vi)", "vii)",
         "viii)", "ix)", "x)", "xi)", "xii)", "xiii)",
         "xiv)", "xv)", "xvi)", "xvii)", "of which:"]

# initializing variables for storing categories
cat1 = ""
cat2 = ""
cat3 = ""
cat4 = ""

# main program: remember the last category label seen at each level and
# stamp it onto every row as a new column
for i, line in enumerate(df.values):
    match1 = re.findall(r'^(\d+)\.', str(line[0]))              # "3. ..." style
    match2 = re.findall(r'^([A-F])\.', str(line[0]))            # "B. ..." style
    match3 = re.findall(r'^([ivx)]+|of which:)', str(line[0]))  # "iv) ..." style
    match4 = re.findall(r'^(I+)\.', str(line[0]))               # "II. ..." style
    if len(match1) > 0 and int(match1[0]) in list1:
        cat1 = line[0]
        cat3 = ""
    if len(match2) > 0 and match2[0] in list2:
        cat2 = line[0]
        cat1 = ""
        cat3 = ""
    if len(match3) > 0 and match3[0] in list3:
        cat3 = line[0]
    if len(match4) > 0:
        cat4 = line[0]
    df.loc[i, "Cat1"] = cat1
    df.loc[i, "Cat2"] = cat2
    df.loc[i, "Cat3"] = cat3
    df.loc[i, "Cat4"] = cat4

# replacing the odd en dash with a proper hyphen
df = df.replace(to_replace=r'–', value='-', regex=True)
df.to_csv("processed.csv", index=False, header=False)
This code does nothing more than identify the pattern of each category level (using regular expressions) and copy the matching label into its own column, so that every data row carries its parent categories.
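To see what each pattern picks up, here is a small demo on made-up label strings (these are illustrative, not rows from the actual table):

import re

# Made-up sample labels, one per category level (not from the real table)
samples = ["3. Stamps and Registration", "B. Non-Tax Revenue",
           "iv) Dividends", "II. CAPITAL RECEIPTS", "plain data row"]

patterns = {"Cat1": r'^(\d+)\.',    # numbered category
            "Cat2": r'^([A-F])\.',  # lettered category
            "Cat3": r'^([ivx)]+)',  # roman-numeral item
            "Cat4": r'^(I+)\.'}     # top-level roman category

for s in samples:
    hits = [name for name, pat in patterns.items() if re.findall(pat, s)]
    print(f"{s!r:40} -> {hits}")

A plain data row matches none of the patterns, so it simply inherits whichever labels are currently stored in the cat variables.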
Step 3 — Removing Redundant Rows
To remove the redundant rows, we process the file from Step 2 with the following code.
with open("processed.csv","r") as file:
df = pd.read_csv(file)
for i, line in enumerate(df.values):
match2 = re.findall(r'(^[A-F]\.)',str(line[0]))
if len(match2)>0:
letter2 = match2[0][0]
k=0
if letter2 in list2:
k=k+1
if k==1:
df.drop(i, inplace = True)
match4 = re.findall(r'(^[I]+\.)',str(line[0]))
l=0
if len(match4)>0:
l=l+1
if l==1:
df.drop(i, inplace = True)
match5 = re.findall(r'TOTAL REVENUE',str(line[0]))
if len(match5)>0:
df.drop(i, inplace = True)
df = df[pd.notnull(df.iloc[:, 0])] # dropping all rows with NaN in 1st Col
df.to_csv("Final.csv",index =False, header = None)
The idea is to delete the rows for those categories whose labels are now mapped onto their individual items, so that the sum of the items faithfully adds up to the total of the category that was removed.
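If you want to sanity-check the result, here is a hedged sketch; the column positions below are assumptions you would substitute for your own file, not facts about it:

import pandas as pd

# Step 3 wrote no header row, and Step 2 appended the four Cat columns
# last, so Cat2 sits third from the end of the file
df = pd.read_csv("Final.csv", header=None)
value_col = 1                 # hypothetical position of a numeric value column
cat2_col = df.columns[-3]     # position of the Cat2 column
df[value_col] = pd.to_numeric(df[value_col], errors="coerce")
print(df.groupby(cat2_col)[value_col].sum())

Each printed group total should match the figure printed on the corresponding category row in the original PDF.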
The final processed file is shown in the table below.
This file is now in a format that can be easily consumed by a visualization tool such as Tableau. I hope you find this useful. Thanks.