Most data available in the public domain is unstructured and unsuitable for direct processing. Curating such data into a usable form is therefore an essential step before any visualization work. In this note, I will discuss how to process tables in which category labels appear as rows sitting on top of the data they describe.
Example
Let's examine the data shown in the table below.
You can clearly see that if you convert this table from PDF to Excel, you have to deal with the category labels marked above. Unless these labels are extracted and reproduced as columns running parallel to the data, the table is of little use. How do we do that? Let's find out.
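To make the goal concrete, here is a minimal pandas sketch on toy, made-up data (not the actual table) showing the shape of the transformation we are after: category rows are copied into a new column, forward-filled onto their items, and then dropped.

import pandas as pd

# Toy, made-up data: category labels sit on rows above their items
df = pd.DataFrame({"Item": ["A. Tax Revenue", "i) Sales Tax", "ii) Excise",
                            "B. Non-Tax Revenue", "i) Interest"],
                   "Value": [None, 100, 50, None, 20]})

# Copy the label rows into a new column, forward-fill them onto the
# item rows, then drop the label rows themselves
is_cat = df["Item"].str.match(r"^[A-Z]\.")
df["Category"] = df["Item"].where(is_cat).ffill()
df = df[~is_cat]
print(df)

The real table has several nested levels of labels rather than one, which is what the code in Step 2 handles.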
Step 1 — PDF to Excel Conversion
If you try converting this table to Excel using even the best of the available tools, you will run into a problem that is surprisingly hard to resolve. This problem is described below.
Can you see the problem? All the items show up aggregated in a single cell. Unless they are separated out and mapped to individual rows, this table is of no use. Fortunately, there is a workaround: copy the table into a Word file and then paste it back into a new sheet in Excel. You will find that the items magically separate out into individual rows. The output is shown below.
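As an aside, if you would rather skip the Word round trip, a scripted alternative may work, assuming the converter separates the items inside the merged cell with newline characters (this depends entirely on your converter, so treat this as a sketch, not a guarantee):

import pandas as pd

# Sketch only: read the converted sheet, split the merged cell on
# newlines, and give each item its own row
df = pd.read_excel("States2.xlsx", header=None)  # file name taken from Step 2
df[0] = df[0].astype(str).str.split("\n")        # assumes '\n' separates the items
df = df.explode(0, ignore_index=True)            # one item per row

Whether this works depends on how the converter laid out the other columns, so inspect the result before moving on.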
Step 2 — Extracting Category Rows into Parallel Columns
Till now it was all easy, but now comes the hard part, for which you have to write some code.
import os
import re

import pandas as pd

os.chdir("file path")  # path for States2.xlsx
df = pd.read_excel("States2.xlsx", header=None)
# drop fully empty rows and renumber so positions line up with df.loc below
df = df.dropna(axis="rows", how="all").reset_index(drop=True)

# valid markers for each category level
list1 = list(range(1, 15))
list2 = ["A", "B", "C", "D"]
list3 = ["i)", "ii)", "iii)", "iv)", "v)", "vi)", "vii)",
         "viii)", "ix)", "x)", "xi)", "xii)", "xiii)",
         "xiv)", "xv)", "xvi)", "xvii)", "of which:"]

# initializing variables for storing categories
cat1 = ""
cat2 = ""
cat3 = ""
cat4 = ""

# main program: remember the last category label seen at each level and
# stamp it onto every row as a new column
for i, line in enumerate(df.values):
    match1 = re.findall(r'^(\d+)\.', str(line[0]))              # "3. ..." style
    match2 = re.findall(r'^([A-F])\.', str(line[0]))            # "B. ..." style
    match3 = re.findall(r'^([ivx)]+|of which:)', str(line[0]))  # "iv) ..." style
    match4 = re.findall(r'^(I+)\.', str(line[0]))               # "II. ..." style
    if len(match1) > 0 and int(match1[0]) in list1:
        cat1 = line[0]
        cat3 = ""
    if len(match2) > 0 and match2[0] in list2:
        cat2 = line[0]
        cat1 = ""
        cat3 = ""
    if len(match3) > 0 and match3[0] in list3:
        cat3 = line[0]
    if len(match4) > 0:
        cat4 = line[0]
    df.loc[i, "Cat1"] = cat1
    df.loc[i, "Cat2"] = cat2
    df.loc[i, "Cat3"] = cat3
    df.loc[i, "Cat4"] = cat4

# replacing the odd en dash with a proper hyphen
df = df.replace(to_replace=r'–', value='-', regex=True)
df.to_csv("processed.csv", index=False, header=False)
This code does nothing more than identify the pattern of each category level (using regular expressions) and copy the matching label into its own column, so that every data row carries its parent categories.
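To see what each pattern picks up, here is a small demo on made-up label strings (these are illustrative, not rows from the actual table):

import re

# Made-up sample labels, one per category level (not from the real table)
samples = ["3. Stamps and Registration", "B. Non-Tax Revenue",
           "iv) Dividends", "II. CAPITAL RECEIPTS", "plain data row"]

patterns = {"Cat1": r'^(\d+)\.',    # numbered category
            "Cat2": r'^([A-F])\.',  # lettered category
            "Cat3": r'^([ivx)]+)',  # roman-numeral item
            "Cat4": r'^(I+)\.'}     # top-level roman category

for s in samples:
    hits = [name for name, pat in patterns.items() if re.findall(pat, s)]
    print(f"{s!r:40} -> {hits}")

A plain data row matches none of the patterns, so it simply inherits whichever labels are currently stored in the cat variables.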
Step 3 — Removing Redundant Rows
To remove the redundant rows, we process the file from Step 2 with the following code.
with open("processed.csv","r") as file:
df = pd.read_csv(file)
for i, line in enumerate(df.values):
match2 = re.findall(r'(^[A-F]\.)',str(line[0]))
if len(match2)>0:
letter2 = match2[0][0]
k=0
if letter2 in list2:
k=k+1
if k==1:
df.drop(i, inplace = True)
match4 = re.findall(r'(^[I]+\.)',str(line[0]))
l=0
if len(match4)>0:
l=l+1
if l==1:
df.drop(i, inplace = True)
match5 = re.findall(r'TOTAL REVENUE',str(line[0]))
if len(match5)>0:
df.drop(i, inplace = True)
df = df[pd.notnull(df.iloc[:, 0])] # dropping all rows with NaN in 1st Col
df.to_csv("Final.csv",index =False, header = None)
The idea is to delete the rows for those categories whose labels are now mapped onto their individual items, so that the sum of the items faithfully adds up to the total of the category that was removed.
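If you want to sanity-check the result, here is a hedged sketch; the column positions below are assumptions you would substitute for your own file, not facts about it:

import pandas as pd

# Step 3 wrote no header row, and Step 2 appended the four Cat columns
# last, so Cat2 sits third from the end of the file
df = pd.read_csv("Final.csv", header=None)
value_col = 1                 # hypothetical position of a numeric value column
cat2_col = df.columns[-3]     # position of the Cat2 column
df[value_col] = pd.to_numeric(df[value_col], errors="coerce")
print(df.groupby(cat2_col)[value_col].sum())

Each printed group total should match the figure printed on the corresponding category row in the original PDF.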
The final processed file is shown in the table below.
This file is now in a format that can be easily consumed by a visualization tool such as Tableau. I hope you find this useful. Thanks.