In this note, I discuss how to curate a data table extracted from the RBI’s website so that it can be used for visualization. The idea is to keep the process simple by laying down the principles used, so that they can easily be tweaked (if need be) in other scenarios.
Step 1 — Downloading Raw Data
The following is a snapshot of the raw data I downloaded from the RBI’s website.
One can clearly see that it is impossible to process the data in this form. Hence we have to clean it so that the headings are extracted into separate columns.
Step 2 — Identifying Patterns
One can clearly see that all the text in the first column starts with a number in a certain format: 1, 1.1, 1.1.1, 1.2.1, and so on. The idea is to pick up this pattern with a regular expression, so that the notes at the bottom of the table get deleted automatically. The code for this is reproduced below.
import re
import pandas as pd

file = "raw_data.xlsx"  # file to process

# regular expression of the pattern that we want to include
# (row headings numbered 1, 1.1, 1.1.1, and so on)
pattern1 = re.compile(r'''
(
^(\d\s?\.?\d?\.?\d?\.?\d?)
)
''', re.VERBOSE)

# regular expression of the pattern that we want to exclude
pattern2 = re.compile(r'''
(
:
)
''', re.VERBOSE)

# round 1 processing
df = pd.read_excel(file)
for i, value in enumerate(df.values):
    matches1 = re.findall(pattern1, str(value[1]))
    matches2 = re.findall(pattern2, str(value[1]))
    if not len(matches1) > 0 or len(matches2) > 0:
        # deleting all rows with patterns other than those identified above
        df = df.drop(i)
df.reset_index(drop=True, inplace=True)  # resetting the row index of the df
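To see what the two patterns do together, here is a quick sanity check on a few sample strings (the headings below are invented for illustration, not actual RBI data):

# quick sanity check of the include/exclude patterns on invented samples
samples = ["1 Capital", "1.1 Paid-up Capital", "Note: figures are provisional"]
for s in samples:
    keep = len(re.findall(pattern1, s)) > 0 and len(re.findall(pattern2, s)) == 0
    print(s, "->", "keep" if keep else "drop")
# the first two are kept; the "Note: ..." line is dropped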
The output looks something like this.
One can see that the bottom part of the data frame has been completely cleaned of garbage.
Step 3 — Extracting the Main & Sub Categories
The following is the code used for extracting the main and subcategories.
# round 2 processing
# regex pattern to identify the main category (e.g. "1 ")
main_cat = re.compile(r'''
(
^\d\s
)
''', re.VERBOSE)

# regex pattern to identify sub category 1 (e.g. "1.1 ")
sub_cat1 = re.compile(r'''
(
^(\d\.\d\s)
)
''', re.VERBOSE)

# regex pattern to identify sub category 2 (e.g. "1.1.1 ")
sub_cat2 = re.compile(r'''
(
^(\d\.\d\.\d\s)
)
''', re.VERBOSE)

# initialization of variables
maincat = ""
subcat1 = ""
subcat2 = ""
for i, value in enumerate(df.values):
    matches_maincat = re.findall(main_cat, str(value[1]))
    if len(matches_maincat) > 0:
        maincat = value[1]
    df.loc[i, "maincat"] = str(maincat)  # storing the main category in a separate col
    matches_subcat1 = re.findall(sub_cat1, str(value[1]))
    if len(matches_subcat1) > 0:
        subcat1 = value[1]
    df.loc[i, "subcat1"] = str(subcat1)  # storing sub category 1 in a separate col
    matches_subcat2 = re.findall(sub_cat2, str(value[1]))
    if len(matches_subcat2) > 0:
        subcat2 = value[1]
    df.loc[i, "subcat2"] = str(subcat2)  # storing sub category 2 in a separate col
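This loop is essentially a forward-fill of the last category seen. For reference, a more pandas-native sketch of the same idea (assuming, as above, that the description text sits in the second column) could look like this:

# pandas-native sketch of the same forward-fill idea
# (note: rows before the first match get NaN rather than "")
desc = df.iloc[:, 1].astype(str)
df["maincat"] = desc.where(desc.str.match(r"\d\s")).ffill()
df["subcat1"] = desc.where(desc.str.match(r"\d\.\d\s")).ffill()
df["subcat2"] = desc.where(desc.str.match(r"\d\.\d\.\d\s")).ffill()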
The output now looks something like this.
One can see that three new columns have been created, with the main and sub categories extracted from the first column.
Step 4 — Deleting All Rows with “Sub Totals”
You can see that the column listing the items also contains rows whose descriptions carry the text “Total”. We have to pick out these rows and filter them from the data. The following is a simple piece of code that can be used for this purpose.
# removing all rows whose text in column 1 contains "Total"
df.reset_index(drop=True, inplace=True)
filt = df.iloc[:, 1].str.contains("Total", na=False)  # na=False guards against NaN cells
df = df[~filt]
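As a quick illustration of the filter on a made-up frame (the item names below are invented):

# toy demonstration of the "Total" filter
import pandas as pd

toy = pd.DataFrame({"item": ["1.1 Demand Deposits", "Total Deposits", "1.2 Time Deposits"]})
print(toy[~toy["item"].str.contains("Total", na=False)])
# only the "1.1 ..." and "1.2 ..." rows survive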
Step 5 — Removing Rows with “NaN”
In the data above, you can clearly see many rows that are all “NaN”. These are simply not required for the purpose of visualization. Here is the code that can be used to filter out these lines.
# round 3 processing - removing all rows with null data
# (pattern_nul is the printed form of a 7-element row that is all NaN)
pattern_nul = "[nan nan nan nan nan nan nan]"
df.reset_index(drop=True, inplace=True)
for i, line in enumerate(df.iloc[:, [2, 3, 4, 5, 6, 7, 8]].values):
    line = str(line)
    if line == pattern_nul:  # checking for the null pattern
        df.drop(i, inplace=True)
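Matching on the printed form of the row works, but it is brittle: it depends on numpy’s array formatting and on there being exactly seven value columns. A more robust sketch of the same step uses dropna:

# equivalent, more robust version: drop rows in which all seven
# value columns (positions 2 to 8) are NaN
value_cols = df.columns[2:9]
df = df.dropna(subset=value_cols, how="all").reset_index(drop=True)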
The output of the above processing is embedded in the following picture.
Step 6 — Final Processing & Storing
Here we will remove the first column, which now holds only NaN values, and replace every “–” with “0”. Here is the code.
# round 4 processing - removing the 1st column, which holds only nulls
df.reset_index(drop=True, inplace=True)
df = df.iloc[:, 1:]
# round 5 processing - replacing every '–' with 0
df = df.replace(to_replace=r'–', value=0, regex=True)
# storing the file as csv
df.to_csv("final.csv", index=False, header=False)
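As a final check that the file round-trips cleanly (reading it back with header=None, since the CSV was written without a header row):

# reading the cleaned file back for a quick eyeball
import pandas as pd

final = pd.read_csv("final.csv", header=None)
print(final.head())  # from here the frame can go to any plotting library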
The final output data looks something like this.
The file is now ready to be used for the purpose of visualization. Hope you found this useful.