In this note, I discuss how to curate a data table extracted from the RBI’s website so that it can be used for visualization. The idea is to keep the process simple by laying down the principles used, so that they can easily be tweaked (if need be) in other scenarios.
Step 1 — Downloading Raw Data
The following is a snapshot of the raw data I downloaded from the RBI’s website.
One can clearly see that it is impossible to process the data in this form. Hence we have to clean it so that the headings are extracted into separate columns.
Step 2 — Identifying Patterns
One can clearly see that all the text in the first column starts with a number in a certain format: 1, 1.1, 1.1.1, 1.2.1, and so on. The idea is to pick up this pattern with a regular expression, so that the notes at the bottom of the table get deleted automatically. The code for this is reproduced below.
import re
import pandas as pd

file = "raw_data.xlsx"  # file to process

# regular expression of the pattern that we want to include
# (row headings numbered 1, 1.1, 1.1.1, and so on)
pattern1 = re.compile(r'''
(
^(\d\s?\.?\d?\.?\d?\.?\d?)
)
''', re.VERBOSE)

# regular expression of the pattern that we want to exclude
pattern2 = re.compile(r'''
(
:
)
''', re.VERBOSE)

# round 1 processing
df = pd.read_excel(file)
for i, value in enumerate(df.values):
    matches1 = re.findall(pattern1, str(value[1]))
    matches2 = re.findall(pattern2, str(value[1]))
    if not len(matches1) > 0 or len(matches2) > 0:
        # deleting all rows with patterns other than those identified above
        df = df.drop(i)
df.reset_index(drop=True, inplace=True)  # resetting the row index of the df
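To see what the two patterns do together, here is a quick sanity check on a few sample strings (the headings below are invented for illustration, not actual RBI data):

# quick sanity check of the include/exclude patterns on invented samples
samples = ["1 Capital", "1.1 Paid-up Capital", "Note: figures are provisional"]
for s in samples:
    keep = len(re.findall(pattern1, s)) > 0 and len(re.findall(pattern2, s)) == 0
    print(s, "->", "keep" if keep else "drop")
# the first two are kept; the "Note: ..." line is dropped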
The output looks something like this.
One can see that the bottom part of the data frame has been completely cleaned of garbage.
Step 3 — Extracting the Main & Sub Categories
The following is the code used for extracting the main and subcategories.
# round 2 processing
# regex pattern to identify the main category (e.g. "1 ")
main_cat = re.compile(r'''
(
^\d\s
)
''', re.VERBOSE)

# regex pattern to identify sub category 1 (e.g. "1.1 ")
sub_cat1 = re.compile(r'''
(
^(\d\.\d\s)
)
''', re.VERBOSE)

# regex pattern to identify sub category 2 (e.g. "1.1.1 ")
sub_cat2 = re.compile(r'''
(
^(\d\.\d\.\d\s)
)
''', re.VERBOSE)

# initialization of variables
maincat = ""
subcat1 = ""
subcat2 = ""
for i, value in enumerate(df.values):
    matches_maincat = re.findall(main_cat, str(value[1]))
    if len(matches_maincat) > 0:
        maincat = value[1]
    df.loc[i, "maincat"] = str(maincat)  # storing the main category in a separate col
    matches_subcat1 = re.findall(sub_cat1, str(value[1]))
    if len(matches_subcat1) > 0:
        subcat1 = value[1]
    df.loc[i, "subcat1"] = str(subcat1)  # storing sub category 1 in a separate col
    matches_subcat2 = re.findall(sub_cat2, str(value[1]))
    if len(matches_subcat2) > 0:
        subcat2 = value[1]
    df.loc[i, "subcat2"] = str(subcat2)  # storing sub category 2 in a separate col
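This loop is essentially a forward-fill of the last category seen. For reference, a more pandas-native sketch of the same idea (assuming, as above, that the description text sits in the second column) could look like this:

# pandas-native sketch of the same forward-fill idea
# (note: rows before the first match get NaN rather than "")
desc = df.iloc[:, 1].astype(str)
df["maincat"] = desc.where(desc.str.match(r"\d\s")).ffill()
df["subcat1"] = desc.where(desc.str.match(r"\d\.\d\s")).ffill()
df["subcat2"] = desc.where(desc.str.match(r"\d\.\d\.\d\s")).ffill()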
The output now looks something like this.
One can see that three new columns have been created, with the main and sub categories extracted from the first column.
Step 4 — Deleting All Rows with “Sub Totals”
You can see that the column listing the items also contains rows whose descriptions carry the text “Total”. We have to pick out these rows and filter them from the data. The following is a simple piece of code that can be used for this purpose.
# removing all rows whose text in column 1 contains "Total"
df.reset_index(drop=True, inplace=True)
filt = df.iloc[:, 1].str.contains("Total", na=False)  # na=False guards against NaN cells
df = df[~filt]
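As a quick illustration of the filter on a made-up frame (the item names below are invented):

# toy demonstration of the "Total" filter
import pandas as pd

toy = pd.DataFrame({"item": ["1.1 Demand Deposits", "Total Deposits", "1.2 Time Deposits"]})
print(toy[~toy["item"].str.contains("Total", na=False)])
# only the "1.1 ..." and "1.2 ..." rows survive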
Step 5 — Removing Rows with “NaN”
In the data above, you can clearly see many rows that are all “NaN”. These are simply not required for the purpose of visualization. Here is the code that can be used to filter out these lines.
# round 3 processing - removing all rows with null data
# (pattern_nul is the printed form of a 7-element row that is all NaN)
pattern_nul = "[nan nan nan nan nan nan nan]"
df.reset_index(drop=True, inplace=True)
for i, line in enumerate(df.iloc[:, [2, 3, 4, 5, 6, 7, 8]].values):
    line = str(line)
    if line == pattern_nul:  # checking for the null pattern
        df.drop(i, inplace=True)
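Matching on the printed form of the row works, but it is brittle: it depends on numpy’s array formatting and on there being exactly seven value columns. A more robust sketch of the same step uses dropna:

# equivalent, more robust version: drop rows in which all seven
# value columns (positions 2 to 8) are NaN
value_cols = df.columns[2:9]
df = df.dropna(subset=value_cols, how="all").reset_index(drop=True)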
The output of the above processing is embedded in the following picture.
Step 6 — Final Processing & Storing
Here we will remove the first column, which now holds only NaN values, and replace every “–” with “0”. Here is the code.
# round 4 processing - removing the 1st column, which holds only nulls
df.reset_index(drop=True, inplace=True)
df = df.iloc[:, 1:]
# round 5 processing - replacing every '–' with 0
df = df.replace(to_replace=r'–', value=0, regex=True)
# storing the file as csv
df.to_csv("final.csv", index=False, header=False)
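As a final check that the file round-trips cleanly (reading it back with header=None, since the CSV was written without a header row):

# reading the cleaned file back for a quick eyeball
import pandas as pd

final = pd.read_csv("final.csv", header=None)
print(final.head())  # from here the frame can go to any plotting library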
The final output data looks something like this.
The file is now ready to be used for the purpose of visualization. Hope you found this useful.