Python — Curating RBI’s Payment Settlement Data for Visualization
India’s payment system has undergone a revolution. The success has been so huge that it has become a poster child for India’s digitization, especially at the international level. India’s UPI (Unified Payments Interface) has fundamentally transformed the payment system in the country, so much so that many companies have started pricing their goods and services at odd amounts (Rs 457, for instance), which was earlier unthinkable. To let us review the progress of these payment systems, RBI publishes data on a daily basis, embedded in Excel sheets. These sheets are not only split across different tabs but are also difficult to process for visualization in their raw form. The purpose of this note is to describe how we can not only clean this data easily using Python, but also automate the process of updating and feeding it into the visualization tool without any further manual intervention.
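As an aside on the automation part: once the cleaning script described below is in place, refreshing the raw file can itself be scripted so that no manual download is needed. The following is a minimal sketch using the requests library, assuming you have copied the direct link to the Excel file from the RBI page (the URL below is a placeholder, not the real link).
import requests
# Placeholder URL -- replace with the actual link to the "Settlement Data of
# Payment System" Excel file copied from the RBI page.
url = "https://example.com/settlement-data.xlsx"
local_file = "PSDDP04062020.xlsx"  # the file name used in the cleaning code below
response = requests.get(url, timeout=60)
response.raise_for_status()  # stop here if the download failed
with open(local_file, "wb") as f:
    f.write(response.content)
print("Saved the latest settlement data to", local_file)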
Raw Data
The raw data can be downloaded from the RBI’s page. The link to the file is embedded here — “Settlement Data of Payment System”. A screenshot of this Excel file is below.
One can clearly see that this sheet has data in merged cells and spread across different tabs, which makes it totally unsuitable for direct visualization.
Step 1 — Aggregating All Sheets into One
The first step is to aggregate all the data spread out over multiple sheets into a single data frame. We can do this easily using the following code.
import pandas as pd
from tqdm import tqdm  # progress bar for the sheet-processing loop
file = "PSDDP04062020.xlsx"
xls = pd.ExcelFile(file)  # load the multi-sheet Excel file
sheets = xls.sheet_names  # store all the sheet names in a list called "sheets"
process_from = 'October 2021'  # choose the sheet (month) from which to start aggregating
# find the index at which to split the list of sheets, keeping only the ones we want to process
process_from_idx = [idx for idx, sheet in enumerate(sheets) if sheet == process_from][0]
# keep only the sheets from that point onwards
sheets = sheets[process_from_idx:]
print(sheets)
# initialise an empty dataframe to collect the aggregated data
df = pd.DataFrame()
# loop over the selected sheets and append each one to the aggregate dataframe
for sheet in tqdm(sheets):
    df1 = pd.read_excel(file, sheet_name=sheet, header=None)
    df = pd.concat([df, df1])
df = df.astype({0: str})  # change the data type of the first column to string
The screenshot of the output data frame is embedded in the following picture.
You can see that this output is quite cluttered with a lot of nulls and garbage. But for now, we will focus on column №1 (marked in red).
Step 2 — Deleting Rows with Null in 1st Column
The idea is to keep only those rows that have valid dates. This way we can remove all the garbage (rows with headers and remarks) from the data frame. The code used for this purpose is below.
# convert the first column of the dataframe to datetime (invalid entries become NaT)
datetimes = pd.to_datetime(df.iloc[:, 0], errors='coerce')
# keep only the rows with a valid date (i.e. filter out all the garbage rows)
df = df.loc[datetimes.notnull()]
# reset the index of the remaining rows after filtering
df.reset_index(drop=True, inplace=True)
# re-parse the dates of the remaining rows into a series called "datetimes"
datetimes = pd.to_datetime(df.iloc[:, 0], errors='coerce')
# replace the original date strings with the parsed datetimes
df.iloc[:, 0] = datetimes
# replace all cells marked as holidays ("h") with the value 0
df = df.replace(to_replace="h", value=0)
The screenshot of the output data frame is embedded in the following picture.
Step 3 — Processing Headers & Storing to Disk
This is the final step. Here we would like the code to automatically detect all the headers of the data and link them to the individual columns for final storage and processing. The code used for this purpose is below.
# processing for dataframe headers
# df1 still holds the raw contents of the last sheet read in Step 1;
# its rows 4-6 contain the three levels of column headers
# processing for header No. 1
writeheader1 = df1.iloc[4, 1:].ffill()  # forward-fill the merged header cells
writeheader1 = [str(item) + ";" for item in writeheader1]
# processing for header No. 2
writeheader2 = df1.iloc[5, 1:].ffill()
writeheader2 = [str(item) + ";" for item in writeheader2]
# processing for header No. 3
writeheader3 = df1.iloc[6, 1:]
# concatenating the corresponding items of the three lists above into one list "writeheader"
writeheader = [str(i) + str(j) + str(k) for i, j, k in zip(writeheader1, writeheader2, writeheader3)]
# cleaning the list writeheader of garbage
for i in range(len(writeheader)):
    writeheader[i] = writeheader[i].replace("nan", "")
    writeheader[i] = writeheader[i].replace(" ", "")
    writeheader[i] = writeheader[i].replace("-", "")
# prepending the Date column header as the first item in the list "writeheader"
writeheader.insert(0, "Date")
# storing the final dataframe on disk as the output file
df.to_csv("outputfile.csv", index=False, header=writeheader)
The screenshot of a portion of the final processed file is embedded in the following picture.
Note that all the headers (main & sub-category) have been concatenated using a separator “;”. This has been done so that you can split the headers easily in the processing tool for final visualization.
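If you do any downstream processing in pandas rather than in the visualization tool itself, a minimal sketch of splitting the combined headers back into their individual levels could look like this (the header name shown in the comment is only illustrative).
import pandas as pd
# read the cleaned file produced in Step 3
out = pd.read_csv("outputfile.csv", parse_dates=["Date"])
# each combined header, e.g. something like "RTGS;CustomerTransactions;Volume",
# splits on ";" into its main category and sub-categories
levels = {col: col.split(";") for col in out.columns if col != "Date"}
for col, parts in levels.items():
    print(col, "->", parts)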
Visualized Dashboard
The following is how the visualized dashboard looks when this data is processed using Tableau. The live link to this dashboard is posted on my website here.
Hope you found this useful. Will keep posting more such tips as and when possible.
(I am aggregating all the articles on this topic here, for easy discovery and reference.)