Python — Processing GOI’s Accounting Data for Visualization

a) To extract incremental values from the cumulative and b) To calculate monthly progress as a percentage of budgetary estimate

Parag Kar
4 min readFeb 17, 2023

Every year, the GOI comes out with its annual budget. The progress of which is monitored by the Controller General of Accounts on a monthly basis. While doing so the CGA monitors the cumulative progress of various items stipulated in the budget. Now, this needs to be not only converted into incremental values but also as a percentage of budgetary estimate for us to be able to measure progress. The challenge is that — if we use excel to make such calculations then manual intervention becomes mandatory, each time the data updates. This makes the “process of updation” onerous and prone to errors. Hence, we have two options — a) using the “calculated field” option of Tableau to drive that objective — very difficult due to the large number of columns involved; b) doing the processing using “python” — which is easy, scalable, and flexible. On this note, we will understand how simple code written in python can accomplish this objective very easily.

Raw Data

The raw data extracted from the CGA website looks somewhat like this.

Figure 1 — Raw Data

Note this data has 42 columns to process. And will need 21+21 columns to store the processed data — making Excel and Tableau very difficult and not seamless.

Processing Data (Calculating Increments)

The following is the python code that we will use to process this data.

import pandas as pd
import os
from collections import defaultdict

pd.set_option('display.max_columns', 500)

#changing dir
os.chdir("Enter Dir Path") #working dir path
print(os.getcwd())

#loading the raw csv file to a dataframe
df = pd.read_csv("T01_CGA_GovtAccounts.csv")

#1. Converting Cumluvative Values into Incremental and storing in a new column

#selecting the columns of the dataframe to process
columns_to_process = [x for x in df.columns if x.endswith('-Cum')]
#creating temp copy of the dataframe
tdf = df.copy()

#initializing a dictionary for storing the list of columns to process
matched_columns = defaultdict(list)

#extracting the column name prefix to be used as a key for the dictionary
#all items of this dictionary are column names to process and are
#mapped to the prefix which is the key of the dictionary
for x in columns_to_process:
prefix = x[:x.find('-')]
matched_columns[prefix].append(x)

#processing the cumulative columns data into incremental values

for prefix, cols in matched_columns.items():
df1 = tdf[["Month", cols[0]]]
df1["Month"].astype("int")

#The logic used is if month value is 4, then do nothing, else
#Incremental value is diff between the current and previous values of DF
list_inc=[]
for i, value in enumerate(df1.values):
if value[0] !=4:
inc = df1.values[i][1]-df1.values[i-1][1]
if value[0] ==4:
inc = df1.values[i][1]
print(df1.values[i][0]," ",df1.values[i][1]," ", inc)

list_inc.append(inc)
#storing the increments into a new column with a suffix "INC"
tdf[prefix + '-INC'] = list_inc

After you run this code the new columns that get created with populated data are embedded in the following picture.

Figure 2 — New Columns With Polulated Inc Values

The output DataFrame populated with incremental values will look somewhat as described below.

Figure 3 — Output Dataframe with Incremental Values

Processing Data (Calculating %Cum of BE)

The following is the python code that we will use to process this data.

#2. Calculating the Cumulative Values as a percentage of Budgetary Estimate
#selecting all columns with suffix "BE" and "Cum" in the DataFrame
columns_to_process = [x for x in df.columns if x.endswith('-BE') or x.endswith('-Cum')]
tdf = df.copy()
matched_columns = defaultdict(list)
for x in columns_to_process:
prefix = x[:x.find('-')]
matched_columns[prefix].append(x)
#the process till this line is same as above

#storing the calculated values in new columns with suffic "Perc"
for prefix, cols in matched_columns.items():
print(prefix, cols)
tdf[prefix + '-Perc'] = tdf[cols[1]].values.astype(float) / tdf[cols[0]].values.astype(float)

After you run this code the new columns that get created with populated data are embedded in the following picture.

Figure 4 — New Columns With Polulated % Cum of BE

The output DataFrame populated with the values will look somewhat as described below.

Figure 5— Output Dataframe with % Cum of BE

Final Visualized Output

The final visualized output with cumulative values is embedded in the picture below.

Figure 6 — The Final Output

Note — In order to preserve the scalability of our output and to prevent any further manual intervention, it is important to keep the input data in a CSV format — deprived of any static formulas. All processing for the purpose of visualization has to happen either directly in the visualization tool. Hope you found this useful. Thanks for reading.

(I am aggregating all the articles on this topic here, for easy discovery and reference.)

--

--

Parag Kar
Parag Kar

Written by Parag Kar

EX Vice President, Government Affairs, India and South Asia at QUALCOMM

No responses yet