Curating Data For Processing

3 min readNov 5, 2022

We all know that raw data is useless as it doesn’t tell a story. For the data to tell a story we need to process it. But raw data can’t be processed easily as it is unstructured. Processing it manually will be a lengthy and time-consuming process. Hence, we need tools. However, the standard tools might not work as every raw data is differently structured. The purpose of this note is to explain how simple coding can help considerably speed up the process. Please note, I am not an expert on this subject. I am just taking this opportunity to share what I have learned to date or am in the process of learning.

Problem

Let’s say you plan to aggregate the Capital and Revenue expenditure of the Government broken down by ministries for the past three years from the following site. https://cga.nic.in/index.aspx. The screenshot of a portion of this data is shown in the picture below.

The most simple way to process this data is to copy the table into the clipboard, and then paste it into an excel sheet and then manually process it further. You can visualize the quantum of time it will take, given the fact that every month of this table is spread into hundreds of lines.

Solution

Writing Code

In order to curate this data we need to write a few lines of code. This is embedded below.

import re, pyperclip, csv# Create a parsing content of websiteCGARegex = re.compile(r'''
((\s*?[a-zA-Z\s,',()&-/.]+\s+?)
([+-]?[0-9.]+\s+)
([+-]?[0-9.]+\s+)
)
''',re.VERBOSE)# Get the text off the clipboardtext = pyperclip.paste()# Process text with regular expressionextractedText = CGARegex.findall(text)# Curate the process texttemp=""
for line in extractedText:
    line = line[0].replace("\n"," ")
    line = line.strip()
    Index = re.search(r"\d", line)
    Index = Index.start()
    key = line[:Index].strip()
    value = line[Index:].strip()
    if key not in ["Revenue","Capital"]:
        temp = key+";" # Identifying key and seperator ";"
    if key in ["Revenue","Capital"]:
        combline = temp+key+";"+value
        print(combline)

Before running this code, we need to select the text using Cmd+a (Mac). Then copy the text into the clipboard using Cmd+c. Then run this code.

Program Output

The output of this program is shown in the embedded chart.

Notice that the output of the code is now in a form it is easy to process. Why? Each line starts with the department or ministry, which is separated from the “Revenue” or “Capital” tags through the semicolon “;”, and the data that we have to process is separated by a blank “ “. These are structured separators — which can be processed by reading this data into a file and processing it line by line, or, past this dump into excel and then processing it using the text-to-column feature.

Resolving Issues

Now normally the problem would have been solved after completing the above step, but we aren’t that fortunate. The guys who maintain this data on the website do not follow any hygiene of ensuring the same naming convention for all ministries for the past months. Hence, as you process the past months you will find that the names of some of the ministries won’t match. This will make your life difficult. Hence you have to further work to map these names to a list — which you need to update after encountering the mismatches. Having done that the data will be available for processing. See the screenshot under.

Now, this data can be processed easily using any of the shelf tools, like Tableau or R. I Hope you found this information useful. Thanks.