Python — Heatmap of Indian Companies’ Overseas Investment

5 min readFeb 19, 2023

RBI regularly publishes data on Indian companies' overseas investments. Fortunately, this is in a format that can be easily scrapped and aggregated in a single CSV file using python. But the problem is that RBI does not follow a proper naming convention for uniquely identifying the companies for which this data is displayed, thereby resulting in huge problems if one tries to draw a visualization for checking the trend. The purpose of this note is to discuss how to use python for curating the names of these companies and use this time series data to draw a heatmap for checking the yearly investments of the top 20 companies.

Data Preparation

In this note, I will be skipping the part about how this data can be automatically scraped from the website, but focus on how the names of the companies investing overseas can be harmonized for the purpose of plotting the heatmap. Here is the code for loading the data from the disk into a DataFrame.

#Importing libraries 
import os
import pandas as pd
from nltk.corpus import stopwords
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

#Changing Directory
os.chdir(
"Path of the csv file in the disk"
)  # working dir
print(os.getcwd())

#Reading file into a DataFrame
df = pd.read_csv("DataR_RBI_OverseasInv_OUT.csv", header = None)

#checking number of rows
df.iloc[:,1].values.shape

The portion of the raw Dataframe looks something like this.

Now column number 1 lists the names of all the individual companies making the investments and column number 9 captures the value of the total investments. As mentioned above we will use python to harmonize the names of the companies listed in column number 1, so that the misspelled names are not treated differently in the output chart.

Harmonizing Names (Col №1)

Here is the code used for that purpose.

#Checking for illegal characters in names of companies and removing them  
import re
listl=[]
for line in df.iloc[:,1].values:
    #Replace all characters with a blank that is not in the list of reg expression
    line = re.sub('[^0-9a-zA-Z]+', ' ',line)
    line = line.split()
    line = " ".join(line)
    line = line.lower()
    line = line.title()
    #Replace some words with abbreviations for harmonizing names of companies  
    line = line.replace("Private","Pvt")
    line = line.replace("Limited","Ltd")
    line = line.replace("Technologies","Tech")
    listl.append(line.strip())
df["New_Des"]=listl

This code removes all illegal characters from the names and replaces some common words with their abbreviations for the purpose of identifying listed names uniquely. It also removes excess blank characters from between the words and structures them in proper cases.

The impact of this curation can be easily ascertained by measuring the number of unique values before and after the process. This is embedded in the following picture.

Figure 2 — No of Unique Companies (Before & After)

One can see that after going through the curation process, the list of unique companies who have made overseas investments within the window period of 2007 & 2023 has reduced from 17920 to 17306.

Saving the DataFrame

After processing, we have to not only save the dataframe in the disk but also convert the data column in datatype and extract the “Calendar Year” from it into a new column of the DataFrame. The code used for this purpose is listed below.

#replacing the original names with the newly processed ones
df.iloc[:,1]=df.iloc[:,-1]
#removing the last column from the dataframe
df =df.iloc[:,:-1]

writeheader = ["SNo","Company", "JV/WOS", "Whether JV/WOS", "Overseas Country",
               "Major Activity","Equity","Loan","Guarantee","Total","Date","URL"]

#storing the processed file in the disk as "final.csv"
df.to_csv("final.csv", index =False, header = writeheader)

#reading the file and converting the date column to datetime
df=pd.read_csv("final.csv")
df['Date'] = pd.to_datetime(df['Date'])
#adding a new column to list the calender year
df['Year'] = (df['Date']).dt.to_period('A')

The snapshot of the processed dataframe is embedded in the following picture.

Preparing Dataframe (for heatmap)

The following is the code used to serve this purpose.

#Preparing the final dataframe for heatmap
df_summary = df.groupby(["Year","Company"])["Total"].sum().unstack().T
df_summary["GTotal"]=df_summary.sum(axis=1)
df_final = (df_summary.sort_values("GTotal",ascending=False)/1000).round(2)

Now in this part, we are grouping the dataframe on “Year” and “Company” by aggregating the “Total” field anchored around this field. Since in a single calendar year, multiple companies are involved in making investments, and therefore to get the year on the x-axis and companies on the y-axis we have to unstack the dataframe and transpose it, so that it is in palatable format for drawing the heatmap. Here is what the final dataframe looks like.

Note — A new column “GTotal” has been added, which is nothing but the sum of all values in a single row that represents the total investments that a specific company has made over the years (since 2007)

Drawing the Heatmap

Here is the code used for this purpose.

#Preparing & Plotting the heatmap chart
fig, ax = plt.subplots(1,1)
plt.tick_params(axis='both', which='major', labelsize=10, labelbottom = False, bottom=False, 
                top = False, labeltop=True)
sns.set(rc={'figure.figsize':(20,10)})
sns.heatmap(df_final.head(20), linewidths=1, linecolor='black',cmap="flare",annot=True)
ax.set(xlabel=None)
ax.set(ylabel=None)
plt.tight_layout()
plt.show()

This process is fairly simple once the dataframe is curated for this purpose. The final heat map for the top 20 companies is sorted by the last column (GTotal) looks somewhat like this.

Note values embedded in each cell indicate the value of the overseas investments these individual companies have made in each calendar year and are measured in $ billion USD. A similar heatmap can be created for other dimensions too.

Tableau Output

The same heatmap can be created using tableau too. But the tool is laced with some levels of inflexibility. However, the main advantage is that it doesn’t require us to develop coding skills which are more onerous. The output from this tool which I use to drill down into the RBI’s overseas investment database is embedded here. The screenshot of the output is described in the figure below.

Note — This dashboard captures the values of the “company groups” vs “individual companies” which is captured in the heatmap above. Hope you found this note useful.

(I am aggregating all the articles on this topic here, for easy discovery and reference.)