The purpose of this note is to build a multivariable model that can make a reasonable prediction of India’s nominal GDP, using the linear regression tools in Python’s “sklearn” (scikit-learn) library. The value comes from the fact that the library can train the model on the historical data currently available in various dimensions, such as the USD to Rupee rate, nominal GDP in Rs Cr, and the deflator (Nominal GDP/Real GDP). Once trained, the model can make predictions from user-defined inputs, which helps us see where we are heading and which of the dimensions need to be tweaked to make the maximum impact on our GDP growth objectives.
Basic Concept
The basic concept used in this note isn’t very complex, but writing conventional code to drive the optimization can be challenging, as will become clear below. Linear regression fits a line by minimizing a “cost function”: the mean of the squared vertical distances between the points in the scatter plot and the line we are trying to fit. The following picture lays out the fundamental concept.
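To make the cost function concrete, here is a minimal sketch that evaluates it for two candidate lines. The helper name `mse_cost` and the sample points are illustrative assumptions, not part of the original analysis.

```python
import numpy as np

def mse_cost(m, c, x, y):
    """Mean of the squared vertical distances between the points and y = m*x + c."""
    y_line = m * x + c
    return np.mean((y - y_line) ** 2)

# Illustrative points that lie exactly on y = 2x + 3.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2 * x + 3

cost_exact = mse_cost(2.0, 3.0, x, y)  # the true line: no error
cost_off = mse_cost(1.0, 3.0, x, y)    # slope too small: errors of 1, 2, 3, 4

print(cost_exact, cost_off)  # 0.0 7.5
```

The closer the candidate line is to the points, the smaller the cost, which is why the fit is framed as a minimization.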
One can see that the challenge is that the values of m and c (from the equation of the line, y = mx + c) have to change in each iteration in such a way that the “cost function” converges. Otherwise the loop diverges and the optimization fails. The Python code used for this purpose is listed below (it names the intercept b rather than c).
import math
import numpy as np

def gradient_descent(x, y):
    m_curr = b_curr = 0
    iterations = 10000
    n = len(x)
    learning_rate = 0.001
    cost_eof_loop = 0  # cost from the previous iteration
    for i in range(iterations):
        y_predicted = m_curr * x + b_curr
        cost = (1 / n) * sum([val ** 2 for val in (y - y_predicted)])
        # partial derivatives of the cost with respect to m and b
        md = -(2 / n) * sum(x * (y - y_predicted))
        bd = -(2 / n) * sum(y - y_predicted)
        m_curr = m_curr - learning_rate * md
        b_curr = b_curr - learning_rate * bd
        print("m {}, b {}, cost {}, iteration {}".format(m_curr, b_curr, cost, i))
        # stop once the cost no longer changes between iterations
        flag = math.isclose(cost_eof_loop, cost, rel_tol=1e-09, abs_tol=0.0)
        cost_eof_loop = cost
        if flag:
            break

x = np.array([1, 2, 3, 4, 5, 6])     # training data 1
y = np.array([5, 7, 9, 11, 13, 14])  # training data 2
gradient_descent(x, y)
If you run this code with a learning rate of 0.001 and 10000 iterations, you can see the cost function converging slowly.
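The sensitivity to the learning rate can be checked directly. The sketch below is a compact variant of the loop above and compares its result with the closed-form least-squares line from `np.polyfit`. The function name `run_gradient_descent` and the alternative learning rates 0.05 and 0.1 are assumptions chosen for illustration.

```python
import numpy as np

def run_gradient_descent(x, y, learning_rate, iterations):
    """Plain gradient descent on the mean squared error of y = m*x + b."""
    m = b = 0.0
    n = len(x)
    for _ in range(iterations):
        residual = y - (m * x + b)
        m -= learning_rate * (-(2 / n) * np.sum(x * residual))
        b -= learning_rate * (-(2 / n) * np.sum(residual))
    cost = np.mean((y - (m * x + b)) ** 2)
    return m, b, cost

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([5, 7, 9, 11, 13, 14], dtype=float)

# Closed-form least-squares line, used as the reference answer.
m_ref, b_ref = np.polyfit(x, y, 1)

m_slow, b_slow, cost_slow = run_gradient_descent(x, y, 0.001, 10000)
m_fast, b_fast, cost_fast = run_gradient_descent(x, y, 0.05, 10000)
# Only 50 iterations here, because the cost grows very quickly at this rate.
_, _, cost_diverged = run_gradient_descent(x, y, 0.1, 50)

print("reference  ", m_ref, b_ref)
print("rate 0.001 ", m_slow, b_slow, cost_slow)
print("rate 0.05  ", m_fast, b_fast, cost_fast)
print("rate 0.1   ", cost_diverged)
```

For this data, a rate of 0.001 is still slightly above the minimum cost after 10000 iterations, 0.05 reaches the least-squares line, and 0.1 makes the cost grow instead of shrink.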
Forecasting India GDP
Having gone through the basic concept, let’s first fit each dimension of Indian GDP on its own to check whether a linear regression model suits it. Once that is established, we will use these dimensions to train the “sklearn” model to make future predictions.
Current or Nominal GDP
The following is the code that is used for making predictions of Indian Nominal GDP.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
#training data module
#GDP_Curr is a dataframe of quarterly nominal GDP, assumed to be loaded earlier (loading code not shown)
reg_GDP_Curr = LinearRegression()
GDP_Curr["Index"]=GDP_Curr.index
reg_GDP_Curr.fit(GDP_Curr[["Index"]],GDP_Curr["GDP_Curr"])
#computing the score (R^2), which measures how well the line fits the training data
score = reg_GDP_Curr.score(GDP_Curr[["Index"]],GDP_Curr["GDP_Curr"])
print(score)
#making predictions with the trained model
df_data = pd.read_csv("predict_qtr.csv")
#indexes from the file above are used as input for predictions
#the indexes simply reflect the dates of the time series data
#the column name is kept so that the input matches the training data
data = df_data[["Index"]]
dates = df_data["Date"]
predicted_gdp_curr = reg_GDP_Curr.predict(data)
#plotting the predictions on current data
fig, ax = plt.subplots(figsize=(8,5))
ax.grid()
x= dates
y= predicted_gdp_curr
plt.plot(x,y)
x1 = GDP_Curr.index
y1 = GDP_Curr.GDP_Curr
plt.scatter(x1,y1,s =10,color = "blue")
plt.xticks(np.arange(0,85,10))
plt.title("Forecast Analysis (Qtrly GDP_Curr RsCr x10^6) Using Linear Regression Model")
plt.ylabel("GDP Current")
plt.tight_layout()
plt.show()
#storing output in a new dataframe
GDP_Curr_P = pd.DataFrame()
GDP_Curr_P["Index"]=df_data["Index"]
GDP_Curr_P["Date"]=df_data["Date"]
GDP_Curr_P["GDP_Curr"]=predicted_gdp_curr
The output of the above code is shown in the following figure.
The score is 0.9403951277253917. This is the R² (coefficient of determination) returned by the model’s score method, so the fitted line explains about 94% of the variance in the training data, which is a fairly good fit.
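To show what the score measures, the sketch below recomputes it by hand and compares it with the value returned by `LinearRegression.score`. The series is synthetic (a made-up trend plus noise), not the GDP data, and the column name `Value` is an assumption for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic quarterly series (NOT real GDP data): a linear trend plus noise.
rng = np.random.default_rng(0)
df = pd.DataFrame({"Index": np.arange(40)})
df["Value"] = 100 + 5 * df["Index"] + rng.normal(0, 10, size=len(df))

reg = LinearRegression().fit(df[["Index"]], df["Value"])
score = reg.score(df[["Index"]], df["Value"])

# R^2 by hand: 1 - (residual sum of squares / total sum of squares).
predicted = reg.predict(df[["Index"]])
ss_res = np.sum((df["Value"] - predicted) ** 2)
ss_tot = np.sum((df["Value"] - df["Value"].mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(score, r2_manual)
```

Because the score is computed on the same data the model was trained on, it describes how well the line fits the history rather than how accurate the forecast will be.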
Rupee to USD Rate
The code used to train the model for predicting the Rupee to USD rate is similar, so I am omitting it to avoid repetition. Let’s go straight to the output, which is shown in the following figure.
The score (R²) is 0.9292632306815055, meaning the line explains about 93% of the variance in the training data.
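Since the same steps repeat for every dimension, they could be wrapped in a small helper. The sketch below is one possible way to do that, not the code used for the figures: the function name `fit_trend` is hypothetical and the numbers are made-up stand-ins for the real series.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def fit_trend(df, column):
    """Fit column ~ Index with a straight line; return the model and its R^2 score."""
    model = LinearRegression().fit(df[["Index"]], df[column])
    return model, model.score(df[["Index"]], df[column])

# Synthetic stand-in for the historical frame (NOT real data).
history = pd.DataFrame({"Index": np.arange(48)})
history["USD"] = 45 + 0.6 * history["Index"]        # made-up exchange-rate trend
history["GDP_Def"] = 1.0 + 0.01 * history["Index"]  # made-up deflator trend

# Future quarters to predict, continuing the same index.
future = pd.DataFrame({"Index": np.arange(48, 56)})

for column in ["USD", "GDP_Def"]:
    model, score = fit_trend(history, column)
    future[column] = model.predict(future[["Index"]])
    print(column, round(score, 4))

print(future.head())
```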
GDP Deflator (Nominal GDP/ Real GDP)
Here too the training code is similar, and the output is shown in the following figure.
The score (R²) is 0.9329293262771879, meaning the line explains about 93% of the variance in the training data.
Nominal GDP Value in USD
This dimension is also important because it is the target: the final multivariable model is trained against the calculated value of nominal GDP in USD. Here is the output of the single-variable model for it.
Note that the actual values are calculated using the average USD rate of the quarter in question. Also note that the score is 0.8832091567867628, so the single-variable fit is only approximate. Hence the need to train a multivariable model to improve accuracy.
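As a worked example of how an actual value can be derived from a quarterly average rate, the sketch below converts a GDP figure in Rs crore into USD billion. It assumes the GDP is held in Rs crore and the rate in rupees per USD. The function name and the input numbers are illustrative only.

```python
RUPEES_PER_CRORE = 1e7

def gdp_rs_crore_to_usd_billion(gdp_rs_crore, rupees_per_usd):
    """Convert a GDP figure in Rs crore to USD billion at the given exchange rate."""
    gdp_rupees = gdp_rs_crore * RUPEES_PER_CRORE
    return gdp_rupees / rupees_per_usd / 1e9

# Illustrative inputs only: 7,000,000 Rs crore in a quarter, average rate of 80 Rs per USD.
quarter_usd_billion = gdp_rs_crore_to_usd_billion(7_000_000, 80.0)
print(quarter_usd_billion)  # 875.0
```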
GDP in USD Using Multivariable Model
A portion of the inputs that will go into this model for training it is captured in the data frame below.
The code used to train the model and to process the final output is listed below.
#training dataframe for forecasting
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
#GDP_USD and GDP_Def are dataframes assumed to be prepared earlier (not shown)
reg_GDP_USD_m = LinearRegression()
GDP_USD["GDP_Def"]=GDP_Def["GDP_Def"]
x = GDP_USD[["USD","GDP_Curr","Index","GDP_Def"]]
y = GDP_USD["GDP_USD"]
reg_GDP_USD_m.fit(x,y)
score = reg_GDP_USD_m.score(x,y)
print(score)
#preparing dataframe for forecast
GDP_USD_MulP = pd.DataFrame()
GDP_USD_MulP["Index"] = USD_Rate_P["Index"]
GDP_USD_MulP["Date"] = USD_Rate_P["Date"]
GDP_USD_MulP["USD"] = USD_Rate_P["USD"]
GDP_USD_MulP["GDP_Curr"]=GDP_Curr_P["GDP_Curr"]
GDP_USD_MulP["GDP_Def"]=GDP_Def_P["GDP_Def"]
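#splitting dataframe to include only values for forecasting
#.copy() gives an independent frame so that the column assignments below do not trigger a SettingWithCopyWarning
GDP_USD_MulP = GDP_USD_MulP.iloc[48:,:].copy()
GDP_USD_MulP['Date']= pd.to_datetime(GDP_USD_MulP['Date'])
#forecasting using predicted data from individual models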
x = GDP_USD_MulP[["USD","GDP_Curr","Index","GDP_Def"]]
predicted_gdp_usd_mp = reg_GDP_USD_m.predict(x)
GDP_USD_MulP["GDP_USD"]=predicted_gdp_usd_mp
#storing predicted dataframe in df_comb
df_predicted = GDP_USD_MulP
df_exact =GDP_USD.iloc[:,[6,2,1,3,7,5]]
df_comb = pd.concat([df_exact,df_predicted],axis=0)
df_comb.to_csv("gdp_comb.csv", index=False)
#storing prediction model file in the disk as a pickle document
import pickle
with open("gdp_reg_pickle","wb") as file:
pickle.dump(reg_GDP_USD_m,file)
#plotting dataframe to show predicted qtrly nominal GDP in USD
fig, ax = plt.subplots(figsize=(8,5))
dates = df_comb["Date"]
plt.plot(dates,df_comb["GDP_USD"],color ="red")
plt.title("Forecast Analysis (Nominal Qtrly GDP $ Billion) Using MV Linear Regression Model")
plt.ylabel("GDP in USD")
plt.xlabel("Dates")
plt.grid()
plt.tight_layout()
plt.show()
The final output of nominal GDP looks somewhat like this.
Note that the values before the inflection point marked with the “black circle” are actual values, and those after the inflection point are forecasts.
Also note that, as per this model and with inputs taken from the individual linear models (GDP_Curr, USD_Rate, GDP_Def), Indian nominal GDP will be approximately $1.15 trillion per quarter by 2032. This corresponds to an annual figure of roughly $4.6 to 4.7 trillion.
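The two-stage approach described here (extrapolate each input with its own straight line, then feed those values to the multivariable model) can be sketched end to end on synthetic data. Everything numeric below is made up for illustration. Only the column names follow the code above, and the way the target is derived is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

FEATURES = ["USD", "GDP_Curr", "Index", "GDP_Def"]

# Synthetic history (NOT real data): noisy upward trends for each input.
rng = np.random.default_rng(1)
n_hist, n_future = 48, 32
hist = pd.DataFrame({"Index": np.arange(n_hist)})
hist["USD"] = 45 + 0.6 * hist["Index"] + rng.normal(0, 1.5, n_hist)
hist["GDP_Curr"] = 2.0e6 + 1.0e5 * hist["Index"] + rng.normal(0, 1.0e5, n_hist)
hist["GDP_Def"] = 1.0 + 0.01 * hist["Index"] + rng.normal(0, 0.01, n_hist)
# Assumed target: GDP in USD billion, derived from Rs crore and the exchange rate.
hist["GDP_USD"] = hist["GDP_Curr"] / (100 * hist["USD"])

# Stage 1: extrapolate every input with its own straight-line model.
future = pd.DataFrame({"Index": np.arange(n_hist, n_hist + n_future)})
for column in ["USD", "GDP_Curr", "GDP_Def"]:
    trend = LinearRegression().fit(hist[["Index"]], hist[column])
    future[column] = trend.predict(future[["Index"]])

# Stage 2: train the multivariable model on history, then predict from the extrapolated inputs.
mv_model = LinearRegression().fit(hist[FEATURES], hist["GDP_USD"])
future["GDP_USD"] = mv_model.predict(future[FEATURES])

print(round(mv_model.score(hist[FEATURES], hist["GDP_USD"]), 4))
print(future[["Index", "GDP_USD"]].tail(3))
```

The forecast therefore inherits the assumptions of the individual trend lines: if any input departs from its straight-line path, the predicted GDP in USD shifts with it.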
The chart below shows the annual aggregated values, actual versus forecast. In this chart the values before FY2024 are actual and those from FY2024 onwards are forecasts.
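The annual aggregation could be done along the following lines. This is a sketch with illustrative numbers. It assumes the fiscal year runs from April to March and is named after the calendar year in which it ends, which the text above does not state explicitly.

```python
import pandas as pd

# Illustrative quarterly figures in USD billion (NOT the article's data).
quarterly = pd.DataFrame({
    "Date": pd.to_datetime([
        "2022-06-30", "2022-09-30", "2022-12-31", "2023-03-31",
        "2023-06-30", "2023-09-30", "2023-12-31", "2024-03-31",
    ]),
    "GDP_USD": [800.0, 810.0, 820.0, 830.0, 850.0, 860.0, 870.0, 880.0],
})

# Assumption: quarters ending April to December belong to the fiscal year ending next March.
month = quarterly["Date"].dt.month
quarterly["FY"] = quarterly["Date"].dt.year + (month >= 4).astype(int)

# Sum the four quarters of each fiscal year.
annual = quarterly.groupby("FY")["GDP_USD"].sum()
print(annual)
```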
Summary
The advantage of this analysis is that the model file (which has been trained using multivariable inputs) can be stored on disk as a pickle file and picked up for further analysis without going through the painful process of training it from scratch. This saves a lot of time and can be quite useful for forecasting and for taking vital policy decisions aimed at driving positive changes in the input variables, so as to achieve more growth than the model predicts.
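Reusing a stored model might look like the sketch below. It trains a small stand-in model so that the example is self-contained. The file name `demo_reg_pickle` and the input values are hypothetical, and with the real model the inputs would need the same columns, in the same order, as those used for training.

```python
import pickle
import numpy as np
from sklearn.linear_model import LinearRegression

# Small stand-in model (NOT the GDP model): y = 2*a + 3*b + 1 on made-up inputs.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = 2 * X[:, 0] + 3 * X[:, 1] + 1
model = LinearRegression().fit(X, y)

# Save the trained model to disk ...
with open("demo_reg_pickle", "wb") as file:
    pickle.dump(model, file)

# ... and, in a later session, load it back without retraining.
with open("demo_reg_pickle", "rb") as file:
    loaded = pickle.load(file)

what_if = np.array([[3.0, 2.0]])  # hypothetical user-defined inputs
prediction = loaded.predict(what_if)
print(prediction)  # approximately [13.]
```

A pickled model is best loaded with the same scikit-learn version that created it, and pickle files should only be opened when they come from a trusted source.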
(I am aggregating all the articles on this topic here, for easy discovery and reference.)