Python: Automating Data Extraction from FRED
We all know FRED provides access to high-quality economic data. This data can be downloaded from the website as CSV files and used for financial analysis, but doing so manually raises two challenges: (A) the file has to be downloaded each and every time it is needed; (B) files from different series have to be combined into a single dataframe for analysis. In this note, we will describe how Python can be leveraged to solve both problems.
Downloading Data
Fortunately, FRED provides an elegant solution: an API key, which is easy to obtain through a simple registration process. Here is the code one can use for this purpose.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
plt.style.use("fivethirtyeight")
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
color_pal = plt.rcParams["axes.prop_cycle"].by_key()["color"]
from fredapi import Fred
fred_key = "Paste Your API Key Here"
# Create a Fred object using the API key
fred = Fred(fred_key)
# Search the FRED system for information on data series;
# here we search for all series matching "S&P"
df = fred.search("S&P")
The output dataframe df will look something like this.
Note that this dataframe has a column named “id”, which can be used to extract the time-series data from the website. Before we do that, we need to identify the specific series we want. One way is to check the names of the dataframe’s columns, which can give us clues about how to identify the series to extract. The columns are listed below.
df.columns
Index(['id', 'realtime_start', 'realtime_end', 'title', 'observation_start',
'observation_end', 'frequency', 'frequency_short', 'units', 'units_short',
'seasonal_adjustment', 'seasonal_adjustment_short', 'last_updated',
'popularity', 'notes'], dtype='object')
df.dtypes
id object
realtime_start datetime64[ns]
realtime_end datetime64[ns]
title object
observation_start datetime64[ns]
observation_end datetime64[ns]
frequency object
frequency_short object
units object
units_short object
seasonal_adjustment object
seasonal_adjustment_short object
last_updated object
popularity object
notes object
dtype: object
Note the column named “popularity”. Using this metric we can identify the most popular series. Before sorting on it, though, we need to cast the column to an appropriate numeric datatype, say float or integer.
# Cast the popularity column to float
df["popularity"] = df["popularity"].astype(float)
# Sort the dataframe in descending order of popularity
df = df.sort_values("popularity", ascending=False)
df.head()
The portion of the output dataframe looks something like this.
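Since running the search itself requires an API key, the cast-and-sort step can also be sanity-checked on a toy frame. Here is a minimal sketch; the popularity values below are invented for illustration:

```python
import pandas as pd

# Toy stand-in for the metadata frame returned by fred.search("S&P")
df = pd.DataFrame({
    "id": ["SP500", "WILL5000", "NASDAQCOM"],
    "popularity": ["85", "40", "70"],  # strings, to show why the cast matters
})

# Cast to float so the sort is numeric rather than lexicographic
df["popularity"] = df["popularity"].astype(float)
df = df.sort_values("popularity", ascending=False)
print(df.head())
```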
Now let’s say we want to pick the series with the id SP500 and download its data using that as the handle; we can do so with the following code.
# Downloading time-series data from FRED
spdf = fred.get_series("SP500")
spdf = spdf.dropna()
# Filtering the data for the period after 1 Jan 2020
filt = spdf.index > "2020-01-01"
spdf = spdf[filt]
# Picking the starting value as the reference for comparison
ref_value = spdf.values[0]
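The filter-and-normalize pattern itself does not need FRED at all. Here is a minimal sketch on a synthetic series (all dates and values below are made up) showing the same steps:

```python
import pandas as pd

# Synthetic stand-in for the series returned by fred.get_series("SP500")
idx = pd.to_datetime(["2019-12-31", "2020-01-02", "2020-06-30", "2021-01-04"])
s = pd.Series([3230.0, 3257.0, 3100.0, 3700.0], index=idx)

# Keep only observations after 1 Jan 2020
s = s[s.index > "2020-01-01"]

# Normalize by the first retained value so the series starts at 1.0
ref_value = s.values[0]
ratios = s / ref_value
print(ratios.round(3))
```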
Plotting Data
The next step is to plot this data for visualization. We can do so using the following code.
fig = px.line(spdf, x=spdf.index, y=spdf.values/ref_value,
              title='S&P 500 - Ratios from the chosen reference date',
              height=500, width=1000,
              labels={'index': 'Date', 'y': 'Ratios'})
fig.update_traces(line=dict(color="Blue", width=2))
# Pick colors from the website below
# https://www.w3schools.com/colors/colors_picker.asp
fig.update_layout(paper_bgcolor='rgba(0,0,0,0)', plot_bgcolor='#ccffff')
fig.show()
The output from this code is captured in the following chart.
You can use the same dataframe to draw a histogram, which indicates the levels around which the S&P 500 has spent most of its time, using the following code.
fig = px.histogram(spdf, x=spdf.values, height=600, width=1000,
                   labels={'x': 'S&P500', 'y': 'Count'})
fig.update_layout(paper_bgcolor='rgba(0,0,0,0)', plot_bgcolor='#e0edfb')
fig.show()
The output histogram from this code is shown below.
Looking at the histogram, one can see that the S&P 500 has spent most of the period from 1 Jan 2020 to date around the 3900 level.
Combining Series
What we have done so far is useful but not especially interesting on its own, because the real value emerges only when we combine data from diverse series to enable more detailed analysis.
Doing that requires a few steps, and the purpose of this section is to lay them out.
Let’s say we want to compare the US unemployment rate with the changing value of the index.
Step 1 — Extract the unemployment-rate data using the code below.
unemp = fred.search("unemployment")
unemp["popularity"] = unemp["popularity"].astype(float)
unemp = unemp.sort_values("popularity", ascending=False)
undf = fred.get_series("UNRATE")
undf = undf.to_frame()
undf.columns = ["UNRATE"]
undf
The output dataframe from this code is shown in the screenshot below.
Step 2 — Convert the spdf dataframe to an index that aligns with the undf dataframe.
The spdf dataframe is indexed at daily frequency. We can use pandas to resample it to a business-month frequency, but we will face some issues, as the snapshot below shows.
Note the missing value for the month ending May 2021. The business-month resample does not automatically carry forward the preceding non-missing value, so to build the month-end series ourselves we can use the following code.
# Dictionary to store the values (month-end date as key, S&P index as value)
dic = {}
years = [2020, 2021, 2022, 2023]
months = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
from tqdm.notebook import tqdm
import itertools
import calendar
import datetime
for year, month in tqdm(itertools.product(years, months)):
    try:
        index = str(year) + "-" + str(month)
        date = datetime.datetime(year, month, 1).date()
        # Replace the day with the last day of that month
        key = date.replace(day=calendar.monthrange(date.year, date.month)[1])
        # Last available observation within the month
        val = spdf[index].iloc[-1]
        print(key, " ", val)
        dic[key] = val
    except (KeyError, IndexError):
        pass
# Converting the dictionary into a dataframe
spdf = pd.DataFrame(dic, index=[0]).T
spdf.columns = ["S&P500"]
# Processing the dataframe to calculate the ratios
spdf = spdf / ref_value
The output dataframe is captured in the following picture.
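As an aside, the hand-rolled loop above can usually be replaced with pandas’ own resampling. Here is a sketch on a synthetic daily series (the values are made up) that takes the last observation of each month and forward-fills any month with no data:

```python
import pandas as pd

# Synthetic daily series with a deliberate gap covering all of May
idx = pd.to_datetime(["2021-03-30", "2021-03-31", "2021-04-29", "2021-06-01"])
daily = pd.Series([3950.0, 3970.0, 4180.0, 4200.0], index=idx)

# Last observation of each calendar month; months with no data become NaN
monthly = daily.resample("M").last()

# Carry the preceding value forward to fill the empty month (May here)
monthly = monthly.ffill()
print(monthly)
```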
Step 3 — Converting the indexes of both dataframes to Month-Year
We can do so using the following code.
# Converting the index of spdf to datetime
spdf.index = pd.to_datetime(spdf.index, format="%Y-%m-%d")
# Extracting Month_Year from both dataframes' indexes
spdf['Month_Year'] = spdf.index.to_period('M')
undf['Month_Year'] = undf.index.to_period('M')
# Setting the index of both dataframes to Month_Year
undf = undf.reset_index()
undf = undf.set_index("Month_Year")
spdf = spdf.reset_index()
spdf = spdf.set_index("Month_Year")
# Joining both dataframes into a single dataframe
comb = pd.concat([spdf, undf], axis=1, ignore_index=False)
# Keeping only the useful columns
comb = comb.loc[:, ["S&P500", "UNRATE"]]
The output dataframe from this code is captured below.
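The period-index alignment used in Step 3 can be checked on synthetic frames (the numbers below are invented): reducing both indexes to monthly periods makes the rows line up in pd.concat even when the underlying daily dates differ.

```python
import pandas as pd

# Two synthetic frames with different daily anchors but overlapping months
sp = pd.DataFrame({"S&P500": [1.00, 1.05]},
                  index=pd.to_datetime(["2020-01-31", "2020-02-28"]))
un = pd.DataFrame({"UNRATE": [3.6, 3.5]},
                  index=pd.to_datetime(["2020-01-01", "2020-02-01"]))

# Reduce both indexes to the same monthly period so the rows line up
sp.index = sp.index.to_period("M")
un.index = un.index.to_period("M")

# Column-wise join on the shared monthly index
comb = pd.concat([sp, un], axis=1)
print(comb)
```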
Step 4 — Filtering the rows, forward-filling missing values, and converting the period index to string type.
This we can accomplish using the following code.
filt = comb.index > "2019-12"
comb = comb[filt].ffill()
comb = comb.reset_index()
comb['Month_Year'] = comb['Month_Year'].dt.strftime('%Y-%m')
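Step 4 can be tried end-to-end on a synthetic frame (the values are invented): forward-fill the gap, then format the period index as plain year-month strings.

```python
import pandas as pd

# Synthetic frame indexed by monthly periods, with one missing UNRATE value
comb = pd.DataFrame(
    {"S&P500": [1.00, 1.02, 0.95], "UNRATE": [3.6, None, 4.4]},
    index=pd.period_range("2020-01", periods=3, freq="M"),
)
comb.index.name = "Month_Year"

# Forward-fill gaps, then turn the period index into plain strings for plotting
comb = comb.ffill().reset_index()
comb["Month_Year"] = comb["Month_Year"].dt.strftime("%Y-%m")
print(comb)
```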
Step 5 — Plotting the dataframe to draw a comparison
One can accomplish this using the following code.
import plotly.graph_objs as go
fig = go.Figure()
data = {
    'Date': comb["Month_Year"],
    'S&P Returns': (comb["S&P500"].values - 1) * 100,
    'Unemp Rate': comb['UNRATE'].values
}
fig.add_trace(go.Scatter(
    x=data['Date'],
    y=data['S&P Returns'],
    name="S&P Returns")
)
fig.add_trace(go.Scatter(
    x=data['Date'],
    y=data['Unemp Rate'],
    name="Unemp Rate")
)
fig.update_layout(
    title="Comparison - US S&P Returns & Unemployment Rates",
    xaxis_title="Date",
    yaxis_title="% Returns",
    legend_title="Metric Comparison",
    font=dict(
        family="Arial",
        size=20,
        color="green"
    ),
    height=600, width=1000
)
fig.show()
The output of this code is shown in the chart below.
Note that this is just a sample of the automation a user can build. By leveraging the immense capabilities of the freely available tools out there, one can perform complex analysis that is otherwise not possible by looking at the standard output available for consumption. Hope you found this useful. Thanks for reading.