Adding other economic indicators - p.14 Data Analysis with Python and Pandas Tutorial

Hello everyone and welcome to Part 14 of our Data Analysis with Python and Pandas for Real Estate Investing tutorial series. We've come quite a long way here, and the next, and final, macro step we want to take involves looking into economic indicators to see their impact on housing prices, or the HPI.

There are two major economic indicators that come to mind right out of the gate: the S&P 500 index (stock market) and GDP (Gross Domestic Product). I suspect the S&P 500 will be more correlated with housing prices than GDP, but GDP is usually a better overall economic indicator, so I may be wrong. Another macro indicator that I suspect might have value here is the unemployment rate. If you're unemployed, you're probably not getting that mortgage. We'll see, though. We've already been through the process of adding more data points, so I do not see much point in dragging you all through it again. There is one new thing to note, however: in the HPI_Benchmark() function, we're changing the "United States" column to US_HPI. This makes a bit more sense now that we're bringing in other values.
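
For reference, the rename itself is a one-liner. A minimal sketch of the benchmark function with it in place might look like the following (this assumes the FMAC/HPI_USA dataset on Quandl still returns a column named "United States", as it did when this part was recorded):

import quandl

api_key = 'YourQuandlAPIKey'  # placeholder - use your own key

def HPI_Benchmark():
    df = quandl.get("FMAC/HPI_USA", authtoken=api_key)
    # express the index as percent change from its first value, like the state data
    df["United States"] = (df["United States"] - df["United States"][0]) / df["United States"][0] * 100.0
    # the one new step in this part: rename the column so it reads cleanly next to other indicators
    df.rename(columns={'United States': 'US_HPI'}, inplace=True)
    return df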

For GDP, I couldn't find a dataset that encompassed the full time frame. I am sure you can find one somewhere with this data, maybe even on Quandl; sometimes you have to do some digging. I also had trouble finding a nice long-term monthly unemployment rate. I did find an unemployment level, but we really want more of a percentage/rate; otherwise we would need to divide the unemployment level by the population. We could do that if we decide the unemployment rate is worth having, but we'll work with what we have first.
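
If we did go the unemployment-level route, the conversion is just a division. Sketch only; the column names here ('Unemployment Level', 'Civilian Labor Force') are hypothetical stand-ins for whatever the datasets actually use:

# hypothetical column names; divide the level by the labor force/population to get a rate
df['Unemployment Rate'] = df['Unemployment Level'] / df['Civilian Labor Force'] * 100.0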

Comments

Difficult tutorial here, because a lot of the data sources are defunct, so it gets a bit tangled in the middle. I kept going (persevere!) and eventually got to the HPI.pickle at the end...
Excellent series, Harrison - thanks.

markd

Loving your vids, moving through them quick!

rshrott

In case anyone using the latest releases of various modules in 2016 is getting KeyError: 'Column not found: sp500', or is just having issues running the copy-pasted code, you might need to update your resampling lines. Even though IDLE is only really giving me FutureWarning messages, it's still causing issues elsewhere for some reason. All you really have to do is add .mean() to the end of df = df.resample('M'), etc.

Additionally, if you like, I believe you can replace the two lines of resampling in the Unemployment function with a single line, df = df.resample('M').sum(), and obtain the same results.
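
In other words, roughly (a sketch of the old vs. new resampling calls):

# older pandas implied the aggregation when resampling
# df = df.resample('M', how='mean')
# newer pandas returns a Resampler object, so the aggregation has to be explicit
df = df.resample('M').mean()
# and for the unemployment function, the two resample lines can collapse into one
df = df.resample('M').sum()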

theworstbird

Hey Harrison, thanks so much for the tutorials.
I got the following error:

File "C:\Python27\lib\site-packages\pandas\tseries\resample.py", line 102, in resample
raise TypeError('Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex')
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex

Do you have any idea why? Thanks!
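
(For anyone who hits the same TypeError: it usually means the dataframe's index is not date-based, for example right after reading from a CSV, so resample() has nothing to group on. A sketch of the usual fix, assuming the frame has a 'Date' column:)

import pandas as pd

df.set_index('Date', inplace=True)
df.index = pd.to_datetime(df.index)   # make it a DatetimeIndex so resample() works
df = df.resample('M').mean()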

davez

Hi Sentdex, why do you use Python IDLE instead of some better Python IDEs?

luguk

I'm getting "(Status 404) (Quandl Error QECx02) You have submitted an incorrect Quandl code. Please check your Quandl codes and try again." on YAHOO/INDEX_GSPC. I know very little about stock indexes. Any idea for a good replacement?

MarcelJanKr

For anyone trying to follow this, YAHOO/INDEX_GSPC is no longer available. I ended up using the pandas_datareader module to get the data from Yahoo. Yahoo also includes the adjusted close. Here's the code I used:

# use pip install pandas-datareader if you don't have it already
from pandas_datareader import data as wb

def sp500_data():
    ticker = "^GSPC"
    df = wb.DataReader(ticker, start='1975-01-01', data_source='yahoo')
    # express the adjusted close as percent change from its first value
    df["Adj Close"] = (df["Adj Close"] - df["Adj Close"][0]) / df["Adj Close"][0] * 100.0
    # resample returns a new object, so reassign and aggregate to month-end
    df = df.resample('M').mean()
    df.rename(columns={'Adj Close': 'sp500'}, inplace=True)
    return df['sp500']


For the unemployment rate, I found EIA/STEO_XRUNR_M, which is the civilian unemployment rate. Not exactly the same, but it provides monthly data for the purposes of this tutorial.
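
A sketch of how that dataset can slot into the same function pattern used elsewhere in the series (assuming it exposes a 'Value' column, as the other replacement economic datasets mentioned below do):

# assumes quandl is imported and api_key is set as in the rest of the series
def us_unemployment():
    df = quandl.get("EIA/STEO_XRUNR_M", trim_start="1975-01-01", authtoken=api_key)
    df["Value"] = (df["Value"] - df["Value"][0]) / df["Value"][0] * 100.0
    df = df.resample('M').mean()
    df.rename(columns={'Value': 'Unemployment Rate'}, inplace=True)
    return df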

MrIrishMoore

Those datasets are no longer available since Quandl dropped the YAHOO- and GOOG-coded ones. Do you have any idea what they've been renamed to, or whether there's a replacement? Thanks in advance!

vladsirbu

Hi sentdex,
Thank you for the videos! They are awesome. Just a quick question: I'm having problems with the SP500 dataframe. My code is almost the same as the code you used; here it is:
"def sp500_data():
df1= quandl.get("YAHOO/INDEX_GSPC", trim_start="1975-01-01", authtoken=api_key)
df1["ADJUSTED CLOSE"]= ((df1["ADJUSTED CLOSE"]-df1["ADJUSTED CLOSE"][0])/df1["ADJUSTED CLOSE"][0])*100
df1=df1.resample('M').mean()
df1.rename(columns={'ADJUSTED CLOSE':'sp500'}, inplace=True)
df1=df1['sp500']
return df1
"
and the error is
"Traceback (most recent call last):
File "C:\Users\amrut\AppData\Local\Programs\Python\Python36-32\psp.py", line 97, in <module>
sp500=sp500_data()
File "C:\Users\amrut\AppData\Local\Programs\Python\Python36-32\psp.py", line 69, in sp500_data
df1["ADJUSTED CLOSE"]= ((df1["ADJUSTED CLOSE"]-df1["ADJUSTED CLOSE"][0])/df1["ADJUSTED CLOSE"][0])*100
File "C:\Users\amrut\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\core\frame.py", line 2059, in __getitem__
return self._getitem_column(key)
File "C:\Users\amrut\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\core\frame.py", line 2066, in _getitem_column
return self._get_item_cache(key)
File "C:\Users\amrut\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\core\generic.py", line 1386, in _get_item_cache
values = self._data.get(item)
File "C:\Users\amrut\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\core\internals.py", line 3543, in get
loc = self.items.get_loc(item)
File "C:\Users\amrut\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\indexes\base.py", line 2136, in get_loc
return
File "pandas\index.pyx", line 132, in (pandas\index.c:4433)
File "pandas\index.pyx", line 154, in (pandas\index.c:4279)
File "pandas\src\hashtable_class_helper.pxi", line 732, in (pandas\hashtable.c:13742)
File "pandas\src\hashtable_class_helper.pxi", line 740, in (pandas\hashtable.c:13696)
KeyError: 'ADJUSTED CLOSE'
"
Please let me know where I am going wrong. Thanks in advance.

amrutshintre

Hello, for those unsure about how to replace the old Quandl requests:

Here are the ones I used, which match the 1990 start date at 12:01 in the video:

SP500 -> df = quandl.get("MULTPL/SP500_REAL_PRICE_MONTH", trim_start="1975-01-01", authtoken=api_key)

USGDP -> df = quandl.get("FRED/GDPSOPQ027S", trim_start="1975-01-01", authtoken=api_key)

US_UNEMPLOYMENT -> df = quandl.get("EIA/STEO_XRUNR_M", trim_start="1975-01-01", authtoken=api_key)

NB: you will have to change the percentage-change calculation based on the column titles in each dataset. All three use a column titled 'Value' rather than the 'Adjusted Close' used for the S&P.
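
For example, with the MULTPL dataset the S&P function ends up looking roughly like this (sketch only; it assumes quandl and api_key are set up as elsewhere in the series, and that the replacement dataset returns a 'Value' column):

def sp500_data():
    df = quandl.get("MULTPL/SP500_REAL_PRICE_MONTH", trim_start="1975-01-01", authtoken=api_key)
    # these replacement datasets use a 'Value' column, not 'Adjusted Close'
    df["Value"] = (df["Value"] - df["Value"][0]) / df["Value"][0] * 100.0
    df = df.resample('M').mean()
    df.rename(columns={'Value': 'sp500'}, inplace=True)
    return df['sp500']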

matthewchisambi

Hello, could you explain more about the HTMLParser module? :D

SungHaJunTMT

Hi Sentdex,
I'm having an issue with running the sample code from the link in the description. Basically, every time I run the program, HPI_Bench = HPI_Benchmark() encounters a KeyError on 'United States'. I'm on Python 3.

Do you know why this would be happening?

seanturner

YAHOO/INDEX_GSPC is defunct... What to do?

eastwoodsamuel

For some reason I have problems with the SP500 dataframe - it's acting weird, as if it's not being recognized as a dataframe anymore. Printing head() doesn't give any column label, but does give an extra line: Freq: M, Name: Adjusted Close, dtype: float64

UPDATE: managed to solve that. I had to wrap it back up as a dataframe, since selecting a single column like that returns a Series rather than a DataFrame: df2 = pd.DataFrame(df["Adjusted Close"])

Also - notice that the GDP data needs to be monthly as well, so you would need to resample it, e.g. df = df.resample('M').sum()

RealMcDudu

In case you are looking for working code, since the unemployment data and the Yahoo SP500 data are unavailable, here it is. I used alternative data sources.


import pandas as pd
import quandl
import matplotlib.pyplot as plt
from matplotlib import style

style.use('fivethirtyeight')

api_key = 'Your key'

# Code to get Housing Price Index of all fifty states
def getInitialData():
    # Grab the table of states from Wikipedia (assumed source; it is where the
    # 'Name &postal abbreviation' column labels below come from)
    fifty_states = pd.read_html('https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States')
    fifty_states = fifty_states[0]

    # Keep just the postal-abbreviation column for the fifty states
    fifty_states = fifty_states['Name &postal abbreviation[12]', 'Name &postal abbreviation[12].1']
    fifty_states = fifty_states[0:50]

    URLs = []
    for abbrv in fifty_states:
        URLs.append('FMAC/HPI_' + abbrv)

    fifty_states_housing_data = []
    i = 0
    for url in URLs:
        df = quandl.get(url, authtoken=api_key)

        # Keep the seasonally adjusted series and name it after the state
        df.drop(columns='NSA Value', inplace=True)
        df.rename(columns={"SA Value": fifty_states[i]}, inplace=True)

        # Express each state's HPI as percent change from its first value
        df[fifty_states[i]] = ((df[fifty_states[i]] - df[fifty_states[i]][0]) / df[fifty_states[i]][0]) * 100

        fifty_states_housing_data.append(df)
        i += 1

    joined = pd.concat(fifty_states_housing_data, axis=1)
    # Save this out (e.g. joined.to_csv(...)) and point the read_csv call below at that file
    return joined



def HPI_Benchmark():
    benchmark = quandl.get("FMAC/HPI_USA", authtoken=api_key)
    benchmark['SA Value'] = ((benchmark['SA Value'] - benchmark['SA Value'][0]) / benchmark['SA Value'][0]) * 100
    benchmark.drop(columns='NSA Value', inplace=True)
    benchmark.rename(columns={"SA Value": "US_HPI"}, inplace=True)
    return benchmark

def mortgage_30y():
    df = quandl.get("FMAC/MORTG", trim_start="1975-01-01", authtoken=api_key)
    df["Value"] = (df["Value"] - df["Value"][0]) / df["Value"][0] * 100
    df = df.resample('M').mean()
    df.columns = ['M30']
    return df

def sp500_data():
    df = pd.read_csv('^GSPC.csv')
    df.set_index('Date', inplace=True)
    df.index = pd.to_datetime(df.index)
    df["Adj Close"] = (df["Adj Close"] - df["Adj Close"][0]) / df["Adj Close"][0] * 100.0
    df = df.resample('M').mean()
    df.rename(columns={'Adj Close': 'sp500'}, inplace=True)
    df = df['sp500']
    return df

def gdp_data():
    df = quandl.get("BCB/4385", trim_start="1975-01-01", authtoken=api_key)
    df["Value"] = (df["Value"] - df["Value"][0]) / df["Value"][0] * 100.0
    df = df.resample('M').mean()
    df.rename(columns={'Value': 'GDP'}, inplace=True)
    df = df['GDP']
    return df

def us_unemployment():
    df = quandl.get("FRED/UNRATE", trim_start="1975-01-01", authtoken=api_key)
    df["Value"] = (df["Value"] - df["Value"][0]) / df["Value"][0] * 100.0
    df.rename(columns={'Value': 'Unemployment Rate'}, inplace=True)
    df = df.resample('1D').mean()
    df = df.resample('M').mean()
    return df

HPI_data = pd.read_csv('Path to your fifty states data')
HPI_data.set_index('Date', inplace=True)
HPI_data.index = pd.to_datetime(HPI_data.index)
m30 = mortgage_30y()
sp500 = sp500_data()
gdp = gdp_data()
HPI_Bench = HPI_Benchmark()
unemployment = us_unemployment()
m30.columns = ['M30']
HPI = HPI_Bench.join([m30, sp500, gdp, unemployment])
HPI.dropna(inplace=True)
print(HPI.corr())

aaronge

Is there an updated syntax for renaming the columns? I keep getting errors with the code, and it seems to be because it can't find columns like sp500. If I go into the sp500 function and just print the dataframe, sure enough the column names aren't changing... I've been fooling around with this for the past 20 minutes with no luck. If I figure it out before a response, I'll post my solution.
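
(One common cause, in case it helps: rename() returns a new dataframe unless you either reassign the result or pass inplace=True, so a bare call silently changes nothing. Also double-check that the key in the rename dict matches the column name the dataset actually returns, e.g. 'Adjusted Close' vs. 'Adj Close'. Sketch:)

# either reassign the result...
df = df.rename(columns={'Adjusted Close': 'sp500'})
# ...or rename in place; a bare df.rename(...) on its own does nothing
df.rename(columns={'Adjusted Close': 'sp500'}, inplace=True)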

MattCamp