Resampling - p.9 Data Analysis with Python and Pandas Tutorial


Welcome to another data analysis with Python and Pandas tutorial. In this tutorial, we're going to be talking about smoothing out data by removing noise. There are two main methods to do this. The most popular is called resampling, though it goes by many other names. This is where we have some data that is sampled at a certain rate. For us, the Housing Price Index (HPI) is sampled once a month. In principle, the HPI could be sampled more often (every week, every day, every minute, or more), and we can also resample it to a lower rate, such as every year or every 10 years.

Another environment where resampling almost always occurs is stock prices. Stock prices are intra-second. What usually winds up happening, though, is that free data is resampled to minute data at the finest. You can buy access to live data, however. On a long-term scale, the data will usually be sampled daily, or even every 3-5 days. This is often done to keep the size of the data being transferred low. For example, over the course of, say, one year, intra-second data usually runs to multiple gigabytes, and transferring all of that at once is unreasonable: people would be waiting minutes or hours for pages to load.

Using our current data, which is sampled once a month, how might we resample it to once every 6 months, or every 2 years? Try to think about how you might personally write a function to perform that task. It's a fairly challenging one, but it can be done. It's also a fairly computationally expensive job, but Pandas has our backs and does it very fast.
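
As a rough sketch of what that looks like in Pandas, here is a small example. It does not use the real HPI data: the series below is made up (24 monthly values, 0 through 23), and the names `hpi`, `six_month`, and `two_year` are just illustrative. The start-anchored frequency aliases "MS", "6MS", and "2YS" are also a choice made for this sketch, so treat it as one possible way to do it rather than the exact code from the video.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the monthly HPI data: 24 made-up monthly values.
# resample() only works on a DatetimeIndex, TimedeltaIndex, or PeriodIndex.
idx = pd.date_range("2000-01-01", periods=24, freq="MS")  # month-start stamps
hpi = pd.Series(np.arange(24, dtype=float), index=idx, name="HPI")

# Downsample by averaging: one value per 6-month block,
# then one value per 2-year block.
six_month = hpi.resample("6MS").mean()
two_year = hpi.resample("2YS").mean()

print(six_month)
print(two_year)
```

Swapping `.mean()` for another aggregation such as `.ohlc()` changes how each block is summarized without changing how the blocks are formed.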


Comments

Hi Harrison,
Hope you're well.
Was wondering if you could do a series on a couple of deep learning algorithms.
Best Regards
Andrew

andrewczeizler

Thank you, sir, for the great tutorials.

onmoog-xycs

Hi Harrison,
Just an FYI, when I ran this today I got the following message:

FutureWarning: how in .resample() is deprecated the new syntax is .resample(...).mean()

resample('A', how=mean) worked, but gave the warning.
resample('A').mean() and resample('A').ohlc() both worked as well, but with no warning.

RichardDurham

The text version of this tutorial is awesome. Thanks!

spicytuna

Hi Sentdex, thanks for the videos! Could you advise me on how I should go about averaging the full data for each time period, i.e., averaging all the states' housing prices for every month? Thanks!

samuelchia

How does it know that the data is monthly? Is the date column a date type? Because if it is just a string, then pandas wouldn't know that it is monthly, right?

sameerzahid

Problem:
"FutureWarning: how in .resample() is deprecated the new syntax is .resample(...).mean()

resample('A', how=mean) worked, but gave the warning.
resample('A').mean() and resample('A').ohlc() both worked as well, but with no warning"
Solution: use
data['txyr1']

sayyamahmed

Why does the opening value of year t+1 not exactly match the closing value of year t?
For instance, why is
1975-12-31 close: 6.336776 not equal to
1976-12-31 open: 6.5796775
Or am I wrong to assume that the open of a year actually dates back to the first of January of that year?

patrickmullan

I feel the graph displayed is not the mean. It can't be that nearly all the monthly data is lower than the annual average.

yuanhu

Could you please also do a pairs trading tutorial?
Many thanks,
Andrew

andrewczeizler

Can you explain Kalman filter GPS estimation in Python?

asifnizamani

OK, if you have a problem with "TX", replace it with "TX NSA Value".

FlottiFlotta

For some reason, the label = 'TX...' isn't producing a label.

LongBoy.

I'm not able to solve this error:

File "F:\ML\anaconda\lib\site-packages\pandas\core\resample.py", line 1116, in _get_resampler
"but got an instance of %r" % type(ax).__name__)

TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'Int64Index'

hrushikeshgargote

Hi Harrison, I tried resampling my data and got an error: "Only valid with DatetimeIndex". I pulled the data from Keen.IO, and it returns a timeframe with the start and end date in the same column. Any help you could provide would be awesome!
Thanks!

ciobolurker