Regression forecasting and predicting - Practical Machine Learning Tutorial with Python p.5

In this video, make sure you define the X's like so. I flipped the last two lines by mistake:

X_lately = X[-forecast_out:]
X = X[:-forecast_out]

In many cases, you won't be able to scale new data together with your training data like this. Imagine if you were using gigabytes of data to train a classifier. It may take days to train your classifier, and you wouldn't want to be doing this every...single...time you wanted to make a prediction. Thus, you may need to either NOT scale anything or scale the data separately. As usual, you will want to test both options and see which is best in your specific case.
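
One way to read "scale the data separately" is to compute the scaling parameters once from the training data and reuse them for any new rows. The sketch below uses only the standard library and made-up numbers; `fit_scaler` and `apply_scaler` are hypothetical helper names, not part of the video's code. In scikit-learn, `StandardScaler` with `fit` on the training data and `transform` on new data plays a similar role.

```python
from statistics import mean, pstdev

def fit_scaler(column):
    """Return (mean, population std) computed from the training column only."""
    return mean(column), pstdev(column)

def apply_scaler(column, params):
    """Standardize values using previously stored parameters."""
    mu, sigma = params  # assumes sigma != 0, i.e. the column is not constant
    return [(value - mu) / sigma for value in column]

train = [10.0, 12.0, 14.0, 16.0]  # made-up training feature
new = [18.0]                      # made-up value that arrives later

params = fit_scaler(train)                   # computed once
train_scaled = apply_scaler(train, params)
new_scaled = apply_scaler(new, params)       # no need to rescale the training set
```

The trade-off is that new values are scaled relative to the old training distribution, so it is worth testing whether this helps or hurts in a given case.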

With that in mind, let's handle all of the rows from the definition of X onward.
Comments

To all the people who are confused here: he is predicting prices 1 month into the future, which is why there is a gap in the graph. He didn't add the last month's values, which are the ones he uses to make the next month's prediction ('next month' is the "Forecast" column here). It can be easily fixed by adding the X_lately values to the missing values in the gap.

fahimmdrafiq

When we defined our X's here, I made a mistake flipping two lines. You *should* have the block of X definitions looking like:

X = np.array(df.drop(['label'], axis=1))
X = preprocessing.scale(X)
X_lately = X[-forecast_out:]
X = X[:-forecast_out]

I flipped the final X with X_lately, effectively shifting our forecast out to the point of predicting today's price. Make sure your code looks like the code above!

sentdex
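
To see why the order of the last two assignments matters, here is a minimal sketch in which a plain Python list stands in for the scaled feature array (slicing works the same way on a NumPy array). The row values and `forecast_out = 3` are made up for illustration.

```python
forecast_out = 3
rows = list(range(10))  # rows 0..9; the last 3 rows have no label yet

# Correct order: take the unlabeled tail first, then trim it off.
X_lately = rows[-forecast_out:]
X = rows[:-forecast_out]

# Flipped order: the tail is taken from the already trimmed array,
# so it holds older rows that already have labels.
X_flipped = rows[:-forecast_out]
X_lately_flipped = X_flipped[-forecast_out:]

print(X_lately)          # [7, 8, 9]
print(X_lately_flipped)  # [4, 5, 6]
```

With the flipped order, the model forecasts from rows 4 to 6 instead of the most recent rows 7 to 9, which is the shift described above.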

Everything goes fine for some time, but then he introduces date, time, timestamp and unix.

SauravAgrawalBIT

If you're getting the data from your local directory instead of getting it directly from Quandl, you should set 'parse_dates=True' while reading it from the csv file. It should be as follows:


data = quandl.get('WIKI/GOOGL')
data.to_csv('googl.csv')
df = pd.read_csv('googl.csv', index_col='Date', parse_dates=True)


This will solve the issue.

maestro

@sentdex
Maybe at 1:25, the order of assigning values to X_lately and X should be reversed,
because after you say X = X[ : -forecast_out ], the new X no longer has the feature rows whose labels are NaN.

So maybe it should be:
X_lately = X[ -forecast_out : ]
X = X[ : - forecast_out]

mohammedcementwala

I'm so glad there's ChatGPT now to explain chunks of code and clarify things quickly

itsbxntley

+sentdex Just another important observation. Correct me if I'm wrong but I think the script doesn't answer the "problem" that you intended to answer.
When you shift by "forecast_out" days, you are NOT predicting the price FOR THE NEXT number of "forecast_out" days, but instead you are predicting the price at T + number of forecast_out days.

So basically, if forecast_out = 5, the code as it is says that if I want to predict the price tomorrow, then I should look at the features 5 days ago.

The plot should have the "label" (not the "Adj. Close") and the forecast.

nrh
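
The relationship between feature rows and labels can be sketched with plain lists. The prices and `forecast_out = 2` below are made up, and the list comprehension only mimics what shifting the price column up by `forecast_out` rows does; it is not the video's code.

```python
forecast_out = 2
close = [100, 101, 102, 103, 104, 105]  # made-up prices for days 0..5

# The label for day k is the price on day k + forecast_out;
# the last forecast_out days have no label yet.
label = [
    close[k + forecast_out] if k + forecast_out < len(close) else None
    for k in range(len(close))
]

# Rows without a label are the ones a fitted model would predict from.
unlabeled_days = [k for k, value in enumerate(label) if value is None]
# Each of those rows yields a forecast for a day forecast_out steps later.
predicted_for = [k + forecast_out for k in unlabeled_days]

print(label)           # [102, 103, 104, 105, None, None]
print(unlabeled_days)  # [4, 5]
print(predicted_for)   # [6, 7]
```

So every single prediction looks `forecast_out` days ahead of its own feature row, while the set of predictions as a whole covers the `forecast_out` days following the last known date.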

I am *_impressed_* at how good this is at predicting. I just ran the prediction for 30 days that have already passed and compared the curve to the actual one shown on Google, and it's exactly the same curve. The only problem is that it got skewed: it predicted the exact same time span as the last known date until now, but as if it were the last known date until half a month ago. I'm not even sure how that's possible.

arsnakehert

All of your videos are great, I can't believe this is free!

granthawkins

Thanks sentdex for your videos. These are helping in implementing many algorithms. But, I would like to point out one thing:

In this video, one fundamental reason why the accuracy is so high is the use of linear regression on price data. This is not the right way forward if you actually want to trade. Essentially, "accuracy" here represents the correlation between prices and lagged prices (lag = 30 days). Correlation tracks deviation from the mean.

Example: Say a stock moves from 400 to 1000 over 5 years, and the mean is 700. Now take a data point for some day, say 900. The prediction will obviously be close to 900, with a very low probability of being below 700. This prediction is then counted as a very good one in the calculation of "accuracy". Clearly, this does not add any value.

The correct way is to use the returns of the stock and try to predict those.

manojdureja
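
For reference, simple returns can be derived from a price series as in the sketch below. The prices are made up, `simple_returns` is a hypothetical helper name, and whether to use simple or log returns, and over which lag, is left open here.

```python
def simple_returns(prices, lag=1):
    """Return price[i] / price[i - lag] - 1 for every i that has a lagged price."""
    return [prices[i] / prices[i - lag] - 1.0 for i in range(lag, len(prices))]

prices = [400.0, 440.0, 418.0, 459.8]  # made-up closing prices
returns = simple_returns(prices)

# Roughly +10%, -5%, +10%
print([round(r, 4) for r in returns])
```

Unlike raw prices, these values do not trend with the price level, so a high score on them cannot come merely from the price staying near its recent value.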

I have one question concerning the timestamp. Our data is only collected from Monday to Friday, but the timestamps for our forecast tell us that on Sunday we got a different Adj. Close (Forecast) value than on Saturday, which is not possible. I think you should exclude the weekend from the timestamps.

Edit: I like your videos a lot! You are doing a great job! Thank you very much!!

georgkreuzmayr
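
A minimal sketch of generating forecast dates that skip weekends, using only the standard library. The function name `next_trading_days` and the start date are made up for illustration, and market holidays are ignored.

```python
from datetime import date, timedelta

def next_trading_days(last_day, count):
    """Return the next `count` weekdays after last_day (holidays are ignored)."""
    days = []
    current = last_day
    while len(days) < count:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday is 0, Friday is 4
            days.append(current)
    return days

last_known = date(2024, 1, 5)  # a Friday, chosen only for illustration
forecast_dates = next_trading_days(last_known, 3)
print(forecast_dates)  # Monday 8th, Tuesday 9th, Wednesday 10th of January 2024
```

Each forecast value could then be assigned to one of these dates instead of to consecutive calendar days.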

Sentdex, sorry to say this, but it is not a prediction one month into the future. At the last step, when you plot them, you made a mistake. df['Adj. Close'] runs from the first row to the "-30" row of the original dataframe. It is 30 days shorter than the original dataframe because of df.dropna(inplace=True) right before you define 'y' in order to make the lengths of X and y the same. The "forecast set" should start on the date right after the last date of the original dataframe and run for 30 days from there.

What you plot is: shorter dataframe's Adj. Close + df[Forecast_set]
What you should have plotted is: original dataframe's Adj. Close + df[Forecast_set]
Because what you have predicted is an entirely new next 30 days.

Please check with it and give me some feedback on this. I appreciate it.

minjun

The one-line "for" totally deadlocked me... C programmer here

aes
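
The one-line "for" is presumably a list comprehension (an assumption, since the line itself is not quoted here). The sketch below shows a comprehension next to the equivalent explicit loop; `n_columns` and `forecast_value` are made-up names and values.

```python
n_columns = 4           # hypothetical number of columns in the dataframe
forecast_value = 123.4  # hypothetical predicted price

# One-line form: a list comprehension builds the empty cells,
# then the forecast is appended as the last cell.
row = [None for _ in range(n_columns - 1)] + [forecast_value]

# The same thing written as an explicit loop, closer to C style.
row_loop = []
for _ in range(n_columns - 1):
    row_loop.append(None)
row_loop.append(forecast_value)

print(row)  # [None, None, None, 123.4]
```

The underscore is just a conventional name for a loop variable whose value is never used.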

You're literally the most useful and understandable ML teacher on Youtube holy crap

VicPunai

For those who get a ValueError saying y is not the same length as X: you shouldn't call df.dropna before defining X and y

Patrick-igcn

In the first 4 minutes, when you specified:
X_lately = X[-forecast_out_data:]
X = X[:-forecast_out_data]
you changed the shape (number of rows) of X, so it is 30 rows smaller, but you left y the same. So when I use:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
I get an error because X and y must be the same size. I don't know how you didn't get an error.
Does someone know what I'm supposed to do with y? The same as with X?

ЂорђеБранковић
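
A minimal sketch of how X and y can end up with the same length, using plain lists and made-up values in place of the dataframe. The key assumption is that the label column has missing values in its last `forecast_out` rows, and those are dropped before y is built.

```python
forecast_out = 2
features = [[1.0], [2.0], [3.0], [4.0], [5.0]]  # one made-up feature per row
label = [3.0, 4.0, 5.0, None, None]             # last forecast_out labels are missing

# Split off the rows that have no label, then trim them from X.
X_lately = features[-forecast_out:]
X = features[:-forecast_out]

# Drop the missing labels so y lines up row for row with X.
y = [value for value in label if value is not None]

print(len(X), len(y))  # 3 3
```

If the missing labels are not removed, y keeps all five entries and any train/test split that requires equal lengths will fail.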

Dear Sir,

Thanks for the wonderful tutorial. I have a doubt.

My understanding of what is being done:

Suppose the features are from a date K and the label is from the date K+30.

So when we predict using the features of, say, 1 October, aren't we predicting the label (price) for 31 October?

So we are actually not predicting for the next 30 days, but for the 30 days after the next 30 days.

Please clarify.

lingobol

very nice video and series, I like that it's in the form of many short videos. PS: I paused the video to understand the loop part before you explained it. I am an older developer but new to Python :)

devnull

Hey, did you ever get this to work? I'm trying to figure it out and can't; it's forecasting only up to the current date.

noneyabi

To everyone having problems with the data not predicting but rather repeating the last 30 days: the solution is to go to the line that has .shift and delete the negative sign in front of forecast_out. Then come down to df['Adj. Close'] and replace 'Adj. Close' with 'label'.
Cheers!

jairad