R Stats: Multiple Regression - Variable Preparation

This video gives a quick overview of constructing a multiple regression model in R to estimate vehicle prices from their characteristics. The video focuses on how to prepare variables while performing a stepwise regression with backward elimination of variables. The lesson explains how to transform highly skewed variables (using a log10 transform) and how to report their characteristics afterwards, how to check variables for normality, for multicollinearity (using Variance Inflation Factors) and for extreme values (using Cook's distance). The process is guided by measures of model quality, such as the R-squared and adjusted R-squared statistics, and by the variables' p-values, which reflect the confidence in their coefficients. As always, the final model is evaluated by calculating the correlation between predicted and actual vehicle prices for both the training and validation data sets, with correction for the previously transformed variables. The explanation is deliberately informal and avoids the more complex statistical concepts. Note that visual presentation and interpretation of multiple regression results will be explained in the next lesson.
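
As a rough illustration of the workflow described above, here is a minimal R sketch. It is not the author's script: the file name Auto.csv and the predictors Curb.weight, Peak.rpm and Horsepower are assumptions based on the comments below, and step() with AIC stands in for the manual p-value-driven backward elimination shown in the video.

library(car)  # provides vif()

auto <- read.csv("Auto.csv")          # assumed file name
auto$Log.Price <- log10(auto$Price)   # log10 transform tames the skewed target

set.seed(2017)                        # fixed seed makes the split reproducible
train.idx    <- sample(nrow(auto), round(0.7 * nrow(auto)))
train.sample <- auto[train.idx, ]
valid.sample <- auto[-train.idx, ]

# Fit a full model, then eliminate variables backwards
fit <- lm(Log.Price ~ Curb.weight + Peak.rpm + Horsepower, data = train.sample)
fit <- step(fit, direction = "backward")

summary(fit)                # R-squared, adjusted R-squared, coefficient p-values
vif(fit)                    # multicollinearity via Variance Inflation Factors
plot(cooks.distance(fit))   # spot extreme, influential observations

# Evaluate on the validation set, undoing the log10 transform first
pred.price <- 10^predict(fit, newdata = valid.sample)
cor(pred.price, valid.sample$Price)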

The data for this lesson can be obtained from the well-known UCI Machine Learning archives:

The R source code for this video can be found here (some small discrepancies are possible):

Comments

Hi Sir, I see that the final model for your multiple regression after backward elimination only used two variables: Peak.rpm and Curb.weight. When testing the final model on the validation/test set, can't we just do this:
valid.sample$Pred.Price <- predict(fit, newdata = valid.sample) ?

Why did you do this instead?

valid.sample$Pred.Price <- predict(fit, newdata = subset(valid.sample, select=c(Price, Peak.rpm, Curb.weight)))

Do you mind explaining why you needed to subset the valid.sample set to just the variables that the model ended up using? Why does it matter? Thanks!

killa
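
For anyone puzzling over the same point: predict() looks up only the variables named in the fitted model's formula, so both calls should return identical predictions whenever those columns are present in newdata. A quick check, reusing the names from the comment above:

# Both calls use only the formula variables, so the results should match
p1 <- predict(fit, newdata = valid.sample)
p2 <- predict(fit, newdata = subset(valid.sample,
                                    select = c(Price, Peak.rpm, Curb.weight)))
all.equal(as.numeric(p1), as.numeric(p2))   # expected: TRUE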

Very helpful video. Thank you for posting!

allisonhaaning

Hello prof. Thank you for all of your lessons, these are really helpful. My question is: how do we do the back-transformation of the log10 for reporting requirements? And what does the model equation look like? Thank you in advance.

benediktusnugrohoadiwiyoto
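
On the back-transformation question: since the model is fitted on log10(Price), raising 10 to the predicted value returns predictions to the original scale. A minimal sketch, with names assumed from the comments above:

log.pred <- predict(fit, newdata = valid.sample)   # predictions on log10 scale
valid.sample$Pred.Price <- 10^log.pred             # back to original price scale

# The reportable equation, with b0, b1, b2 on the log10 scale, is then:
#   Price = 10^(b0 + b1 * Curb.weight + b2 * Peak.rpm)
coef(fit)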

Hi professor, thanks for the great tutorial.
Just out of curiosity, why do you use the number 2017 in set.seed()?
Many thanks

klaldju
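
On the set.seed() question: the argument is an arbitrary constant (2017 is presumably just the year); any fixed integer makes the random sampling reproducible across runs:

set.seed(2017)   # any constant works; it only pins down the random stream
sample(10, 3)    # returns the same three numbers on every run with this seed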

Hello sir, how do we check for non-linearity if the variables are factors instead of numerical?

Or do we just fit the full model and then check for linearity from that full model?

harithsyafiqhalim

Many thanks, nice video. Can you please check the link for the R source code? It is not working. Thanks.

muhammadsaleemkhan

Since this video was created, the UCI Machine Learning repository has moved to a new location, which means the web address shown in the script no longer works. However, I have updated the link to the lesson data in the video description.

ironfrown

What if, after eliminating some extreme values, the R-squared becomes smaller instead?

mohammadumam

Shouldn't it be sqrt(vif(fit)) instead?

sambad
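
On the sqrt(vif()) question: both conventions appear in practice; a common rule of thumb flags predictors with VIF above 5 or 10, while some texts use sqrt(VIF) > 2 instead. A minimal sketch, assuming the car package:

library(car)
vif(fit)              # Variance Inflation Factor per predictor
sqrt(vif(fit)) > 2    # an alternative rule of thumb for problematic collinearity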

Great video, but the link to the data is not working. I have cleaned and prepared the data to save you guys time.
The data is formatted to suit the R code in the description above.
The data file is Auto.csv, with 205 rows and 26 columns.

After importing the data into R, while imputing the NA values, please note that
auto$Num.of.doors <- as.numeric(impute(auto$Num.of.doors, median)) in the R source code did not work, because Num.of.doors is of class character. You have to change it to
auto$Num.of.doors <- as.character(impute(auto$Num.of.doors, median)) for it to work.

xymabuka
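
Building on the fix above, a minimal self-contained sketch (assuming the Hmisc package and the Auto.csv file described in the comment; note that the median is not meaningful for a character column, so imputing the most frequent value is a common alternative):

library(Hmisc)
auto <- read.csv("Auto.csv", stringsAsFactors = FALSE)

# impute() with a constant replaces NAs with that value; for a categorical
# column such as Num.of.doors, use the most frequent category (the mode)
most.common <- names(which.max(table(auto$Num.of.doors)))
auto$Num.of.doors <- as.character(impute(auto$Num.of.doors, most.common))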