Data Science Interview Questions- Multicollinearity In Linear And Logistic Regression

Please join as a member of my channel to get additional benefits like Data Science materials, live streams for members, and much more.

Please also subscribe to my other channel.

Connect with me here:

Comments

Diff 1 ---> Gradient descent takes all the data points into consideration to update the weights during backpropagation to minimize the loss, whereas stochastic gradient descent considers only one data point at a time for the weight update.


Diff 2 ---> In gradient descent, convergence towards the minima is fast, whereas in stochastic gradient descent convergence is slow.


Since in gradient descent all the data points are loaded and used for the calculation, the computation gets slow, whereas stochastic gradient descent is comparatively fast.
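A minimal Python sketch of the comparison above, assuming a linear model with an MSE loss; the data, learning rates, and epoch counts are made-up illustrative choices:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                    # 1000 samples, 3 features
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=1000)

w_gd, w_sgd = np.zeros(3), np.zeros(3)
for epoch in range(50):
    # Gradient descent: one update per epoch using ALL data points.
    grad = 2 * X.T @ (X @ w_gd - y) / len(y)
    w_gd -= 0.1 * grad
    # Stochastic gradient descent: one update per single (shuffled) data point.
    # A smaller learning rate is used here because single-sample gradients are noisy.
    for i in rng.permutation(len(y)):
        grad_i = 2 * X[i] * (X[i] @ w_sgd - y[i])
        w_sgd -= 0.01 * grad_i

print("GD estimate :", np.round(w_gd, 3))
print("SGD estimate:", np.round(w_sgd, 3))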

harshstrum

GD: Runs all the samples in the training set to do a single update of all parameters in a given iteration.
SGD: Uses only one sample (or a small subset) from the training set to update the parameters in a given iteration.
GD: If the number of samples/features is large, it takes much more time to update the values.
SGD: It is faster because only one training sample is used per update.
SGD converges faster than GD.

ShivShankarDutta

Multicollinearity may not be a problem every time. The need to fix multicollinearity depends primarily on the reasons below:

When you care more about how much each individual feature (rather than a group of features) affects the target variable, removing multicollinearity may be a good option.
If multicollinearity is not present in the features you are interested in, then multicollinearity may not be a problem.
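One quick way to see whether the features you care about are involved in multicollinearity is to inspect the pairwise correlation matrix; a small pandas sketch with made-up data and an arbitrary 0.8 threshold:

import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
x2 = 0.95 * x1 + 0.05 * rng.normal(size=500)      # nearly a copy of x1
x3 = rng.normal(size=500)                         # unrelated feature
df = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

corr = df.corr().abs()
threshold = 0.8                                   # flag only strongly related pairs
pairs = [(a, b, round(corr.loc[a, b], 3))
         for i, a in enumerate(corr.columns)
         for b in corr.columns[i + 1:]
         if corr.loc[a, b] > threshold]
print(pairs)                                      # expect only the (x1, x2) pair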

bharathjc

Hi, Krish
Gradient descent: on a big volume of data it takes a larger number of iterations, and each iteration works with the entire dataset, so it causes high latency and needs more computing power.
Solution: batch gradient descent.
Batch gradient descent: the data is split into multiple batches, gradient descent is applied to each batch separately, a separate minimum loss is reached for each batch, and finally the weight matrix with the globally minimum loss is taken.
Problem with batch gradient descent: each batch contains only a few of the patterns in the entire data, which means other patterns are missed and the model couldn't learn all the patterns from the data.
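A rough Python sketch of the splitting-into-batches idea described above (commonly called mini-batch gradient descent); the batch size, learning rate, and data are arbitrary choices:

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=1000)

w, lr, batch_size = np.zeros(3), 0.05, 32
for epoch in range(20):
    order = rng.permutation(len(y))               # reshuffle every epoch
    for start in range(0, len(y), batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(yb) # gradient on this batch only
        w -= lr * grad

print("mini-batch estimate:", np.round(w, 3))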

mahender

In batch gradient descent, you compute the gradient over the entire dataset, averaging over potentially a vast amount of information.
It takes lots of memory to do that. But the real handicap is that the batch gradient trajectory can land you in a bad spot (a saddle point).

In pure SGD, on the other hand, you update your parameters by adding (minus sign) the gradient computed on a single instance of the dataset.
Since it's based on one random data point, it's very noisy and may go off in a direction far from the batch gradient.
However, the noisiness is exactly what you want in non-convex optimization, because it helps you escape from saddle points or local minima
GD theoretically minimizes the error function better than SGD. However, SGD converges much faster once the dataset becomes large.
That means GD is preferable for small datasets, while SGD is preferable for larger ones.

bharathjc

@2:26 could you please explain what disadvantage it can cause to model performance? I mean, if I remove correlated features, will my model performance increase or stay the same?
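A quick way to explore this question empirically (a sketch with synthetic data and scikit-learn; the feature names f1/f2/f3 are made up): fit the model with and without one of two nearly identical features and compare the test scores.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n = 1000
f1 = rng.normal(size=n)
f2 = f1 + 0.01 * rng.normal(size=n)               # almost identical to f1
f3 = rng.normal(size=n)
y = 3 * f1 + 2 * f3 + rng.normal(scale=0.5, size=n)

X_full = np.column_stack([f1, f2, f3])
X_drop = np.column_stack([f1, f3])                # f2 removed
for name, X in [("all features", X_full), ("f2 dropped", X_drop)]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    model = LinearRegression().fit(X_tr, y_tr)
    print(name, "test R^2:", round(model.score(X_te, y_te), 4))
# Predictive accuracy is usually almost unchanged; what multicollinearity
# destabilises is the individual coefficients, not the overall fit.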

rahuldey

The GD algorithm uses all the data to update the weights when optimising the loss function in the backpropagation algorithm. However, SGD uses a single data sample at each iteration.

brahimaksasse

Let's assume we are using an MSE cost function.
Gradient Descent -> It takes all the points into account for computing the derivatives of the cost function w.r.t. each feature, which tells the right direction to move in. It is not efficient if we have a large number of data points.
SGD -> It computes the derivatives of the cost function w.r.t. each feature based on a single data point (or some subset of data points) and moves in that direction, treating it as the right direction. So it greatly reduces the computational complexity.

sathwickreddymora

Lasso and Ridge regression - a precondition is that there should not be multicollinearity. If we see a linear relationship between the independent variables, like the one we see between the dependent and independent variables, we call it multicollinearity, which is not the same thing as simple correlation.
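For reference, a small scikit-learn sketch (synthetic data, arbitrary alpha values) of how ordinary least squares, ridge, and lasso coefficients behave when two features are nearly collinear:

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(4)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)               # x2 is almost a copy of x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.5, size=n)

for name, model in [("OLS", LinearRegression()),
                    ("Ridge", Ridge(alpha=1.0)),
                    ("Lasso", Lasso(alpha=0.1))]:
    model.fit(X, y)
    print(name, "coefficients:", np.round(model.coef_, 3))
# OLS coefficients on x1/x2 tend to be large and unstable (they can offset
# each other); ridge shrinks them towards similar values, and lasso tends
# to push one of the pair to exactly zero.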

sridhar

Sir, kindly make all the videos on feature engineering and feature selection that are present in your GitHub link, please.

cutyoopsmoments

Sir, can we use PCA to reduce multicollinearity if we have, say, more than 200 columns?

sarveshmankar

Stochastic gradient descent is a variant where the data points are picked randomly, unlike the other type of gradient descent where the global minimum is found after training on the entire dataset.

K-mkpc

If you have a large feature space that contains multicollinearity, you could also try running a PCA and using only the first n components in your model (where n is the number of components that collectively explain at least 80% of the variance), since they are by definition orthogonal to each other.
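A sketch of that idea, assuming scikit-learn; passing a float to n_components keeps just enough components to explain that fraction of the variance:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
X = rng.normal(size=(500, 50))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=500)   # build in some collinearity

X_scaled = StandardScaler().fit_transform(X)      # PCA is scale-sensitive
pca = PCA(n_components=0.80)                      # keep >= 80% of the variance
X_reduced = pca.fit_transform(X_scaled)

print("components kept:", pca.n_components_)
print("variance explained:", pca.explained_variance_ratio_.sum().round(3))
# The retained components are orthogonal by construction, so the
# transformed features carry no multicollinearity into the model.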

DionysusEleutherios

Hi Krish... thanks for such a clear explanation. For regression problems on large datasets we have ridge and lasso. What about classification problems? How do we deal with multicollinearity on large datasets?
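One commonly used option (sketched here with scikit-learn on made-up data, not necessarily what the video recommends) is a penalised logistic regression, the classification analogue of ridge/lasso:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)
n = 1000
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)               # highly correlated pair
X = np.column_stack([x1, x2, rng.normal(size=n)])
y = (x1 + X[:, 2] + rng.normal(scale=0.5, size=n) > 0).astype(int)

l2_model = LogisticRegression(penalty="l2", C=1.0).fit(X, y)
l1_model = LogisticRegression(penalty="l1", C=1.0, solver="liblinear").fit(X, y)
print("L2 coefficients:", np.round(l2_model.coef_, 3))
print("L1 coefficients:", np.round(l1_model.coef_, 3))   # often zeroes one of x1/x2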

charlottedsouza

In addition, can you create a separate playlist for interview questions so that they are all in one place?

charlottedsouza

But how do you actually pick which feature to drop, f1 or f2?
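One common heuristic for this question (a sketch assuming statsmodels, with made-up features f1/f2/f3): compute the variance inflation factor for each feature and drop the one with the highest VIF, or the one that is less interpretable or less correlated with the target.

import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(7)
n = 500
f1 = rng.normal(size=n)
f2 = f1 + 0.05 * rng.normal(size=n)               # f2 nearly duplicates f1
f3 = rng.normal(size=n)
X = pd.DataFrame({"f1": f1, "f2": f2, "f3": f3})

vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)                                        # f1 and f2 show large VIFs; f3 stays near 1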

sebastianroubert

@krishNaik you can add the links for the lasso and ridge regularization techniques in this current video. I think that would be helpful and beneficial for both parties as well.

dragonhead

Is it recommended to remove highly negatively correlated features?

Arjungtk

That was a clear explanation... thanks Krish. Small request: can you make a video on feature selection using at least 15-20 variables, based on multicollinearity, for better understanding through practice?

ganeshprabhakaran

When you are using a small dataset and x1, x2 are highly correlated, which one do you drop?

haneulkim