Machine Learning Lecture 13 'Linear / Ridge Regression' - Cornell CS4780 SP17

Lecture Notes:
Comments

There is an error at 36:16 that leads to a wrong solution at 37:10. The sum should not be taken over P(w). If we write P(y|X, w) * P(w) as Π( P(y_i | x_i, w) * P(w) ), the prior appears once per data point; but P(w) does not depend on i, so it is a single factor outside the product: P(w) * Π P(y_i | x_i, w). Taking the log then gives log(P(w)) + sum( log( P(y_i | x_i, w) ) ). After solving, there is no "n" in the numerator in front of w^T w.

y is not normally distributed, but y|X is. That's why we write P(y|X) in the next step.

Also, in the MAP approach P(D|w) is not defined; we don't know what this pdf is. It should be P(y|X, w), which is Gaussian by our main assumption. Since D = (X, y), writing P(D|w) means we are claiming to know P(X, y|w), but we have no idea what that is. Later it is defined properly.

Another thing: to my understanding, the concept of a "slope" is meaningless for high-dimensional data; we should speak of gradients or normal vectors instead. In this case the vector w is not a slope but a normal to the hyperplane.

At 37:05, w is a vector, so P(w) is a multivariate Gaussian distribution, but a univariate one is written. Since the entries of w are i.i.d., we can write it as a product of univariate Gaussians, one per dimension. It doesn't change much, but it is rigorous, and then getting ||w||^2, i.e. the sum over all w_i, under the argmin makes sense, since we stated that w is multivariate. A moment later the professor writes w^T w, which again means that P(w) is a multivariate normal. I know I have all the time in the world to rewind this one lecture and pick up on little things, but I really like to be rigorous, and if my nitpicking can help someone I will be really happy.

prwi
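A minimal sketch of the corrected derivation described above, assuming the Gaussian likelihood P(y_i | x_i, w) = N(w^T x_i, sigma^2) and the Gaussian prior P(w) = N(0, tau^2 I); the prior contributes exactly one term to the log-posterior:

    \hat{w}_{\mathrm{MAP}} = \arg\max_w \Big[ \log P(w) + \sum_{i=1}^{n} \log P(y_i \mid x_i, w) \Big]
                           = \arg\min_w \Big[ \frac{1}{2\sigma^2} \sum_{i=1}^{n} (w^\top x_i - y_i)^2 + \frac{1}{2\tau^2}\, w^\top w \Big]
                           = \arg\min_w \Big[ \sum_{i=1}^{n} (w^\top x_i - y_i)^2 + \frac{\sigma^2}{\tau^2}\, w^\top w \Big],

so the coefficient in front of w^T w is sigma^2 / tau^2, with no factor of n, unless the prior is mistakenly placed inside the product.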

This was a fun lecture. I never knew that minimizing the squared error was equivalent to the MLE approach.

jachawkvr
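A minimal numerical sketch of that equivalence (variable names and values below are illustrative, not from the lecture): with Gaussian noise of fixed variance, minimizing the negative log-likelihood over w lands on the same w as ordinary least squares.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 200, 3
    X = rng.normal(size=(n, d))
    w_true = np.array([1.5, -2.0, 0.5])
    sigma = 0.3
    y = X @ w_true + sigma * rng.normal(size=n)      # y_i = w_true^T x_i + Gaussian noise

    # Closed-form least-squares solution: argmin_w sum_i (w^T x_i - y_i)^2
    w_ols = np.linalg.solve(X.T @ X, X.T @ y)

    # Minimize the Gaussian negative log-likelihood directly by gradient descent;
    # its gradient with respect to w is X^T (X w - y) / sigma^2 (terms in sigma alone drop out).
    w = np.zeros(d)
    for _ in range(5000):
        w -= 1e-4 * (X.T @ (X @ w - y) / sigma**2)

    print(np.allclose(w, w_ols, atol=1e-3))          # True: same minimizer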

The MAP version is different from the lecture notes. I believe the lecture notes are correct: when we take the log, since the prior and the likelihood are multiplied with each other, we get log(likelihood) + log(prior), and the summation over the likelihood terms should not affect log(prior), so lambda should not be multiplied by n.

If we do not split the likelihood and the prior and just leave it as log(likelihood x prior), we should still get something different from the MLE version, right?

ugurkap
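A small numerical sketch of that point, assuming a Gaussian prior with variance tau^2 and noise variance sigma^2 (the values below are made up): the closed-form ridge solution with lambda = sigma^2 / tau^2 minimizes the MAP objective, with no factor of n on the prior term.

    import numpy as np

    rng = np.random.default_rng(1)
    n, d = 100, 4
    X = rng.normal(size=(n, d))
    y = X @ rng.normal(size=d) + 0.5 * rng.normal(size=n)

    sigma2, tau2 = 0.25, 1.0            # assumed noise variance and prior variance
    lam = sigma2 / tau2                 # ridge coefficient from the MAP derivation (no n)

    # Closed-form ridge / MAP solution
    w_map = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

    # Negative log-posterior up to constants: squared loss / (2 sigma^2) + ||w||^2 / (2 tau^2)
    def neg_log_post(w):
        return np.sum((X @ w - y) ** 2) / (2 * sigma2) + w @ w / (2 * tau2)

    # Its gradient vanishes at w_map, and any perturbation increases the objective
    grad = X.T @ (X @ w_map - y) / sigma2 + w_map / tau2
    print(np.allclose(grad, 0.0, atol=1e-6))                                   # True
    print(neg_log_post(w_map) < neg_log_post(w_map + 0.01 * rng.normal(size=d)))  # True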

I love it "only losers maximize" ;)

trdr

So, assuming that the noise in the linear regression model is Gaussian, applying MLE we derive ordinary least squares regression, and applying MAP (with a Gaussian prior on w) we derive regularized (ridge) regression.

llll-djrn
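For reference, a sketch of the two closed-form solutions this summary refers to (standard results; here X is the n x d matrix with rows x_i^T, and lambda = sigma^2 / tau^2 comes from the Gaussian prior):

    \hat{w}_{\mathrm{MLE}} = (X^\top X)^{-1} X^\top y \qquad \text{(ordinary least squares)}
    \hat{w}_{\mathrm{MAP}} = (X^\top X + \lambda I)^{-1} X^\top y \qquad \text{(ridge regression)}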

@kilian weinberger, at around 28:45 you ask if we have any questions about this. I have one question. You use argmin_w because you want to find the w that minimizes that loss function, right? If it were a concave function, you would write it as argmax_w?

JoaoVitorBRgomes
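A one-line sketch of the answer (a general fact, not specific to this lecture): maximizing a function is the same as minimizing its negative, which is why the concave log-likelihood is usually turned into a convex negative log-likelihood and minimized:

    \arg\max_w \, \log P(D \mid w) \;=\; \arg\min_w \, \big[ -\log P(D \mid w) \big].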

OMG, the regularisation comes from the Respect 🙇 🙇

erenyeager

Logistic regression is regression in the sense that it predicts a probability, which can then be used to define a classifier.

flicker
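A minimal sketch of that idea (the weights and inputs below are made up for illustration): the model outputs P(y = 1 | x) through a sigmoid, and thresholding that probability defines the classifier.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def predict_proba(X, w):
        # The "regression" part: a probability P(y = 1 | x) for each row of X
        return sigmoid(X @ w)

    def predict_label(X, w, threshold=0.5):
        # The classifier: the predicted probability thresholded at 0.5
        return (predict_proba(X, w) >= threshold).astype(int)

    w = np.array([2.0, -1.0])                    # some fitted weights (assumed)
    X = np.array([[1.0, 0.5], [-1.0, 2.0]])
    print(predict_proba(X, w), predict_label(X, w))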

I was thinking about modeling the prediction of y given x with a Gaussian. Are these observations/reasoning steps correct?

I understand the Gaussianness comes in because there is a true linear function that perfectly models the relationship between X and Y, but it is unknown to us.
But we have data (D) that we assume comes from sampling the true distribution (P).
Now, we only have this limited sample of data, so it's reasonable to model the noise as Gaussian. This means that for a given x, the corresponding y actually belongs to a Gaussian distribution; but since we only have this "single" sample D of the true data distribution, our best bet is to predict the expectation of that Gaussian.
This results in us outputting that expectation as the final prediction (also because a good estimator of the expectation is the average, I guess).

Now, given that in the end we are just going to fit the model to the data and predict with it, why do we have to model the noise at all? Why not make it purely an optimization problem? I guess more like the DL approach.

Aesthetic_Euclides
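A small sketch of the setup described above, under the usual assumptions (a true linear function plus Gaussian noise; all values illustrative): fitting by least squares and predicting w^T x amounts to predicting the estimated conditional mean E[y | x].

    import numpy as np

    rng = np.random.default_rng(42)
    n, d = 500, 2
    w_true = np.array([2.0, -1.0])
    X = rng.normal(size=(n, d))
    y = X @ w_true + 0.5 * rng.normal(size=n)    # y | x ~ N(w_true^T x, 0.5^2)

    # Least-squares fit: approximately recovers the mean parameter of the conditional Gaussian
    w_hat = np.linalg.solve(X.T @ X, X.T @ y)

    x_new = np.array([1.0, 1.0])
    print(w_hat, x_new @ w_hat)                  # prediction = estimated E[y | x_new]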

Is there any way we can get access to the projects for this course?

saransh

Hi Kilian,
I have a doubt. Why do we assume a fixed standard deviation for the noise? Shouldn't we estimate it directly (or estimate several, if we allow the variance to be a function of x like the mean is) during the minimisation? Thank you!

FlynnCz
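A short sketch of what happens if sigma is treated as a parameter too (a standard result, assuming homoscedastic Gaussian noise and plain MLE): the estimate of w does not change, and sigma^2 gets its own closed-form estimate, the mean squared residual:

    \hat{w} = \arg\min_w \sum_{i=1}^{n} (w^\top x_i - y_i)^2 \quad (\text{independent of } \sigma), \qquad
    \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} \big( y_i - \hat{w}^\top x_i \big)^2.

(In the MAP/ridge version sigma does matter, since it enters lambda = sigma^2 / tau^2.)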

Hi Kilian,
My doubt is about how we derive the mean squared loss in the notes. We take P(x_i) to be independent of theta. But since P(X) is the marginal of P(X, Y), if the joint depends on theta, wouldn't the marginal also depend on theta? By that logic P(X = x_i) would also depend on theta.
Is it that we ASSUME P(X) to be independent of theta in the parameterized distribution, for the sake of doing discriminative learning, or is there some underlying obvious reason that I am missing?

Thank you for your lectures

arihantjha
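A short sketch of the point in question (as I read the discriminative setup in the notes): the joint is factored so that only the conditional carries theta, and the P(x_i) terms drop out of the argmax as additive constants,

    \log P(X, y \mid \theta) = \sum_{i=1}^{n} \big[ \log P(y_i \mid x_i, \theta) + \log P(x_i) \big]
    \;\Rightarrow\; \arg\max_\theta \log P(X, y \mid \theta) = \arg\max_\theta \sum_{i=1}^{n} \log P(y_i \mid x_i, \theta),

which holds exactly because the model assumes P(x_i) does not depend on theta; it is an assumption of the discriminative parameterization, not something derived.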

Hello,
Thank you for the lecture.
Why is the variance equal for all points (17:23)? Is this an assumption that we are making?

cge
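Equal variance (homoscedastic noise) is a modeling assumption here; a sketch of what changes without it (a standard result, not from this lecture): with a per-point variance sigma_i^2 the MLE becomes weighted least squares,

    \hat{w} = \arg\min_w \sum_{i=1}^{n} \frac{(w^\top x_i - y_i)^2}{\sigma_i^2}.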

Often when people derive the loss function for linear regression, they start directly from minimizing the squared error between the regression line and the points, that is, minimize sum((y - y_i)^2). Here, you start with the assumption that the y_i are Gaussian-distributed and then arrive at the same conclusion with MLE. If we call the former method 1 and the latter method 2, where is the Gaussian distribution assumption in method 1?

vatsan
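A short sketch of one way to see it (a standard correspondence, not specific to this lecture): method 1 makes the Gaussian assumption implicitly, because the squared loss is exactly the negative Gaussian log-likelihood up to constants; a different noise model would give a different loss (for example, Laplace-distributed noise gives the absolute loss):

    -\log \prod_{i=1}^{n} \mathcal{N}\!\big(y_i;\, w^\top x_i, \sigma^2\big)
        = \frac{1}{2\sigma^2} \sum_{i=1}^{n} (w^\top x_i - y_i)^2 + \mathrm{const}.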

36:22 is not clear: why is P(w) inside the log and thus inside the summation? Shouldn't it be a sum of logs of the P(y_i | x_i, w) terms plus log P(w)?
Could anyone please explain why we write P(D | w) * P(w) = Π[ P(y_i | x_i, w) * P(w) ], with P(w) INSIDE the product for every pair (y_i, x_i) from 1 to n?

dmitriimedvedev
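A brief sketch (the same point as the earlier comment about 36:16): P(w) does not depend on i, so it multiplies the product once rather than appearing inside it; up to the factor P(X), which does not depend on w,

    P(D \mid w)\, P(w) \;\propto\; P(w) \prod_{i=1}^{n} P(y_i \mid x_i, w)
    \;\Rightarrow\;
    \log \big( P(D \mid w)\, P(w) \big) = \log P(w) + \sum_{i=1}^{n} \log P(y_i \mid x_i, w) + \mathrm{const}.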

Hi Kilian,
I have a doubt: in the probabilistic perspective of linear regression, we assume that for every x_i there is a range of values for y_i,
i.e. P(y_i | x_i), where x_i is a d-dimensional vector. So, while deriving the cost function, why do we use a univariate Gaussian distribution instead of a multivariate Gaussian distribution?

PremKumar-vwrt
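A short sketch of why a univariate Gaussian is enough here: each label y_i is a single real number, and the d-dimensional x_i enters only through the scalar mean w^T x_i; the only multivariate Gaussian in this story is the prior over w used for MAP:

    P(y_i \mid x_i, w) = \mathcal{N}\!\big( y_i;\; w^\top x_i,\; \sigma^2 \big), \qquad y_i \in \mathbb{R},\; x_i \in \mathbb{R}^d.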

For Project 4 (ERM), a data_train file is given which consists of a bag-of-words matrix (X), but we don't have the labels (y) for it. What's the way around this? Can you please help?

shashankshekhar

How do we know at what point to switch to Newton's method?

saketdwivedi
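Not the lecture's answer, but a small illustrative sketch of why Newton's method is attractive once it is applicable: for a quadratic objective such as ridge regression, a single Newton step from any starting point lands exactly on the closed-form solution; the usual trade-off is that each Newton step requires a Hessian solve, whereas gradient steps are cheap.

    import numpy as np

    rng = np.random.default_rng(7)
    n, d = 50, 3
    X = rng.normal(size=(n, d))
    y = rng.normal(size=n)
    lam = 0.1

    # Ridge objective: f(w) = ||Xw - y||^2 + lam * ||w||^2, which is quadratic in w
    def grad(w):
        return 2 * X.T @ (X @ w - y) + 2 * lam * w

    H = 2 * X.T @ X + 2 * lam * np.eye(d)         # Hessian is constant for a quadratic

    w0 = rng.normal(size=d)                       # arbitrary starting point
    w_newton = w0 - np.linalg.solve(H, grad(w0))  # one full Newton step

    w_closed = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    print(np.allclose(w_newton, w_closed))        # True: one step suffices here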

But if I assume that in MAP the prior is something like a Poisson, is it going to give me the same result as MLE? Are they supposed to give the same theta/w? @Kilian Weinberger, thank you, Prof.!

JoaoVitorBRgomes
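A short sketch of the general point (standard facts, not from the lecture): MAP coincides with MLE only when the prior is flat over w; any other prior changes the regularizer, e.g. a Laplace prior yields an L1 (lasso) penalty instead of ridge's L2 penalty:

    \hat{w}_{\mathrm{MAP}} = \arg\min_w \Big[ \frac{1}{2\sigma^2} \sum_{i=1}^{n} (w^\top x_i - y_i)^2 \;-\; \log P(w) \Big],

so the answer depends entirely on the chosen P(w); also, a Poisson distribution lives on non-negative integers, so it would not be a natural prior for a continuous weight vector in the first place.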

hahahaha, exactly, statistics always try to mess things up :D:D

gregmakov