Machine Learning Lecture 13 'Linear / Ridge Regression' - Cornell CS4780 SP17

Lecture Notes:
Comments

There is an error at 36:16 that leads to a wrong solution at 37:10. The sum should not be taken over P(w). If we write P(y|X, w) * P(w) as Π( P(y_i | x_i, w) * P(w) ), the prior appears once per data point; but P(w) does not depend on i, so it is a single factor outside the product: P(w) * Π P(y_i | x_i, w). Taking the log then gives log(P(w)) + sum( log( P(y_i | x_i, w) ) ). After solving, there is no "n" in the numerator in front of w^T w.

y is not normally distributed, but y|X is. That's why we write P(y|X) in the next step.

Also, in the MAP approach P(D|w) is not defined; we don't know what this pdf is. It should be P(y|X, w), which is Gaussian by our main assumption. Since D = (X, y), writing P(D|w) means we are claiming to know P(X, y|w), but we have no idea what that is. Later it is defined properly.

Another thing: to my understanding, the concept of a "slope" is meaningless for high-dimensional data; we should speak of gradients or normal vectors instead. In this case the vector w is not a slope but a normal to the hyperplane.

At 37:05, w is a vector, so P(w) is a multivariate Gaussian distribution, but a univariate one is written. Since the entries of w are i.i.d., we can write it as a product of univariate Gaussians, one per dimension. It doesn't change much, but it is rigorous, and then getting ||w||^2, i.e. the sum over all w_i, under the argmin makes sense, since we stated that w is multivariate. A moment later the professor writes w^T w, which again means that P(w) is a multivariate normal. I know I have all the time in the world to rewind this one lecture and pick up on little things, but I really like to be rigorous, and if my nitpicking can help someone I will be really happy.

prwi
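A minimal sketch of the corrected derivation described above, assuming the Gaussian likelihood P(y_i | x_i, w) = N(w^T x_i, sigma^2) and the Gaussian prior P(w) = N(0, tau^2 I); the prior contributes exactly one term to the log-posterior:

    \hat{w}_{\mathrm{MAP}} = \arg\max_w \Big[ \log P(w) + \sum_{i=1}^{n} \log P(y_i \mid x_i, w) \Big]
                           = \arg\min_w \Big[ \frac{1}{2\sigma^2} \sum_{i=1}^{n} (w^\top x_i - y_i)^2 + \frac{1}{2\tau^2}\, w^\top w \Big]
                           = \arg\min_w \Big[ \sum_{i=1}^{n} (w^\top x_i - y_i)^2 + \frac{\sigma^2}{\tau^2}\, w^\top w \Big],

so the coefficient in front of w^T w is sigma^2 / tau^2, with no factor of n, unless the prior is mistakenly placed inside the product.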

This was a fun lecture. I never knew that minimizing the squared error was equivalent to the MLE approach.

jachawkvr
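A minimal numerical sketch of that equivalence (variable names and values below are illustrative, not from the lecture): with Gaussian noise of fixed variance, minimizing the negative log-likelihood over w lands on the same w as ordinary least squares.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 200, 3
    X = rng.normal(size=(n, d))
    w_true = np.array([1.5, -2.0, 0.5])
    sigma = 0.3
    y = X @ w_true + sigma * rng.normal(size=n)      # y_i = w_true^T x_i + Gaussian noise

    # Closed-form least-squares solution: argmin_w sum_i (w^T x_i - y_i)^2
    w_ols = np.linalg.solve(X.T @ X, X.T @ y)

    # Minimize the Gaussian negative log-likelihood directly by gradient descent;
    # its gradient with respect to w is X^T (X w - y) / sigma^2 (terms in sigma alone drop out).
    w = np.zeros(d)
    for _ in range(5000):
        w -= 1e-4 * (X.T @ (X @ w - y) / sigma**2)

    print(np.allclose(w, w_ols, atol=1e-3))          # True: same minimizer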

The MAP version is different from the lecture notes. I believe the lecture notes are correct: when we take the log, since the prior and the likelihood are multiplied with each other, we get log(likelihood) + log(prior), and the summation over the likelihood terms should not affect log(prior), so lambda should not be multiplied by n.

If we do not split the likelihood and the prior and just leave it as log(likelihood x prior), we should still get something different from the MLE version, right?

ugurkap
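A small numerical sketch of that point, assuming a Gaussian prior with variance tau^2 and noise variance sigma^2 (the values below are made up): the closed-form ridge solution with lambda = sigma^2 / tau^2 minimizes the MAP objective, with no factor of n on the prior term.

    import numpy as np

    rng = np.random.default_rng(1)
    n, d = 100, 4
    X = rng.normal(size=(n, d))
    y = X @ rng.normal(size=d) + 0.5 * rng.normal(size=n)

    sigma2, tau2 = 0.25, 1.0            # assumed noise variance and prior variance
    lam = sigma2 / tau2                 # ridge coefficient from the MAP derivation (no n)

    # Closed-form ridge / MAP solution
    w_map = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

    # Negative log-posterior up to constants: squared loss / (2 sigma^2) + ||w||^2 / (2 tau^2)
    def neg_log_post(w):
        return np.sum((X @ w - y) ** 2) / (2 * sigma2) + w @ w / (2 * tau2)

    # Its gradient vanishes at w_map, and any perturbation increases the objective
    grad = X.T @ (X @ w_map - y) / sigma2 + w_map / tau2
    print(np.allclose(grad, 0.0, atol=1e-6))                                   # True
    print(neg_log_post(w_map) < neg_log_post(w_map + 0.01 * rng.normal(size=d)))  # True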

I love it "only losers maximize" ;)

trdr

So, assuming that the noise in the linear regression model is Gaussian, applying MLE we derive ordinary least squares regression, and applying MAP (with a Gaussian prior on w) we derive regularized (ridge) regression.

llll-djrn
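For reference, a sketch of the two closed-form solutions this summary refers to (standard results; here X is the n x d matrix with rows x_i^T, and lambda = sigma^2 / tau^2 comes from the Gaussian prior):

    \hat{w}_{\mathrm{MLE}} = (X^\top X)^{-1} X^\top y \qquad \text{(ordinary least squares)}
    \hat{w}_{\mathrm{MAP}} = (X^\top X + \lambda I)^{-1} X^\top y \qquad \text{(ridge regression)}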

@kilian weinberger, at around 28:45 you ask if we have any questions about this. I have one question. You use argmin_w because you want to find the w that minimizes that loss function, right? If it were a concave function, you would write it as argmax_w?

JoaoVitorBRgomes
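A one-line sketch of the answer (a general fact, not specific to this lecture): maximizing a function is the same as minimizing its negative, which is why the concave log-likelihood is usually turned into a convex negative log-likelihood and minimized:

    \arg\max_w \, \log P(D \mid w) \;=\; \arg\min_w \, \big[ -\log P(D \mid w) \big].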

OMG, the regularisation comes from the Respect 🙇 🙇

erenyeager

Logistic regression is regression in the sense that it predicts a probability, which can then be used to define a classifier.

flicker
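A minimal sketch of that idea (the weights and inputs below are made up for illustration): the model outputs P(y = 1 | x) through a sigmoid, and thresholding that probability defines the classifier.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def predict_proba(X, w):
        # The "regression" part: a probability P(y = 1 | x) for each row of X
        return sigmoid(X @ w)

    def predict_label(X, w, threshold=0.5):
        # The classifier: the predicted probability thresholded at 0.5
        return (predict_proba(X, w) >= threshold).astype(int)

    w = np.array([2.0, -1.0])                    # some fitted weights (assumed)
    X = np.array([[1.0, 0.5], [-1.0, 2.0]])
    print(predict_proba(X, w), predict_label(X, w))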

I was thinking about modeling the prediction of y given x with a Gaussian. Are these observations/reasoning steps correct?

I understand the Gaussianness comes in because there is a true linear function that perfectly models the relationship between X and Y, but it is unknown to us.
But we have data (D) that we assume comes from sampling the true distribution (P).
Now, we only have this limited sample of data, so it's reasonable to model the noise as Gaussian. This means that for a given x, the corresponding y actually belongs to a Gaussian distribution; but since we only have this "single" sample D of the true data distribution, our best bet is to predict the expectation of that Gaussian.
This results in us outputting that expectation as the final prediction (also because a good estimator of the expectation is the average, I guess).

Now, given that in the end we are just going to fit the model to the data and predict with it, why do we have to model the noise at all? Why not make it purely an optimization problem? I guess more like the DL approach.

Aesthetic_Euclides
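A small sketch of the setup described above, under the usual assumptions (a true linear function plus Gaussian noise; all values illustrative): fitting by least squares and predicting w^T x amounts to predicting the estimated conditional mean E[y | x].

    import numpy as np

    rng = np.random.default_rng(42)
    n, d = 500, 2
    w_true = np.array([2.0, -1.0])
    X = rng.normal(size=(n, d))
    y = X @ w_true + 0.5 * rng.normal(size=n)    # y | x ~ N(w_true^T x, 0.5^2)

    # Least-squares fit: approximately recovers the mean parameter of the conditional Gaussian
    w_hat = np.linalg.solve(X.T @ X, X.T @ y)

    x_new = np.array([1.0, 1.0])
    print(w_hat, x_new @ w_hat)                  # prediction = estimated E[y | x_new]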

Is there any way we can get access to the projects for this course?

saransh

Hi Kilian,
I have a doubt. Why do we assume a fixed standard deviation for the noise? Shouldn't we estimate it directly (or estimate several, if we allow the variance to be a function of x like the mean is) during the minimisation? Thank you!

FlynnCz
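A short sketch of what happens if sigma is treated as a parameter too (a standard result, assuming homoscedastic Gaussian noise and plain MLE): the estimate of w does not change, and sigma^2 gets its own closed-form estimate, the mean squared residual:

    \hat{w} = \arg\min_w \sum_{i=1}^{n} (w^\top x_i - y_i)^2 \quad (\text{independent of } \sigma), \qquad
    \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} \big( y_i - \hat{w}^\top x_i \big)^2.

(In the MAP/ridge version sigma does matter, since it enters lambda = sigma^2 / tau^2.)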

Hi Kilian,
My doubt is about how we derive the mean squared loss in the notes. We take P(x_i) to be independent of theta. But since P(X) is the marginal of P(X, Y), if the joint depends on theta, wouldn't the marginal also depend on theta? By that logic P(X = x_i) would also depend on theta.
Is it that we ASSUME P(X) to be independent of theta in the parameterized distribution, for the sake of doing discriminative learning, or is there some underlying obvious reason that I am missing?

Thank you for your lectures

arihantjha
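A short sketch of the point in question (as I read the discriminative setup in the notes): the joint is factored so that only the conditional carries theta, and the P(x_i) terms drop out of the argmax as additive constants,

    \log P(X, y \mid \theta) = \sum_{i=1}^{n} \big[ \log P(y_i \mid x_i, \theta) + \log P(x_i) \big]
    \;\Rightarrow\; \arg\max_\theta \log P(X, y \mid \theta) = \arg\max_\theta \sum_{i=1}^{n} \log P(y_i \mid x_i, \theta),

which holds exactly because the model assumes P(x_i) does not depend on theta; it is an assumption of the discriminative parameterization, not something derived.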

Hello,
Thank you for the lecture.
Why is the variance equal for all points (17:23)? Is this an assumption that we are making?

cge
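Equal variance (homoscedastic noise) is a modeling assumption here; a sketch of what changes without it (a standard result, not from this lecture): with a per-point variance sigma_i^2 the MLE becomes weighted least squares,

    \hat{w} = \arg\min_w \sum_{i=1}^{n} \frac{(w^\top x_i - y_i)^2}{\sigma_i^2}.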

Often when people derive the loss function for linear regression, they start directly from minimizing the squared error between the regression line and the points, that is, minimize sum((y - y_i)^2). Here, you start with the assumption that the y_i are Gaussian-distributed and then arrive at the same conclusion with MLE. If we call the former method 1 and the latter method 2, where is the Gaussian distribution assumption in method 1?

vatsan
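A short sketch of one way to see it (a standard correspondence, not specific to this lecture): method 1 makes the Gaussian assumption implicitly, because the squared loss is exactly the negative Gaussian log-likelihood up to constants; a different noise model would give a different loss (for example, Laplace-distributed noise gives the absolute loss):

    -\log \prod_{i=1}^{n} \mathcal{N}\!\big(y_i;\, w^\top x_i, \sigma^2\big)
        = \frac{1}{2\sigma^2} \sum_{i=1}^{n} (w^\top x_i - y_i)^2 + \mathrm{const}.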

36:22 is not clear: why is P(w) inside the log and thus inside the summation? Shouldn't it be a sum of logs of the P(y_i | x_i, w) terms plus log P(w)?
Could anyone please explain why we write P(D | w) * P(w) = Π[ P(y_i | x_i, w) * P(w) ], with P(w) INSIDE the product for every pair (y_i, x_i) from 1 to n?

dmitriimedvedev
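A brief sketch (the same point as the earlier comment about 36:16): P(w) does not depend on i, so it multiplies the product once rather than appearing inside it; up to the factor P(X), which does not depend on w,

    P(D \mid w)\, P(w) \;\propto\; P(w) \prod_{i=1}^{n} P(y_i \mid x_i, w)
    \;\Rightarrow\;
    \log \big( P(D \mid w)\, P(w) \big) = \log P(w) + \sum_{i=1}^{n} \log P(y_i \mid x_i, w) + \mathrm{const}.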

Hi Kilian,
I have a doubt: in the probabilistic perspective of linear regression, we assume that for every x_i there is a range of values for y_i,
i.e. P(y_i | x_i), where x_i is a d-dimensional vector. So, while deriving the cost function, why do we use a univariate Gaussian distribution instead of a multivariate Gaussian distribution?

PremKumar-vwrt
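A short sketch of why a univariate Gaussian is enough here: each label y_i is a single real number, and the d-dimensional x_i enters only through the scalar mean w^T x_i; the only multivariate Gaussian in this story is the prior over w used for MAP:

    P(y_i \mid x_i, w) = \mathcal{N}\!\big( y_i;\; w^\top x_i,\; \sigma^2 \big), \qquad y_i \in \mathbb{R},\; x_i \in \mathbb{R}^d.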

For Project 4 (ERM), a data_train file is given which consists of a bag-of-words matrix (X), but we don't have the labels (y) for it. What's the way around this? Can you please help?

shashankshekhar

How do we know at what point to switch to Newton's method?

saketdwivedi
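Not the lecture's answer, but a small illustrative sketch of why Newton's method is attractive once it is applicable: for a quadratic objective such as ridge regression, a single Newton step from any starting point lands exactly on the closed-form solution; the usual trade-off is that each Newton step requires a Hessian solve, whereas gradient steps are cheap.

    import numpy as np

    rng = np.random.default_rng(7)
    n, d = 50, 3
    X = rng.normal(size=(n, d))
    y = rng.normal(size=n)
    lam = 0.1

    # Ridge objective: f(w) = ||Xw - y||^2 + lam * ||w||^2, which is quadratic in w
    def grad(w):
        return 2 * X.T @ (X @ w - y) + 2 * lam * w

    H = 2 * X.T @ X + 2 * lam * np.eye(d)         # Hessian is constant for a quadratic

    w0 = rng.normal(size=d)                       # arbitrary starting point
    w_newton = w0 - np.linalg.solve(H, grad(w0))  # one full Newton step

    w_closed = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    print(np.allclose(w_newton, w_closed))        # True: one step suffices here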

But if I assume that in MAP the prior is something like a Poisson, is it going to give me the same result as MLE? Are they supposed to give the same theta/w? @Kilian Weinberger, thank you, Prof.!

JoaoVitorBRgomes
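A short sketch of the general point (standard facts, not from the lecture): MAP coincides with MLE only when the prior is flat over w; any other prior changes the regularizer, e.g. a Laplace prior yields an L1 (lasso) penalty instead of ridge's L2 penalty:

    \hat{w}_{\mathrm{MAP}} = \arg\min_w \Big[ \frac{1}{2\sigma^2} \sum_{i=1}^{n} (w^\top x_i - y_i)^2 \;-\; \log P(w) \Big],

so the answer depends entirely on the chosen P(w); also, a Poisson distribution lives on non-negative integers, so it would not be a natural prior for a continuous weight vector in the first place.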

hahahaha, exactly, statistics always try to mess things up :D:D

gregmakov