Expectation Maximization Algorithm | Intuition & General Derivation


Maximum Likelihood Estimation is a great starting point for fitting the parameters of a model when you only have access to data. However, it breaks down once your model contains latent random variables, i.e., nodes for which you do not observe any data. A remedy is to work with the marginal likelihood instead of the full likelihood, but this approach leads to some difficulties that we have to overcome.
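
As a quick sketch of what "marginal likelihood" means here (generic notation, not necessarily the symbols used in the video; D stands for the observed data, the "words", and T for the latent variables, the "thoughts"): the joint likelihood is summed over every configuration of the latent variables,

\[
p(D \mid \theta) = \sum_{T} p(D, T \mid \theta),
\qquad
\log p(D \mid \theta) = \log \sum_{T} p(D, T \mid \theta).
\]

The log of a sum is what makes this quantity hard to maximize directly.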

In this video, I show how to derive a lower bound on the marginal log-likelihood, including all the necessary tricks like importance sampling and Jensen's inequality. We then end up with a chicken-and-egg problem: we need the distribution's parameters to perform the estimate, but we also need the estimate to update the parameters. Consequently, we have to resort to an iterative algorithm, which consists of the E-Step (Expectation) and the M-Step (Maximization).
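
A hedged sketch of the derivation (the notation in the video may differ): multiply and divide by an arbitrary distribution q(T) over the latent variables (the importance-sampling trick), read the sum as an expectation under q, and apply Jensen's inequality to the concave logarithm,

\[
\log p(D \mid \theta)
= \log \sum_{T} q(T)\,\frac{p(D, T \mid \theta)}{q(T)}
\;\ge\; \sum_{T} q(T)\,\log \frac{p(D, T \mid \theta)}{q(T)}.
\]

The right-hand side is the lower bound that the E-Step and M-Step then optimize alternately.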

An important remark is that the derivations I present here are just a framework. For each application scenario, for instance Gaussian Mixture Models, the maximization has to be carried out anew for that specific model in order to end up with simple update equations.
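
To make the framework concrete, here is a minimal Python/NumPy sketch of EM for a one-dimensional Gaussian Mixture Model. It is an illustration only; the model choice, function name, and initialization are my own assumptions and are not taken from the video.

import numpy as np

def em_gmm_1d(x, n_components=2, n_iters=50, seed=0):
    """Minimal EM for a 1-D Gaussian mixture (illustrative sketch only)."""
    rng = np.random.default_rng(seed)
    n = len(x)
    # Initialize mixture weights, means, and variances.
    weights = np.full(n_components, 1.0 / n_components)
    means = rng.choice(x, size=n_components, replace=False)
    variances = np.full(n_components, np.var(x))

    for _ in range(n_iters):
        # E-Step: responsibilities resp[i, k] = p(component k | x_i, old parameters).
        log_joint = (
            np.log(weights)
            - 0.5 * np.log(2.0 * np.pi * variances)
            - 0.5 * (x[:, None] - means) ** 2 / variances
        )
        log_marginal = np.logaddexp.reduce(log_joint, axis=1, keepdims=True)
        resp = np.exp(log_joint - log_marginal)

        # M-Step: closed-form updates that reuse the fixed responsibilities.
        nk = resp.sum(axis=0)
        weights = nk / n
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk

    return weights, means, variances

# Example usage on synthetic data drawn from two Gaussians.
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.5, 200)])
print(em_gmm_1d(data))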

-------
Info on why the Expectation Maximization algorithm does not work for the Bernoulli-Bernoulli model:

[TODO] I will work on a video on this, stay tuned ;)

-------


Timestamps:
00:00 Introduction
00:48 Latent means missing data
02:15 How to define the Likelihood?
02:55 Marginal Likelihood
05:05 Disclaimer: It will not work
05:48 Marginal Likelihood (cont.)
06:15 Marginal Log-Likelihood
08:11 Importance Sampling Trick
11:31 Jensen's Inequality
13:03 A lower bound (error, see comments below)
15:23 The Posterior over the latent variables
16:20 A lower bound (cont.) (error, see comments below)
17:56 The Chicken-Egg Problem
20:18 Old and new parameters
21:55 The Maximization Procedure
22:56 A simplified upper bound
25:04 Responsibilities
25:46 The EM Algorithm
28:28 An MLE under missing data
29:07 Outro
Comments

Error at 13:20: It is a lower bound, not an upper bound. Maximizing an upper bound is not meaningful. See also @Flemming's comment for more details.

MachineLearningSimulation

Very well produced video! But log is concave, so you flipped the sign/direction of Jensen's inequality. In other words, you are finding a lower bound on the log-likelihood. BTW, that is in fact arguably desirable, as maximizing a lower bound is informative while maximizing an upper bound is not. Maybe that should be clarified for people learning this stuff.

flemming
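
For reference, the direction of the inequality both comments above refer to: because the logarithm is concave, Jensen's inequality reads

\[
\log \mathbb{E}_{q}\!\left[X\right] \;\ge\; \mathbb{E}_{q}\!\left[\log X\right],
\]

so the quantity derived in the video is a lower bound on the marginal log-likelihood, not an upper bound.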

3:50: Theta bar has two components, right? (You said three components.)

pravingaikwad

Amazing, lovely video. Great job. I feel a bit unlucky that I did not come across your channel earlier.

todianmishtaku

Wow, I am still amazed how EM works. It’s really brilliant probability.

orjihvy

Hi Felix, this is a nice video on EM, thanks for that. One question: I don't clearly understand why we have to take the posterior as q(T). Why not something else? Why does only the posterior suit q(T)?

imvijay
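
A short note on the question above (generic notation): the posterior is the choice that makes the lower bound tight. For any distribution q(T),

\[
\log p(D \mid \theta)
= \sum_{T} q(T)\,\log \frac{p(D, T \mid \theta)}{q(T)}
\;+\; \mathrm{KL}\big(q(T)\,\|\,p(T \mid D, \theta)\big),
\]

and the KL divergence vanishes exactly when q(T) = p(T | D, theta), so only then does the bound touch the true marginal log-likelihood.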

The video explains the algorithm formally and in a very clear way. My question is: what if we have a mix of missing data, i.e., some missing Words and some missing Thoughts?

lucavisconti

10:27: I don't think this is right. The summation is over the whole (q * p/q); we cannot conveniently apply the summation to q alone.

ananthakrishnank
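
Regarding the step questioned at 10:27 (generic notation, assuming q(T) > 0 wherever the joint is nonzero): nothing is pulled out of the summation; the joint is merely multiplied and divided by q(T) inside the sum, which can then be read as an expectation under q,

\[
\sum_{T} p(D, T \mid \theta)
= \sum_{T} q(T)\,\frac{p(D, T \mid \theta)}{q(T)}
= \mathbb{E}_{q(T)}\!\left[\frac{p(D, T \mid \theta)}{q(T)}\right].
\]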

The video is of high quality.

It would be highly appreciated if the summation symbol were written as just Σ. It is a bit confusing when I look at your handwritten summation symbol; I thought it was summing from 1 to 1 (haha). But this confusion does not degrade your video quality.

Thanks

ryanyu

Hi there! Can you please explain why we have a parameter vector for 'words' but just a single parameter for 'thoughts'?

Thanks in advance!

kartikkamboj

I think I see why theta_k is associated with the responsibilities, instead of theta_{k+1}.

orjihvy
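
For context (generic notation): in the E-Step the responsibilities are computed from the old parameters theta_k, and the M-Step then maximizes over the new parameters while keeping those responsibilities fixed,

\[
r(T) = p(T \mid D, \theta_k),
\qquad
\theta_{k+1} = \arg\max_{\theta} \sum_{T} r(T)\,\log p(D, T \mid \theta).
\]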

The only one that made me understand this evil trick (E_q[P/q] = Σ q * P/q).
Thank you!

EngRiadAlmadani