Introduction to KL-Divergence | Simple Example | with usage in TensorFlow Probability

The KL-divergence is especially relevant when we want to fit one distribution to another. It has many applications in probabilistic machine learning and statistics. In a later video, we will use it to derive Variational Inference, a powerful tool for fitting surrogate posterior distributions.
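For discrete distributions, the definition D(p‖q) = Σₓ p(x) log(p(x)/q(x)) can be sketched in a few lines of plain Python (an illustrative sketch, not the video's code; the example probabilities are made up):

```python
import math

def kl_divergence(p, q):
    """Discrete KL-divergence D(p || q) = sum over x of p(x) * log(p(x) / q(x)).

    Terms with p(x) = 0 contribute nothing, since lim t->0 of t*log(t) = 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

print(kl_divergence(p, q))   # small positive value: q is close to p but not equal
print(kl_divergence(p, p))   # 0.0: a distribution has zero divergence from itself
```

TensorFlow Probability also provides a `tfp.distributions.kl_divergence` helper that returns closed-form results for supported pairs of named distributions, which is what the video's TFP segment builds on.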


Timestamps:
0:00 Opening
0:15 Intuition
3:21 Definition
5:28 Example
13:29 TensorFlow Probability
Comments

Many thanks for such clear illustrations!

ShunLu-vi

this channel needs more attention, such good content :D

yongen

Such a great channel! Sharing this with my friends and colleagues.

maartendevries

Great video! Best explanation of this topic on YouTube, thank you!

vincentwolfgramm-russell

When p and q do not overlap (or when q is 0), how does this affect D(p||q)?
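That edge case can be checked numerically. In a quick plain-Python sketch (not from the video), whenever p puts mass where q puts none, the term p(x)·log(p(x)/q(x)) blows up and D(p‖q) is infinite:

```python
import math

def kl_divergence(p, q):
    # Terms with p(x) = 0 contribute 0 (lim t->0 of t*log(t) = 0);
    # any term with p(x) > 0 and q(x) = 0 makes the divergence infinite.
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue
        if qi == 0:
            return math.inf
        total += pi * math.log(pi / qi)
    return total

p = [0.5, 0.5, 0.0]
q = [0.0, 0.5, 0.5]   # q assigns zero probability where p has mass 0.5

print(kl_divergence(p, q))  # inf
```

This is one reason KL(p‖q) is not a metric: it is asymmetric, and it diverges whenever the support of q fails to cover the support of p.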

InquilineKea

Potentially silly question... Since the KL-divergence deals with distances between distributions: if we are building a model that forecasts a distribution, whether via a distributional lambda layer or a regular dense multi-output head producing a mu/sigma, and we train it with a negative log-likelihood loss, that setup is basically forecasting the mu but making it more Bayesian, if you will, in that it produces a distribution by also giving a sigma, which acts somewhat like a confidence interval. Of course, a dense network would give the same distribution for the same inputs, whereas if we use DenseVariational and other probabilistic layers, the weights and biases are drawn from distributions at inference time, meaning our output can vary slightly every time we draw a prediction...

However, what I am wondering is this: what if, a priori, we have distribution(s) as inputs to the model? Rather than just a value we want to forecast and call mu, we have a fully defined distribution in the training set. Are we then back to a multiple-output regression where we'd literally just want to minimize an MSE on the model learning those distribution parameters outright, or would there be something more meaningful if KL divergence is used as the loss function?

I'm not even sure if this makes sense... but I do think it makes sense in that: how can it be logical to optimize a distribution using a normal-style loss function, which would essentially just minimize a Euclidean distance between the mu and sigma (or whatever parameters the distribution has)? It doesn't seem right to treat the mean and standard deviation that way, because you can't just linearly scale both toward the goal. I feel like there must be something deeper going on. Obviously a mu that's further from reality can be "saved" to an extent by allowing a larger standard deviation...
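The closed-form KL-divergence between two univariate Gaussians makes that trade-off concrete. A plain-Python sketch (not from the video; the example numbers are made up): widening the predicted sigma can partially compensate for an off-center mu.

```python
import math

def kl_gauss(mu1, sigma1, mu2, sigma2):
    """Closed-form KL( N(mu1, sigma1^2) || N(mu2, sigma2^2) )."""
    return (math.log(sigma2 / sigma1)
            + (sigma1**2 + (mu1 - mu2)**2) / (2 * sigma2**2)
            - 0.5)

# Target distribution: N(0, 1). Two candidate predictions, both with the wrong mean:
tight = kl_gauss(0.0, 1.0, 2.0, 1.0)   # confident (sigma = 1) but off-center
wide  = kl_gauss(0.0, 1.0, 2.0, 3.0)   # same wrong mean, larger sigma = 3

print(tight, wide)  # the wider prediction incurs a smaller divergence here
```

Unlike a plain MSE on (mu, sigma), this penalty couples the two parameters: the cost of a mean error is scaled by the predicted variance, which is exactly the "deeper something" the comment gestures at.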

And obviously I could imagine taking that last concept of having actual distributions and outputting the mu and sigma each as their own separate distributions... though now I'm not sure how muddy the water is starting to get. Then I start wondering about the potential for a model to output parameters for a multivariate distribution, and the implications of that versus treating them jointly.

I can mention that my motivation for exploring these topics is building predictive models around non-stationary data that "randomly" has stationary segments through time. So there can be a long run of sequential inferences where the desired standard-deviation output wouldn't change much, until it does... And at the same time, it's not as simple as equating the standard deviation to confidence in how accurate the mu prediction will be, because one desired trait of the models I'm building is, beyond being as accurate as possible, that they don't constantly make a new prediction and shift the values, if you will... to the extent that the longer the same prediction remains valid going forward without needing an update, the more it should be rewarded. Imagine a box through time, with a process living mostly within said box (it doesn't have to be perfect), and then jumps occur where the box shifts... which doesn't necessarily imply a shift in the standard deviation, though.

So obviously I'm dealing with modeling volatility... and really, one of the BIGGEST things I'm searching for now is what kind of component, and in what way, I can add to a model so that it learns the concept of a jump, and doesn't just go nuts on the confidence interval around the mu prediction. Of course the last and most important step is engineering a custom loss function... I've been trying to conceive how something like OU-process logic could be used to create a framework describing how model outputs should behave based on how the underlying process is behaving. I've also been down the rabbit hole of using a copula to describe how things should be related to each other, and even adding trainable variables and branches to a network, with extra outputs representing how correlations vary as time progresses.

Just 3 weeks ago I knew very little about machine learning... and 4 weeks ago I knew little about stochastic processes... but I've taken the deep dive, and now here we are. And honestly, after finding TensorFlow Probability, I feel like a kid in a candy store lol. The possibilities seem endless, and the amount of power at your fingertips is insane... So many ways to build models, engineer data flows through a model, define custom trainable variables, and write custom loss functions... it's literally an entire framework to play god. One thing I will mention, though, is that I am approaching things from the perspective of hacking the loss function and using extra data (training data used to calculate the loss, but not the actual desired target variables) rather than going the reinforcement-learning route. I think the only thing that matters is that the loss function is differentiable; then backprop is possible, which means SGD or something similar can be used, since we have either a differentiable reward function or a differentiable loss function. I'm still new to all of this, but not stupid, and would like to assume I have intuition for things lol.

chadgregory