DDPM - Diffusion Models Beat GANs on Image Synthesis (Machine Learning Research Paper Explained)

#ddpm #diffusionmodels #openai

GANs have dominated the image generation space for the majority of the last decade. This paper shows for the first time how a non-GAN model, a DDPM, can be improved to overtake GANs on standard evaluation metrics for image generation. The produced samples look amazing, and, unlike GANs, the new model has a formal probabilistic foundation. Is there a future for GANs, or are Diffusion Models going to overtake them for good?

OUTLINE:
0:00 - Intro & Overview
4:10 - Denoising Diffusion Probabilistic Models
11:30 - Formal derivation of the training loss
23:00 - Training in practice
27:55 - Learning the covariance
31:25 - Improving the noise schedule
33:35 - Reducing the loss gradient noise
40:35 - Classifier guidance
52:50 - Experimental Results

Abstract:
We show that diffusion models can achieve image sample quality superior to the current state-of-the-art generative models. We achieve this on unconditional image synthesis by finding a better architecture through a series of ablations. For conditional image synthesis, we further improve sample quality with classifier guidance: a simple, compute-efficient method for trading off diversity for sample quality using gradients from a classifier. We achieve an FID of 2.97 on ImageNet 128×128, 4.59 on ImageNet 256×256, and 7.72 on ImageNet 512×512, and we match BigGAN-deep even with as few as 25 forward passes per sample, all while maintaining better coverage of the distribution. Finally, we find that classifier guidance combines well with upsampling diffusion models, further improving FID to 3.85 on ImageNet 512×512. We release our code at this https URL
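
The classifier guidance mentioned in the abstract shifts the mean of each reverse (denoising) step by the classifier's gradient, scaled by the step's variance. A toy sketch of just that mean shift: the quadratic log p(y|x) below is a hypothetical stand-in for a real classifier trained on noisy images; only the mean-shift formula itself comes from the paper.

```python
import numpy as np

def grad_log_classifier(x, target=np.array([1.0, -1.0])):
    """Hypothetical stand-in for grad_x log p(y|x): the gradient of a
    toy quadratic log-density -0.5 * ||x - target||^2."""
    return -(x - target)

def guided_mean(mu, sigma2, x, scale=1.0):
    """Classifier guidance: mu + scale * Sigma * grad_x log p(y|x),
    where mu and sigma2 are the unguided reverse-step mean and variance."""
    return mu + scale * sigma2 * grad_log_classifier(x)

mu = np.zeros(2)            # unguided mean from the diffusion model
x_t = np.array([0.5, 0.5])  # current noisy sample
print(guided_mean(mu, 0.1, x_t, scale=5.0))  # prints [ 0.25 -0.75]
```

A larger guidance scale pushes samples harder toward the class, trading diversity for fidelity, which is exactly the trade-off the abstract describes.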

Authors: Prafulla Dhariwal, Alex Nichol


If you want to support me, the best thing to do is to share out the content :)

If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
Comments
YannicKilcher

My boyfriend wrote these papers. Go Alex Nichol!

SamanthaTries

Summary: self-supervised learning. Given a dataset of good images, keep adding Gaussian noise to them to create sequences of increasingly noisy images. Let the network learn to denoise images based on that. Then the network can "denoise" pure Gaussian noise into realistic pictures.

To do: learn a latent space (like a VAE-GAN does) so that it can smoothly interpolate between generated pictures and create nightmare art.

CosmiaNebula
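
The forward corruption described in this summary can be sketched directly. A minimal NumPy version, assuming a simple linear β schedule and a flat array standing in for an image; real implementations use the paper's schedules and batched image tensors.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_noising(x0, betas):
    """Build the sequence x_1..x_T by repeatedly applying the forward step
    x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * fresh Gaussian noise."""
    xs, x = [], x0
    for beta in betas:
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.standard_normal(x.shape)
        xs.append(x)
    return xs

x0 = np.ones((64, 64))                 # stand-in "image"
betas = np.linspace(1e-4, 0.02, 1000)  # toy linear noise schedule
seq = forward_noising(x0, betas)
# After enough steps, x_T is approximately standard Gaussian noise:
print(round(float(seq[-1].std()), 1))
```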

That notation \mathcal{N}(x_t; \sqrt{1-\beta_t}\,x_{t-1}, \beta_t \mathbf{I}) sets my teeth on edge. Doing this with P, a general PDF, is fine, but I would always write x_t ~ \mathcal{N}(\sqrt{1-\beta_t}\,x_{t-1}, \beta_t \mathbf{I}), since \mathcal{N} is the Gaussian _distribution_ with a defined parameterization. BTW, the reason for the \sqrt{1-\beta_t} factor is to keep the energy of x_t approximately the same as the energy of x_{t-1}; otherwise, the image would explode to a variance of T\beta after T iterations. It's probably a good idea to keep the neural network inputs in about the same range every time.

scottmiller
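
The energy argument in the comment above is easy to check numerically: with the \sqrt{1-\beta} factor the per-pixel variance stays near 1, and without it the variance grows to roughly 1 + Tβ. A quick Monte-Carlo sketch (not from the paper, constant β for simplicity):

```python
import numpy as np

rng = np.random.default_rng(1)
T, beta, n = 500, 0.02, 100_000
x_scaled = rng.standard_normal(n)  # starts at unit variance
x_plain = x_scaled.copy()

for _ in range(T):
    noise = rng.standard_normal(n)
    x_scaled = np.sqrt(1 - beta) * x_scaled + np.sqrt(beta) * noise  # DDPM forward step
    x_plain = x_plain + np.sqrt(beta) * noise                        # no shrink factor

print(round(float(x_scaled.var()), 1))  # stays near 1
print(round(float(x_plain.var()), 1))   # blows up to about 1 + T*beta = 11
```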

Thanks a lot for the thorough explanation!

It's helping me figure out a topic for my master's degree.

Much much appreciated ^^

ahmedalshenoudy

Yannic, thanks for the video. The audio is a little soft even at max volume (unless I'm wearing my headphones). Is it possible to make it a bit louder?

linminhtoo

Historic video! Fun to see it now and compare it to the current state of image generation. I’ll check it again in two years to see how far we’ve got.

pedrogorilla

18:46 I guess it's very likely related to Shannon's sampling theorem: reconstructing the data distribution by sampling with the well-defined normal distribution. The number of time steps and β are closely related to the bandwidth of the data distribution.

binjianxin

Love it!! It's called the "number line" in English. Keep up the great work

andrewcarr

Can you please make a video about SNNs and the latest research on them?

MrBOB-hjjq

There is step-wise generation in GANs too, not based on steps from noise to image, but based on the size of the image, like in Pro-GAN and MSG-GAN. In these models you have discriminators for different sizes of the image, kind of.

proinn

This makes me think that instead of super-resolution from a lower-res image, it could be even more effective to store a sparse pixel array (with high-res positioning). You could even have another net learn a way of choosing, e.g., which 1000 pixels of a high-res image to store (the pixels providing the most information for reconstruction).

JamesAwokeKnowing

Great video! I was surprised to see this after the latest paper just a few days back! Thanks for the great explanations!

sshatabda

Any results (images) from generative models should be accompanied by the nearest neighbor (VGG latent, etc.) from the training dataset. I am going to train it on MNIST 🏋

bgjunge

Just amazing. I guess I might have spent another whole day reading this paper if I had missed your video. Grateful!

impromptu

Another question. If the network is predicting the noise added to a noisy image, what do you then do with that prediction? Subtract it from the noisy image? Do you then run it back through the network to again predict noise?

When you train this network, do you train it to only predict the small amount of noise added to the image between the forward process steps? Or does it try to predict all the noise added to the image from that point?

Or maybe it's more like the forward process? Starting with latent x_T as input to the network, the network gives you an "image" that it thinks is on the manifold (x_{T-1}). At this point, it most likely isn't, but you can move 1/T towards it, like we did moving towards the Gaussian noise to get to x_T. Then repeat...?


More examples and less math always helps...

easyBob
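
Re the question above: the network predicts the total noise ε in x_t, but at sampling time only a t-dependent fraction of it is removed per step, a little fresh noise is re-added, and this repeats for all T steps. A hedged sketch of that reverse loop; the ε-predictor here is a placeholder lambda, not a trained U-Net.

```python
import numpy as np

rng = np.random.default_rng(2)

def ddpm_sample(eps_model, betas, shape):
    """Ancestral sampling: start from pure noise x_T, then repeat
    x_{t-1} = (x_t - beta_t / sqrt(1 - abar_t) * eps_theta(x_t, t)) / sqrt(alpha_t)
              + sqrt(beta_t) * z,  with z ~ N(0, I) and no added noise at t = 0."""
    alphas = 1.0 - betas
    abars = np.cumprod(alphas)          # abar_t = prod_{s<=t} alpha_s
    x = rng.standard_normal(shape)      # x_T: pure Gaussian noise
    for t in reversed(range(len(betas))):
        eps = eps_model(x, t)
        x = (x - betas[t] / np.sqrt(1.0 - abars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            x = x + np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x

# Placeholder predictor (a real model is a trained U-Net conditioned on t):
eps_model = lambda x, t: x
betas = np.linspace(1e-4, 0.02, 100)
sample = ddpm_sample(eps_model, betas, (8, 8))
print(sample.shape)  # (8, 8)
```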

I would say that the sqrt(1-β) is used to converge to N(0, σ), mainly in its mean; otherwise adding Gaussian noise would just (in expectation) keep x_0 as the mean instead of 0.

bertobertoberto

I've only listened to 11 minutes so far, but DDPMs remind me a lot of compressed (or compressive) sensing...

stephanebeauregard

This is me being lazy and not looking it up, but if they predict the noise instead of the image, do they actually get the image by iteratively subtracting the predicted noise from the noisy image until they get a clean one?

CristianGarcia
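
Re the question above: essentially yes for sampling, but training never walks the chain step by step. The network is trained to predict the full noise ε at an arbitrary step t via the closed-form jump x_t = sqrt(abar_t) x_0 + sqrt(1 - abar_t) ε. A minimal sketch of building one training example; the zero "prediction" is a placeholder for a real network output.

```python
import numpy as np

rng = np.random.default_rng(3)
betas = np.linspace(1e-4, 0.02, 1000)
abars = np.cumprod(1.0 - betas)  # abar_t = prod_{s<=t} (1 - beta_s)

def training_pair(x0, t):
    """Jump straight to step t in closed form; the regression target is eps."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(abars[t]) * x0 + np.sqrt(1.0 - abars[t]) * eps
    return x_t, eps

x0 = rng.standard_normal((8, 8))   # stand-in clean image
x_t, eps = training_pair(x0, t=500)

pred = np.zeros_like(eps)          # placeholder for eps_theta(x_t, t)
loss = float(np.mean((pred - eps) ** 2))  # the "simple" MSE training loss
print(loss > 0.0)  # True
```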

16:55 denoising depends on the entire data distribution because adding random noise in one step can be done independently of all previous steps; just add a bit of noise wherever you like. But removing noise (the reverse) has to assume noise was added over some number of previous steps. Thus, in the example of denoising a small child's drawing, it's not that we're removing ALL the noise. Instead, the dependence problem arises in simply taking a single step towards a denoised picture.

Can anyone clarify/confirm?

austin