NVAE: A Deep Hierarchical Variational Autoencoder (Paper Explained)

VAEs have traditionally been hard to train at high resolutions and unstable when going deep with many layers. In addition, VAE samples are often blurrier and less crisp than those from GANs. This paper details all the engineering choices necessary to successfully train a deep hierarchical VAE that exhibits global consistency and astounding sharpness at high resolutions.

OUTLINE:
0:00 - Intro & Overview
1:55 - Variational Autoencoders
8:25 - Hierarchical VAE Decoder
12:45 - Output Samples
15:00 - Hierarchical VAE Encoder
17:20 - Engineering Decisions
22:10 - KL from Deltas
26:40 - Experimental Results
28:40 - Appendix
33:00 - Conclusion

Abstract:
Normalizing flows, autoregressive models, variational autoencoders (VAEs), and deep energy-based models are among competing likelihood-based frameworks for deep generative learning. Among them, VAEs have the advantage of fast and tractable sampling and easy-to-access encoding networks. However, they are currently outperformed by other models such as normalizing flows and autoregressive models. While the majority of the research in VAEs is focused on the statistical challenges, we explore the orthogonal direction of carefully designing neural architectures for hierarchical VAEs. We propose Nouveau VAE (NVAE), a deep hierarchical VAE built for image generation using depth-wise separable convolutions and batch normalization. NVAE is equipped with a residual parameterization of Normal distributions and its training is stabilized by spectral regularization. We show that NVAE achieves state-of-the-art results among non-autoregressive likelihood-based models on the MNIST, CIFAR-10, and CelebA HQ datasets and it provides a strong baseline on FFHQ. For example, on CIFAR-10, NVAE pushes the state-of-the-art from 2.98 to 2.91 bits per dimension, and it produces high-quality images on CelebA HQ as shown in Fig. 1. To the best of our knowledge, NVAE is the first successful VAE applied to natural images as large as 256×256 pixels.
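
The "residual parameterization of Normal distributions" mentioned in the abstract (the "KL from Deltas" segment in the outline) has a simple closed form. Below is a minimal PyTorch sketch, with illustrative names of my own rather than the authors' code, of how the KL term simplifies when each level's approximate posterior is defined by deltas relative to the prior:

```python
import torch

def residual_normal_kl(delta_mu, delta_log_sigma, sigma_p):
    """KL(q || p) per dimension when q is parameterized relative to the
    prior p = N(mu_p, sigma_p) via deltas:
        mu_q = mu_p + delta_mu,  sigma_q = sigma_p * exp(delta_log_sigma)
    The mu_p terms cancel, leaving only the deltas and the prior scale.
    """
    delta_sigma_sq = torch.exp(2.0 * delta_log_sigma)
    return 0.5 * (delta_mu.pow(2) / sigma_p.pow(2)
                  + delta_sigma_sq
                  - 2.0 * delta_log_sigma
                  - 1.0)
```

Note that zero deltas give exactly zero KL, so each latent level only has to encode what it adds on top of the prior.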

Authors: Arash Vahdat, Jan Kautz

Comments:

As a new PhD student in this field, I literally cannot thank you enough for making this content!

renehaas

You (or whoever is narrating) are really good at describing things clearly and exactly.

oreganorx

Thank you so much for making this video! It's so hard to find friendly content when you really start digging into these topics, and this is an absolute lifesaver!

mrcrazysalad

Great talk. Thanks for taking the time to read through this. The heavier linear algebra can be a bit daunting without a maths background. You help make it a bit more digestible.

johnpope

One thing missing from the explanation is why you prefer a distribution over latent codes: it gives you a continuous, smooth latent space in which you can sample new (unseen) interpolated latent codes that are still valid. This is what gives a VAE its generative power.

MrAlextorex
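
As a toy illustration of that point (a sketch with a hypothetical `decoder`, not code from the paper): because nearby latent codes decode to similar valid images, you can interpolate between two encodings and get plausible in-between samples.

```python
import torch

def interpolate_latents(decoder, z_a, z_b, steps=8):
    # Linear interpolation in latent space; a smooth latent space means
    # every intermediate z should decode to a valid image.
    images = []
    for t in torch.linspace(0.0, 1.0, steps):
        z = (1.0 - t) * z_a + t * z_b
        images.append(decoder(z))
    return images
```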

Awesome explanation. Great job explaining VAEs without the ELBO. This is a cool and conceptually simple way of building hierarchical VAEs (unlike, say, BIVA, which is a nightmare).

kazz
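
For reference, here is the ELBO that the video sidesteps, as a minimal sketch assuming a diagonal Gaussian posterior and a standard-Normal prior (the names are illustrative, not from the paper):

```python
import torch

def elbo(log_px_given_z, mu_q, log_sigma_q):
    # ELBO = E_q[log p(x|z)] - KL(q(z|x) || N(0, I)).
    # The KL term has the usual closed form for a diagonal Gaussian posterior.
    kl = 0.5 * (mu_q.pow(2) + torch.exp(2.0 * log_sigma_q)
                - 2.0 * log_sigma_q - 1.0).sum(dim=-1)
    return log_px_given_z - kl
```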

I legit expected a 'Dear fellow scholars' at the start! Lol.

mathematicalninja

Super awesome paper review! Many thanks.

alexijohansen

I was wrong - you CAN get details in (hierarchical) VAEs. I, too, was struck by the smoothness and "cutout"-like character of these faces. It seems like it handled lighting very differently, especially small-scale skin texture and oily skin shine. I suppose it would have been more realistic with more z levels and less upconversion at each stage, but from the number of tricks being played it looks like NVIDIA was already struggling to make it fit in memory, and perhaps to converge.

scottmiller

We need an ACNE dataset with no smooth faces to test the true power of these generative methods.

TheParkitny

I think what the decoder outputs is also a distribution rather than an image directly. The reconstruction error is then the likelihood of the input image X under the decoder's distribution, and you have to sample from it to get different images from the same latent code. The blurriness doesn't come from this sampling, but from the assumption that the image pixels are independent. VQ-VAE-2 amends this by replacing the fixed unit-Normal prior with a learnable autoregressive PixelCNN prior and by using a multiscale encoder/decoder pair.

binjianxin
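
A sketch of the reconstruction term described in the comment above, assuming an independent Normal per pixel (illustrative names and shapes): the decoder emits per-pixel parameters, and the loss is the log-likelihood of the input under that distribution. The per-pixel independence is exactly what pulls the means toward an average and makes classic VAE samples blurry.

```python
import torch

def reconstruction_log_likelihood(mu_x, x, sigma=0.1):
    # mu_x, x: (N, C, H, W). Independent Normal per pixel; the total
    # log-likelihood is the sum of per-pixel log-probs.
    dist = torch.distributions.Normal(mu_x, sigma)
    return dist.log_prob(x).sum(dim=[1, 2, 3])
```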

Can you do a video explaining normalizing flows?

rahuldeora

I wonder how VQ-VAEs compare - they are much simpler conceptually and practically and seem to address the same issues. You mentioned them briefly in the Jukebox video but they are probably worth their own video.

glennkroegel
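
For contrast, the core VQ-VAE step really is conceptually simple: snap each encoder output to its nearest codebook vector. A toy sketch (illustrative shapes, not the actual VQ-VAE implementation):

```python
import torch

def vector_quantize(z_e, codebook):
    # z_e: (N, D) encoder outputs; codebook: (K, D) learned embeddings.
    dists = torch.cdist(z_e, codebook)   # (N, K) pairwise Euclidean distances
    idx = dists.argmin(dim=1)            # index of the nearest codebook entry
    return codebook[idx], idx            # quantized latents and discrete codes
```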

Interesting choice to use SE (squeeze-and-excitation) instead of self-attention, which was proven to generate good images as well in SAGAN. Maybe memory limitations are why they chose channel attention instead of position-wise attention?

rayrayray-ll
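
The memory argument is plausible: a squeeze-and-excitation gate costs O(C) per sample, while a self-attention map costs O((H*W)^2). A minimal SE block for reference (the reduction ratio here is the common default from the SE paper, not necessarily NVAE's choice):

```python
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                           # x: (N, C, H, W)
        w = x.mean(dim=(2, 3))                       # squeeze: global avg pool
        w = self.fc(w).unsqueeze(-1).unsqueeze(-1)   # excite: channel gates
        return x * w                                 # rescale each channel
```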

"make VAEs great again" .. hm? ;-) Impressive engineering work. Swish in the wild. Much nicer than anything I got out of my VAE models. I wonder whether decoders could be added at each level to train with auxiliary losses at each level.

bluelng

I wonder how hard it is to train this NVAE. How long does it take to train compared with StyleGAN2? Their results are very good! It seems like a dream to generate high-quality samples without facing mode collapse. I'm curious about the downsides.

bioinfolucas

Perhaps the output images' skin is so smooth because the celebs (the training data) won't have it any other way!

napper

I just saw the images and wow! The model produces crisp images. I wonder what the output would look like for video.

kadirgunel

Regressive multistatic models in MNIST-CIFAR canvases or CIFAR-1-MNIST in the Bayesian way of encoding degression, looks like never distributed this sigma (E) from one package by the 256-byte, but 352-byte package of video encoded source data.

GGilbertProduction

I wonder if putting a GAN discriminator at the end of the decoder could help remove the cartoonish look of the generated images.

darkbb
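
That combination exists in the literature as VAE-GAN-style hybrids. A rough sketch of how the extra generator-side loss could attach to the decoder, with hypothetical `decoder` and `disc` modules:

```python
import torch
import torch.nn.functional as F

def adversarial_decoder_loss(decoder, disc, z):
    # Non-saturating GAN loss on decoded samples, added to the VAE objective:
    # the decoder is trained so the discriminator labels its outputs "real".
    logits = disc(decoder(z))
    return F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
```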