NVAE: A Deep Hierarchical Variational Autoencoder (Paper Explained)

VAEs have traditionally been hard to train at high resolutions and unstable when going deep with many layers. In addition, VAE samples are often blurrier and less crisp than those from GANs. This paper details all the engineering choices necessary to successfully train a deep hierarchical VAE that exhibits global consistency and astounding sharpness at high resolutions.

OUTLINE:
0:00 - Intro & Overview
1:55 - Variational Autoencoders
8:25 - Hierarchical VAE Decoder
12:45 - Output Samples
15:00 - Hierarchical VAE Encoder
17:20 - Engineering Decisions
22:10 - KL from Deltas
26:40 - Experimental Results
28:40 - Appendix
33:00 - Conclusion

Abstract:
Normalizing flows, autoregressive models, variational autoencoders (VAEs), and deep energy-based models are among competing likelihood-based frameworks for deep generative learning. Among them, VAEs have the advantage of fast and tractable sampling and easy-to-access encoding networks. However, they are currently outperformed by other models such as normalizing flows and autoregressive models. While the majority of the research in VAEs is focused on the statistical challenges, we explore the orthogonal direction of carefully designing neural architectures for hierarchical VAEs. We propose Nouveau VAE (NVAE), a deep hierarchical VAE built for image generation using depth-wise separable convolutions and batch normalization. NVAE is equipped with a residual parameterization of Normal distributions and its training is stabilized by spectral regularization. We show that NVAE achieves state-of-the-art results among non-autoregressive likelihood-based models on the MNIST, CIFAR-10, and CelebA HQ datasets and it provides a strong baseline on FFHQ. For example, on CIFAR-10, NVAE pushes the state-of-the-art from 2.98 to 2.91 bits per dimension, and it produces high-quality images on CelebA HQ as shown in Fig. 1. To the best of our knowledge, NVAE is the first successful VAE applied to natural images as large as 256×256 pixels.
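
The "residual parameterization of Normal distributions" mentioned in the abstract (the "KL from Deltas" segment in the outline) has a simple closed form. Below is a minimal PyTorch sketch, with illustrative names of my own rather than the authors' code, of how the KL term simplifies when each level's approximate posterior is defined by deltas relative to the prior:

```python
import torch

def residual_normal_kl(delta_mu, delta_log_sigma, sigma_p):
    """KL(q || p) per dimension when q is parameterized relative to the
    prior p = N(mu_p, sigma_p) via deltas:
        mu_q = mu_p + delta_mu,  sigma_q = sigma_p * exp(delta_log_sigma)
    The mu_p terms cancel, leaving only the deltas and the prior scale.
    """
    delta_sigma_sq = torch.exp(2.0 * delta_log_sigma)
    return 0.5 * (delta_mu.pow(2) / sigma_p.pow(2)
                  + delta_sigma_sq
                  - 2.0 * delta_log_sigma
                  - 1.0)
```

Note that zero deltas give exactly zero KL, so each latent level only has to encode what it adds on top of the prior.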

Authors: Arash Vahdat, Jan Kautz

Comments:

As a new PhD student in this field, I literally cannot thank you enough for making this content!

renehaas

You (or whoever is narrating) are really good at describing things clearly and exactly.

oreganorx

Thank you so much for making this video! It's so hard to find friendly content when you really start digging into these topics, and this is an absolute lifesaver!

mrcrazysalad

Great talk. Thanks for taking the time to read through this. The heavier linear algebra can be a bit daunting without a maths background. You help make it a bit more digestible.

johnpope

One thing missing from the explanation is why you prefer a distribution over latent codes: it gives you a continuous, smooth latent space in which you can sample new (unseen) interpolated latent codes that are still valid. This is what gives a VAE its generative power.

MrAlextorex
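
As a toy illustration of that point (a sketch with a hypothetical `decoder`, not code from the paper): because nearby latent codes decode to similar valid images, you can interpolate between two encodings and get plausible in-between samples.

```python
import torch

def interpolate_latents(decoder, z_a, z_b, steps=8):
    # Linear interpolation in latent space; a smooth latent space means
    # every intermediate z should decode to a valid image.
    images = []
    for t in torch.linspace(0.0, 1.0, steps):
        z = (1.0 - t) * z_a + t * z_b
        images.append(decoder(z))
    return images
```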

Awesome explanation. Great job explaining VAEs without the ELBO. This is a cool and conceptually simple way of building hierarchical VAEs (unlike, say, BIVA, which is a nightmare).

kazz
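
For reference, here is the ELBO that the video sidesteps, as a minimal sketch assuming a diagonal Gaussian posterior and a standard-Normal prior (the names are illustrative, not from the paper):

```python
import torch

def elbo(log_px_given_z, mu_q, log_sigma_q):
    # ELBO = E_q[log p(x|z)] - KL(q(z|x) || N(0, I)).
    # The KL term has the usual closed form for a diagonal Gaussian posterior.
    kl = 0.5 * (mu_q.pow(2) + torch.exp(2.0 * log_sigma_q)
                - 2.0 * log_sigma_q - 1.0).sum(dim=-1)
    return log_px_given_z - kl
```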

I legit expected a 'Dear fellow scholars' at the start! Lol.

mathematicalninja

Super awesome paper review! Many thanks.

alexijohansen

I was wrong - you CAN get details in (hierarchical) VAEs. I, too, was struck by the smoothness and "cutout"-like character of these faces. It seems like it handled lighting very differently, especially small-scale skin texture and oily skin shine. I suppose it would have been more realistic with more z levels and less upconversion at each stage, but from the number of tricks being played it looks like NVIDIA was already struggling to make it fit in memory, and perhaps to converge.

scottmiller

We need an ACNE dataset with no smooth faces to test the true power of these generative methods.

TheParkitny

I think what the decoder outputs is also a distribution rather than an image directly. The reconstruction error is then the likelihood of the input image X under the decoder's distribution, and you have to sample from it to get different images from the same latent code. The blurriness doesn't come from this sampling, but from the assumption that the image pixels are independent. VQ-VAE-2 amends this by replacing the fixed unit-Normal prior with a learnable autoregressive PixelCNN prior and by using a multiscale encoder/decoder pair.

binjianxin
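
A sketch of the reconstruction term described in the comment above, assuming an independent Normal per pixel (illustrative names and shapes): the decoder emits per-pixel parameters, and the loss is the log-likelihood of the input under that distribution. The per-pixel independence is exactly what pulls the means toward an average and makes classic VAE samples blurry.

```python
import torch

def reconstruction_log_likelihood(mu_x, x, sigma=0.1):
    # mu_x, x: (N, C, H, W). Independent Normal per pixel; the total
    # log-likelihood is the sum of per-pixel log-probs.
    dist = torch.distributions.Normal(mu_x, sigma)
    return dist.log_prob(x).sum(dim=[1, 2, 3])
```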

Can you do a video explaining normalizing flows?

rahuldeora

I wonder how VQ-VAEs compare - they are much simpler conceptually and practically and seem to address the same issues. You mentioned them briefly in the Jukebox video but they are probably worth their own video.

glennkroegel
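
For contrast, the core VQ-VAE step really is conceptually simple: snap each encoder output to its nearest codebook vector. A toy sketch (illustrative shapes, not the actual VQ-VAE implementation):

```python
import torch

def vector_quantize(z_e, codebook):
    # z_e: (N, D) encoder outputs; codebook: (K, D) learned embeddings.
    dists = torch.cdist(z_e, codebook)   # (N, K) pairwise Euclidean distances
    idx = dists.argmin(dim=1)            # index of the nearest codebook entry
    return codebook[idx], idx            # quantized latents and discrete codes
```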

Interesting choice to use SE (squeeze-and-excitation) instead of self-attention, which was proven to generate good images as well in SAGAN. Maybe memory limitations are why they chose channel attention instead of position-wise attention?

rayrayray-ll
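
The memory argument is plausible: a squeeze-and-excitation gate costs O(C) per sample, while a self-attention map costs O((H*W)^2). A minimal SE block for reference (the reduction ratio here is the common default from the SE paper, not necessarily NVAE's choice):

```python
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                           # x: (N, C, H, W)
        w = x.mean(dim=(2, 3))                       # squeeze: global avg pool
        w = self.fc(w).unsqueeze(-1).unsqueeze(-1)   # excite: channel gates
        return x * w                                 # rescale each channel
```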

"make VAEs great again" .. hm? ;-) Impressive engineering work. Swish in the wild. Much nicer than anything I got out of my VAE models. I wonder whether decoders could be added at each level to train with auxiliary losses at each level.

bluelng

I wonder how hard it is to train this NVAE. How long does it take to train compared with StyleGAN2? Their results are very good! It seems like a dream to generate high-quality samples without facing mode collapse. I'm curious about the downsides.

bioinfolucas

Perhaps the output images' skin is so smooth because the celebs (the training data) won't have it any other way!

napper

I just saw the images and wow! The model produces crisp images. I wonder what the output would look like for video.

kadirgunel

Regressive multistatic models in MNIST-CIFAR canvases or CIFAR-1-MNIST in the Bayesian way of encoding degression, looks like never distributed this sigma (E) from one package by the 256-byte, but 352-byte package of video encoded source data.

GGilbertProduction

I wonder if putting a GAN discriminator at the end of the decoder could help remove the cartoonish look of the generated images.

darkbb
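
That combination exists in the literature as VAE-GAN-style hybrids. A rough sketch of how the extra generator-side loss could attach to the decoder, with hypothetical `decoder` and `disc` modules:

```python
import torch
import torch.nn.functional as F

def adversarial_decoder_loss(decoder, disc, z):
    # Non-saturating GAN loss on decoded samples, added to the VAE objective:
    # the decoder is trained so the discriminator labels its outputs "real".
    logits = disc(decoder(z))
    return F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
```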