Visual AutoRegressive Modeling: Scalable Image Generation via Next-Scale Prediction



00:00 Intro
00:53 DiTs
04:06 Autoregressive Image Transformers
06:23 Tokenization problem with AR ViTs
08:43 VAE
10:47 Discrete Quantization - VQGAN
16:42 Visual Autoregressive Modeling
21:31 Causal Inference with VAR
24:02 Losses
25:16 Residual Modeling
33:26 Summary
34:11 Results
Comments

Thank you for your interpretation. It helped me better understand this paper. By the way, could you share which PDF annotation app you're using on your iPad? It looks quite handy. 😆

peppapopoAA

They didn't discuss a proof of linear scaling with model size, only generation time. My guess is that their linear scaling comes from training the VQVAE in tandem with the generation model, which DiT does not do. The frozen VAE sets a floor on the FID score, which for ImageNet 256x256 I believe is somewhere around 1.4 even with perfect latents. Pixel-space models wouldn't have that issue, but they are much more expensive to train and run.
That aside, VAR is a clever idea and the generation speed is impressive. I do wonder whether they could achieve better results (perhaps with smaller models) by combining the idea with MaskGIT. It would be a little slower (although a smaller model could make up for that), but it would allow for a self-correction step.

hjups

Thanks, I've been looking for new generative architectures for my research lately.

fusionlee

I didn't quite understand what the quantization part achieves. Does it serve only as the tokenization step, or does it also contribute to reducing the picture's resolution somehow?

Ali-wfef
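On the quantization question above: as I understand it, the quantizer only discretizes; the spatial downsampling happens in the convolutional encoder before it. A minimal sketch of the nearest-codebook lookup at the heart of VQ (the codebook size and latent dimension here are made-up toy values, not the paper's):

```python
import numpy as np

# Toy VQ step: map each continuous latent vector to the index of its
# nearest codebook entry. Sizes below are illustrative assumptions.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 16))   # 512 codebook entries, 16-dim each
latents = rng.normal(size=(64, 16))     # 64 latent vectors from the encoder

# Squared Euclidean distance from every latent to every codebook entry
dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
tokens = dists.argmin(axis=1)           # one discrete token per latent

print(tokens.shape)  # (64,)
```

Each latent vector becomes an integer index, which is what gives the transformer a discrete vocabulary to predict; the resolution reduction was already done by the encoder.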

Isn't this basically a diffusion model, except that instead of noising they blur (through downsampling) and try to revert the blur rather than the noise? And the vector quantization is similar to the one in Stable Diffusion, as far as I understand. But how does it compare to the general concept of score matching?

xplained
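On the blurring analogy above: my reading is that it is closer to a residual pyramid than to a diffusion forward process; each scale encodes exactly what the upsampled coarser scales missed, so the decomposition is deterministic and exact rather than stochastic. A toy sketch in pixel space (the paper operates on quantized latents, and the average-pool/nearest-neighbour choices here are my assumptions):

```python
import numpy as np

def downsample(x, k):
    """Average-pool a (H, W) map down to (k, k). Assumes H, W divisible by k."""
    H, W = x.shape
    return x.reshape(k, H // k, k, W // k).mean(axis=(1, 3))

def upsample(x, size):
    """Nearest-neighbour upsample a (k, k) map to (size, size)."""
    k = x.shape[0]
    return np.repeat(np.repeat(x, size // k, axis=0), size // k, axis=1)

# Coarse-to-fine residual decomposition: each scale stores only what the
# accumulated reconstruction from coarser scales still gets wrong.
x = np.random.default_rng(0).normal(size=(8, 8))
scales, recon, residuals = [1, 2, 4, 8], np.zeros_like(x), []
for k in scales:
    r = downsample(x - recon, k)        # residual at this scale
    residuals.append(r)
    recon = recon + upsample(r, 8)      # refine the reconstruction

print(np.abs(x - recon).max())          # essentially 0: last scale is full-res
```

Because the finest scale matches the input resolution, the final residual closes the gap exactly, which is quite unlike the noising/denoising pair in a diffusion model.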

Where are the actual learnable NN parameters in Algorithms 1 and 2? In the interpolation step? Also, you depict r1, r2, and so on as sets of 1, 2, and so on tokens, but shouldn't it be the squares (1, 4, and so on)?

lawrencephillips
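On the second question above: the commenter is right that r_k is a k×k grid, so it contributes k² tokens, not k. With the scale schedule I believe the paper uses at 256×256 (quoted from memory, so treat it as an assumption):

```python
# Token counts per scale: a k×k map contributes k**2 tokens, so a schedule
# given in side lengths grows quadratically in token count.
side_lengths = [1, 2, 3, 4, 5, 6, 8, 10, 13, 16]  # assumed schedule for 256px
tokens_per_scale = [k * k for k in side_lengths]

print(tokens_per_scale)       # [1, 4, 9, 16, 25, 36, 64, 100, 169, 256]
print(sum(tokens_per_scale))  # 680 tokens total, vs 256 for a flat 16×16 AR
</test>```

Even if the exact schedule differs, the point stands: token counts per scale are the squares of the side lengths.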

I don't get how they actually generate tokens with their transformer. In a normal transformer, the end result is each token predicting the next token. But here, each token predicts what? An entire higher-resolution image? And how does this happen in parallel? Is each token in the next resolution not attending to its own tokens, and how does each token in the next resolution know what position it's in? Is it each token in the previous resolution predicting its four closest patches in the next resolution? This architecture is nothing like a traditional transformer, and they spend no time explaining the difference. Very poorly written paper in my opinion, however good their benchmarks may be.

zacklee
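A sketch of what I believe the attention structure looks like: during training, a block-wise causal mask lets every token of a scale attend to all tokens of that same scale and of every coarser scale, but not to finer ones; at inference, one whole scale is then generated per forward pass. (Per-scale positional embeddings, which answer the "how does it know its position" question, are omitted here; the toy schedule is my assumption.)

```python
import numpy as np

# Block-wise causal mask over a toy scale schedule: tokens are grouped by
# scale, and token i may attend to token j iff j's scale is no finer.
sides = [1, 2, 4]                        # toy side lengths per scale
sizes = [s * s for s in sides]           # tokens per scale: 1, 4, 16
n = sum(sizes)                           # 21 tokens total
scale_of = np.repeat(np.arange(len(sides)), sizes)  # scale index per token
mask = scale_of[:, None] >= scale_of[None, :]       # True = may attend

# The single coarsest token sees only itself; any finest-scale token sees all.
print(mask[0].sum(), mask[-1].sum())     # 1 21
```

So it is still a transformer with causal masking, just causal at the granularity of scales rather than individual tokens, which is what makes within-scale generation parallel.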

This paper literally makes no sense. The whole point of autoregressive modeling is to sample one token at a time, conditioning each token on the ones before it. This method breaks all of that by having the model sample hundreds (perhaps thousands) of tokens independently in one inference step, despite all of those tokens sharing one large joint probability space. The only way this works is if you overtrain your model to such an absurd degree that there is basically no ambiguity anywhere in your generation step, at which point you can simply take the argmax over every single logit and have it work out.

And, if you look at their methodology, that's exactly what you find. This model was trained for **350** epochs over the same data. That is **absurd**. So yeah, don't expect this method to work unless you wildly overtrain it. It has some good ideas (e.g. hierarchical generation), but the rest of its claims are dubious at best.

marinepower
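For context on "independently" in the comment above: as I understand it, tokens within a scale are sampled independently only given all coarser scales, the same conditional-independence assumption that MaskGIT's parallel decoding makes, rather than by taking an argmax. Mechanically, a sketch of one per-scale sampling step (all sizes here are toy assumptions):

```python
import numpy as np

# One parallel decoding step: a single forward pass yields logits for every
# position in the scale, and each position is sampled from its own
# categorical distribution (conditioned, via attention, on coarser scales).
rng = np.random.default_rng(0)
vocab, positions = 512, 16               # toy codebook and scale sizes
logits = rng.normal(size=(positions, vocab))

probs = np.exp(logits - logits.max(-1, keepdims=True))  # stable softmax
probs /= probs.sum(-1, keepdims=True)
tokens = np.array([rng.choice(vocab, p=p) for p in probs])

print(tokens.shape)  # (16,)
```

Whether that conditional-independence assumption holds well enough without heavy training is exactly the question the comment raises.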