Visual AutoRegressive Modeling: Scalable Image Generation via Next-Scale Prediction



00:00 Intro
00:53 DiTs
04:06 Autoregressive Image Transformers
06:23 Tokenization problem with AR ViTs
08:43 VAE
10:47 Discrete Quantization - VQGAN
16:42 Visual Autoregressive Modeling
21:31 Causal Inference with VAR
24:02 Losses
25:16 Residual Modeling
33:26 Summary
34:11 Results
Comments

Thank you for your interpretation. It helped me better understand this paper. By the way, could you share which PDF annotation app you're using on your iPad? It looks quite handy. 😆

peppapopoAA

They didn't discuss a proof of linear scaling with model size, only generation time. My guess is that their linear scaling comes from training the VQVAE in tandem with the generation model, which DiT does not do. The frozen VAE sets a floor on the FID score, which for ImageNet 256x256 I believe is somewhere around 1.4 even with perfect latents. Pixel-space models wouldn't have that issue, but they are much more expensive to train and run.
That aside, VAR is a clever idea and the generation speed is impressive. I do wonder whether they could achieve better results (perhaps with smaller models) by combining the idea with MaskGIT. It would be a little slower (although a smaller model could make up for that), but it would allow for a self-correction step.

hjups

Thanks, I've been looking for new generative architectures for my research lately.

fusionlee

I didn't quite understand what the quantization part achieves. Does it serve only as the tokenization step, or does it also contribute to reducing the picture's resolution somehow?

Ali-wfef
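On the quantization question above: as I understand it, the quantizer only discretizes; the spatial downsampling happens in the convolutional encoder before it. A minimal sketch of the nearest-codebook lookup at the heart of VQ (the codebook size and latent dimension here are made-up toy values, not the paper's):

```python
import numpy as np

# Toy VQ step: map each continuous latent vector to the index of its
# nearest codebook entry. Sizes below are illustrative assumptions.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 16))   # 512 codebook entries, 16-dim each
latents = rng.normal(size=(64, 16))     # 64 latent vectors from the encoder

# Squared Euclidean distance from every latent to every codebook entry
dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
tokens = dists.argmin(axis=1)           # one discrete token per latent

print(tokens.shape)  # (64,)
```

Each latent vector becomes an integer index, which is what gives the transformer a discrete vocabulary to predict; the resolution reduction was already done by the encoder.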

Isn't this basically a diffusion model, except that instead of noising they blur (through downsampling) and try to revert the blur rather than the noise? And the vector quantization is similar to the one in Stable Diffusion, as far as I understand. But how does it compare to the general concept of score matching?

xplained
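On the blurring analogy above: my reading is that it is closer to a residual pyramid than to a diffusion forward process; each scale encodes exactly what the upsampled coarser scales missed, so the decomposition is deterministic and exact rather than stochastic. A toy sketch in pixel space (the paper operates on quantized latents, and the average-pool/nearest-neighbour choices here are my assumptions):

```python
import numpy as np

def downsample(x, k):
    """Average-pool a (H, W) map down to (k, k). Assumes H, W divisible by k."""
    H, W = x.shape
    return x.reshape(k, H // k, k, W // k).mean(axis=(1, 3))

def upsample(x, size):
    """Nearest-neighbour upsample a (k, k) map to (size, size)."""
    k = x.shape[0]
    return np.repeat(np.repeat(x, size // k, axis=0), size // k, axis=1)

# Coarse-to-fine residual decomposition: each scale stores only what the
# accumulated reconstruction from coarser scales still gets wrong.
x = np.random.default_rng(0).normal(size=(8, 8))
scales, recon, residuals = [1, 2, 4, 8], np.zeros_like(x), []
for k in scales:
    r = downsample(x - recon, k)        # residual at this scale
    residuals.append(r)
    recon = recon + upsample(r, 8)      # refine the reconstruction

print(np.abs(x - recon).max())          # essentially 0: last scale is full-res
```

Because the finest scale matches the input resolution, the final residual closes the gap exactly, which is quite unlike the noising/denoising pair in a diffusion model.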

Where are the actual learnable NN parameters in Algorithms 1 and 2? In the interpolation step? Also, you depict r1, r2, and so on as sets of 1, 2, and so on tokens, but shouldn't it be the squares (1, 4, and so on)?

lawrencephillips
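On the second question above: the commenter is right that r_k is a k×k grid, so it contributes k² tokens, not k. With the scale schedule I believe the paper uses at 256×256 (quoted from memory, so treat it as an assumption):

```python
# Token counts per scale: a k×k map contributes k**2 tokens, so a schedule
# given in side lengths grows quadratically in token count.
side_lengths = [1, 2, 3, 4, 5, 6, 8, 10, 13, 16]  # assumed schedule for 256px
tokens_per_scale = [k * k for k in side_lengths]

print(tokens_per_scale)       # [1, 4, 9, 16, 25, 36, 64, 100, 169, 256]
print(sum(tokens_per_scale))  # 680 tokens total, vs 256 for a flat 16×16 AR
</test>```

Even if the exact schedule differs, the point stands: token counts per scale are the squares of the side lengths.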

I don't get how they actually generate tokens with their transformer. In a normal transformer, the end result is each token predicting the next token. But here, each token predicts what? An entire higher-resolution image? And how does this happen in parallel? Is each token in the next resolution not attending to its own tokens, and how does each token in the next resolution know what position it's in? Is it each token in the previous resolution predicting its four closest patches in the next resolution? This architecture is nothing like a traditional transformer, and they spend no time explaining the difference. Very poorly written paper in my opinion, however good their benchmarks may be.

zacklee
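A sketch of what I believe the attention structure looks like: during training, a block-wise causal mask lets every token of a scale attend to all tokens of that same scale and of every coarser scale, but not to finer ones; at inference, one whole scale is then generated per forward pass. (Per-scale positional embeddings, which answer the "how does it know its position" question, are omitted here; the toy schedule is my assumption.)

```python
import numpy as np

# Block-wise causal mask over a toy scale schedule: tokens are grouped by
# scale, and token i may attend to token j iff j's scale is no finer.
sides = [1, 2, 4]                        # toy side lengths per scale
sizes = [s * s for s in sides]           # tokens per scale: 1, 4, 16
n = sum(sizes)                           # 21 tokens total
scale_of = np.repeat(np.arange(len(sides)), sizes)  # scale index per token
mask = scale_of[:, None] >= scale_of[None, :]       # True = may attend

# The single coarsest token sees only itself; any finest-scale token sees all.
print(mask[0].sum(), mask[-1].sum())     # 1 21
```

So it is still a transformer with causal masking, just causal at the granularity of scales rather than individual tokens, which is what makes within-scale generation parallel.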

This paper literally makes no sense. The whole point of autoregressive modeling is to sample one token at a time, conditioning each token on the ones before it. This method breaks all of that by having the model sample hundreds (perhaps thousands) of tokens independently in one inference step, despite all of those tokens sharing one large joint probability space. The only way this works is if you overtrain your model to such an absurd degree that there is basically no ambiguity anywhere in your generation step, at which point you can simply take the argmax over every single logit and have it work out.

And, if you look at their methodology, that's exactly what you find. This model was trained for **350** epochs over the same data. That is **absurd**. So yeah, don't expect this method to work unless you wildly overtrain it. It has some good ideas (e.g. hierarchical generation), but the rest of its claims are dubious at best.

marinepower
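For context on "independently" in the comment above: as I understand it, tokens within a scale are sampled independently only given all coarser scales, the same conditional-independence assumption that MaskGIT's parallel decoding makes, rather than by taking an argmax. Mechanically, a sketch of one per-scale sampling step (all sizes here are toy assumptions):

```python
import numpy as np

# One parallel decoding step: a single forward pass yields logits for every
# position in the scale, and each position is sampled from its own
# categorical distribution (conditioned, via attention, on coarser scales).
rng = np.random.default_rng(0)
vocab, positions = 512, 16               # toy codebook and scale sizes
logits = rng.normal(size=(positions, vocab))

probs = np.exp(logits - logits.max(-1, keepdims=True))  # stable softmax
probs /= probs.sum(-1, keepdims=True)
tokens = np.array([rng.choice(vocab, p=p) for p in probs])

print(tokens.shape)  # (16,)
```

Whether that conditional-independence assumption holds well enough without heavy training is exactly the question the comment raises.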