Efficient Text-to-Image Training (16x cheaper than Stable Diffusion) | Paper Explained

Würstchen is a diffusion model whose text-conditional component works in a highly compressed latent space of images. Why is this important? Compressing data can reduce computational costs for both training and inference by orders of magnitude. Training on 1024x1024 images is far more expensive than training on 32x32. Other works usually use a relatively small compression factor, in the range of 4x - 8x spatial compression. Würstchen takes this to an extreme: through its novel design, we achieve a 42x spatial compression.
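To get a feel for what that buys you, here is a rough back-of-the-envelope sketch (a minimal Python example, assuming the compression factor applies to each spatial dimension and that cost scales roughly with the number of spatial positions the diffusion model processes; real training cost also depends on model width, steps, and data, so treat the ratios as intuition only):

```python
def latent_side(image_side: int, spatial_compression: int) -> int:
    """Side length of the latent grid after spatial compression."""
    return image_side // spatial_compression

def position_ratio(image_side: int, spatial_compression: int) -> float:
    """How many times fewer spatial positions the compressed latent has."""
    side = latent_side(image_side, spatial_compression)
    return (image_side * image_side) / (side * side)

for f in (4, 8, 42):
    side = latent_side(1024, f)
    print(f"f{f}: 1024x1024 -> {side}x{side} latents, "
          f"~{position_ratio(1024, f):.0f}x fewer spatial positions")
```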

If you want to dive even deeper into Würstchen, here is the link to the paper & code:

We also created a community Discord for people interested in Generative AI:
Comments

You are definitely the most detailed and understandable person I have ever seen.

xiaolongye-yg

Thank you for your wonderful explanation. Yes, I am very interested in learning about diffusion models, especially text to image.

SaraKangazian

Super nice video explaining the architecture behind Stable Cascade. Stage B was nicely visualized, but I still need a bit more time to fully grasp it. Well done!

dbender

Hi! If it's not a secret, where do you get the datasets for training text2img models? Great video!

NedoAnimations

This is great; models that are intuitive to understand are the best ones, I find. Great job of explaining it as well.

jeanbedry

Awesome!!! I always wait for your videos.

dbssus

Damn, you're so smart. Thanks for explaining this to us. I hope you'll make millions of dollars.

hayhay_to

Love this - what a fantastic achievement!

EvanSpades

Amazing work. I am wondering how this video was made, i.e. the editing process and the cool animations.

eswardivi

Can you please tell us where you studied the entirety of ML/deep learning? (Courses?)

flakky

Why use a second encoder? Isn't that what the VQGAN is supposed to do?

TheAero

I love the video :) and I would love more detail 😮😮😮😮

timeTegus

Amazing job, and I really love the idea of reducing the size of the models, since it just makes so much sense to me!! I have a small question: what GPUs did you use for training? Did you use a cloud provider for that, or do you have your own local workstation? If the latter, I'm interested to know which hardware components you have. Just curious, because I'm trying to decide between using cloud providers for training and buying a local workstation 😊

mohammadaljumaa

Hello,
How do I make a seamless pattern with Würstchen? I tried a few prompts, but the edges are always problematic.

digiministrator

At inference, the input of Stage A (the VQGAN decoder) is discrete latents. Continuous latents need to be quantized into discrete latents (each discrete latent is chosen from the codebook by finding the codebook vector nearest to the continuous latent vector). But the output of Stage B is continuous latents, and that output goes directly into Stage A, if I understand correctly? How does Stage A (the VQGAN decoder) handle continuous latents? I checked the VQGAN paper and this Würstchen paper, but it is not clear. Please help me with that. Thank you!

KienLe-mdyv
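(Not an answer from the authors, but for readers puzzling over the same question: the generic VQGAN recipe for turning continuous latents into decoder input is a nearest-neighbour lookup against the learned codebook. The sketch below only illustrates that standard lookup with made-up shapes; it is not Würstchen's actual inference code, which may handle this step differently.)

```python
import torch

def quantize_to_codebook(z: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Snap continuous latents to their nearest codebook entries.

    z:        (batch, channels, height, width) continuous latents
    codebook: (num_entries, channels) learned VQ codebook
    """
    b, c, h, w = z.shape
    flat = z.permute(0, 2, 3, 1).reshape(-1, c)      # (b*h*w, c)
    dists = torch.cdist(flat, codebook)              # distance to every codebook entry
    indices = dists.argmin(dim=1)                    # index of nearest entry per vector
    quantized = codebook[indices].reshape(b, h, w, c)
    return quantized.permute(0, 3, 1, 2)             # back to (b, c, h, w)

# Hypothetical shapes, just to show the call:
z = torch.randn(1, 4, 24, 24)            # continuous latents, e.g. from Stage B
codebook = torch.randn(8192, 4)          # made-up codebook size
z_q = quantize_to_codebook(z, codebook)  # discrete-valued input for the decoder
```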

Did you use the same dataset as SDXL?

JT-hgmj

Hi Dominic,
This is some great work you have accomplished, and definitely a step in the right direction toward democratizing the diffusion method.

I have some questions, and a little bit of critique if that would be okay.

You say you achieve a compression rate of 42x; however, is this a fair statement when that vector is never decompressed into an actual image?
It looks more like your Stage C creates some sort of feature vectors of images in a very low-dimensional space using the text descriptions, which are then used to guide the actual image creation, along with the embedded text, in Stage B.

In my opinion, it looks more like you have used Stage C to learn a feature-vector representation of the image, which is used as a condition, similar to how language-free text-to-image models might use the image itself as guidance during training.

However, I don't believe this to be a 42x image compression without the decompression. Have you tried connecting a decoder onto the vectors coming out of Stage C?
(I suspect that vector might not be big enough to create high-resolution images because of its dimensionality.)

I hope you can answer some of my questions or clear up any misunderstandings on my part.
I'm currently doing my thesis on fast diffusion models and found your concept of extreme compression very compelling. Directions on where to go next regarding this topic would also be very much appreciated :)

Best of luck with further research.

jollokim

Awesome! I have a question: who decided to call it "Würstchen", and why? I am German and just wondering.

hipy-tzqt

How do you create the dynamic videos of NNs? I want to create a YouTube channel explaining theory & code in Spanish. Best regards.

saulcanoortiz

What about decompression times? Are they faster, and would they need fewer resources on older systems?
I'm curious whether the models from this would benefit users; i.e., most people still use the 1.5 and v2 models of SD because the decompression times of SDXL models take so long.

streamtabulous