Efficient Text-to-Image Training (16x cheaper than Stable Diffusion) | Paper Explained

Würstchen is a diffusion model whose text-conditional component works in a highly compressed latent space of images. Why is this important? Compressing data can reduce computational costs for both training and inference by orders of magnitude. Training on 1024x1024 images is far more expensive than training on 32x32. Other works usually use a relatively small compression factor, in the range of 4x - 8x spatial compression. Würstchen takes this to an extreme: through its novel design, we achieve a 42x spatial compression.
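To get a feel for what that buys you, here is a rough back-of-the-envelope sketch (a minimal Python example, assuming the compression factor applies to each spatial dimension and that cost scales roughly with the number of spatial positions the diffusion model processes; real training cost also depends on model width, steps, and data, so treat the ratios as intuition only):

```python
def latent_side(image_side: int, spatial_compression: int) -> int:
    """Side length of the latent grid after spatial compression."""
    return image_side // spatial_compression

def position_ratio(image_side: int, spatial_compression: int) -> float:
    """How many times fewer spatial positions the compressed latent has."""
    side = latent_side(image_side, spatial_compression)
    return (image_side * image_side) / (side * side)

for f in (4, 8, 42):
    side = latent_side(1024, f)
    print(f"f{f}: 1024x1024 -> {side}x{side} latents, "
          f"~{position_ratio(1024, f):.0f}x fewer spatial positions")
```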

If you want to dive even deeper into Würstchen, here is the link to the paper & code:

We also created a community Discord for people interested in Generative AI:
Comments

You are definitely the most detailed and understandable person I have ever seen.

xiaolongye-yg

Thank you for your wonderful explanation. Yes, I am very interested in learning about diffusion models, especially text to image.

SaraKangazian

Super nice video explaining the architecture behind Stable Cascade. Stage B was nicely visualized, but I still need a bit more time to fully grasp it. Well done!

dbender

Hi! If it's not a secret, where do you get the datasets for training text2img models? Great video!

NedoAnimations

This is great; models that are intuitive to understand are the best ones, I find. Great job of explaining it as well.

jeanbedry

Awesome!!! I always wait for your videos.

dbssus

Damn, you're so smart. Thanks for explaining this to us. I hope you'll make millions of dollars.

hayhay_to

Love this - what a fantastic achievement!

EvanSpades

Amazing work. I am wondering how this video was made, i.e. the editing process and the cool animations.

eswardivi

Can you please tell us where you studied the entirety of ML/deep learning? (Courses?)

flakky

Why use a second encoder? Isn't that what the VQGAN is supposed to do?

TheAero

I love the video :) and I would love more detail 😮😮😮😮

timeTegus

Amazing job, and I really love the idea of reducing the size of the models, since it just makes so much sense to me!! I have a small question: what GPUs did you use for training? Did you use a cloud provider for that, or do you have your own local workstation? If the latter, I'm interested to know which hardware components you have. Just curious, because I'm trying to decide between using cloud providers for training and buying a local workstation 😊

mohammadaljumaa

Hello,
How do I make a seamless pattern with Würstchen? I tried a few prompts, but the edges are always problematic.

digiministrator

At inference, the input of Stage A (the VQGAN decoder) is discrete latents. Continuous latents need to be quantized into discrete latents (each discrete latent is chosen from the codebook by finding the codebook vector nearest to the continuous latent vector). But the output of Stage B is continuous latents, and that output goes directly into Stage A, if I understand correctly? How does Stage A (the VQGAN decoder) handle continuous latents? I checked the VQGAN paper and this Würstchen paper, but it is not clear. Please help me with that. Thank you!

KienLe-mdyv
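(Not an answer from the authors, but for readers puzzling over the same question: the generic VQGAN recipe for turning continuous latents into decoder input is a nearest-neighbour lookup against the learned codebook. The sketch below only illustrates that standard lookup with made-up shapes; it is not Würstchen's actual inference code, which may handle this step differently.)

```python
import torch

def quantize_to_codebook(z: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Snap continuous latents to their nearest codebook entries.

    z:        (batch, channels, height, width) continuous latents
    codebook: (num_entries, channels) learned VQ codebook
    """
    b, c, h, w = z.shape
    flat = z.permute(0, 2, 3, 1).reshape(-1, c)      # (b*h*w, c)
    dists = torch.cdist(flat, codebook)              # distance to every codebook entry
    indices = dists.argmin(dim=1)                    # index of nearest entry per vector
    quantized = codebook[indices].reshape(b, h, w, c)
    return quantized.permute(0, 3, 1, 2)             # back to (b, c, h, w)

# Hypothetical shapes, just to show the call:
z = torch.randn(1, 4, 24, 24)            # continuous latents, e.g. from Stage B
codebook = torch.randn(8192, 4)          # made-up codebook size
z_q = quantize_to_codebook(z, codebook)  # discrete-valued input for the decoder
```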

Did you use the same dataset as SDXL?

JT-hgmj

Hi Dominic,
This is some great work you have accomplished, and definitely a step in the right direction toward democratizing the diffusion method.

I have some questions, and a little bit of critique if that would be okay.

You say you achieve a compression rate of 42x; however, is this a fair statement when that vector is never decompressed into an actual image?
It looks more like your Stage C creates some sort of feature vectors of images in a very low-dimensional space using the text descriptions, which are then used to guide the actual image creation, along with the embedded text, in Stage B.

In my opinion, it looks more like you have used Stage C to learn a feature-vector representation of the image, which is used as a condition, similar to how language-free text-to-image models might use the image itself as guidance during training.

However, I don't believe this to be a 42x image compression without the decompression. Have you tried connecting a decoder onto the vectors coming out of Stage C?
(I suspect that vector might not be big enough to create high-resolution images because of its dimensionality.)

I hope you can answer some of my questions or clear up any misunderstandings on my part.
I'm currently doing my thesis on fast diffusion models and found your concept of extreme compression very compelling. Directions on where to go next regarding this topic would also be very much appreciated :)

Best of luck with further research.

jollokim

Awesome! I have a question: who decided to call it "Würstchen", and why? I am German and just wondering.

hipy-tzqt

How do you create the dynamic videos of NNs? I want to create a YouTube channel explaining theory & code in Spanish. Best regards.

saulcanoortiz

What about decompression times? Are they faster, and would they need fewer resources on older systems?
I'm curious whether the models from this would benefit users; i.e., most people still use the 1.5 and v2 models of SD because the decompression times of SDXL models take so long.

streamtabulous