DALL-E 3 is better at following Text Prompts! Here is why. — DALL-E 3 explained

Synthetic captions help DALL-E 3 follow text prompts better than DALL-E 2. We explain how OpenAI improves the training of diffusion models with better image captions.

Thanks to our Patrons who support us in Tier 2, 3, 4: 🙏
Dres. Trost GbR, Siltax, Vignesh Valliappan, Mutual Information, Kshitij

Outline:
00:00 DALL-E 3
00:41 Gradient (Sponsor)
01:50 Timeline of image generation
03:34 Recaptioning with synthetic captions
04:36 Creating the synthetic captions
05:19 How well does it work?

▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
🔥 Optionally, pay us a coffee to help with our Coffee Bean production! ☕
▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀

🔗 Links:

#AICoffeeBreak #MsCoffeeBean #MachineLearning #AI #research
Music 🎵 : 368 - Dyalla

Video editing: Nils Trost
Comments

This channel always asks questions I never knew I wanted the answers to, but yes

sadface

I love this video. So clear and you put the questions just right. DALL-E 3 is really good.

elinetshaaf

DALL-E 3 is really a lot better than the previous iterations. I was wondering why that was. This makes a lot of sense, thanks for the explanation!

DerPylz

03:10 🏗 DALL-E 3 is a latent diffusion model based on a U-Net and uses T5 XXL for encoding the text prompt. Further technical details are undisclosed. (A toy sketch of this text-conditioning setup follows after this list.)
04:04 📝 DALL-E 3 addresses the issue of missing details in prompts by generating synthetic captions for training data, which are more detailed than the original alt-text captions.
05:25 🔄 The training data for DALL-E 3 consists of 95% synthetic captions and 5% actual human-written captions, resulting in images that are preferred by human annotators. (See the caption-blending sketch after this list.)
06:24 📊 The similarity between generated images and captions is measured using CLIP scores, showing that the model trained on synthetic captions performs well compared to models trained on human-written captions. (See the CLIP-score sketch after this list.)
06:52 🧐 Google Research also published a similar idea of using synthetic captioning, indicating that it's a promising approach for improving model performance. The potential drawbacks and limitations are discussed.
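
A minimal sketch of the text-conditioning setup summarized at 03:10, using small public stand-ins (t5-small and a tiny, randomly initialized diffusers U-Net), since the actual DALL-E 3 U-Net and training details are not disclosed:

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel
from diffusers import UNet2DConditionModel

# Stand-ins: DALL-E 3 reportedly uses T5 XXL; t5-small keeps the demo light.
tokenizer = AutoTokenizer.from_pretrained("t5-small")
text_encoder = T5EncoderModel.from_pretrained("t5-small")

# Tiny, randomly initialized latent U-Net, only to show where the text conditioning plugs in.
unet = UNet2DConditionModel(
    sample_size=32,                          # latent resolution, not pixels
    in_channels=4,
    out_channels=4,
    block_out_channels=(64, 128, 256, 256),  # far smaller than any real model
    cross_attention_dim=512,                 # must match the text encoder's hidden size
)

prompt = "an oil painting of a corgi wearing a party hat"
tokens = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    text_states = text_encoder(**tokens).last_hidden_state  # (1, seq_len, 512)

# The U-Net attends to the caption embeddings via cross-attention while denoising latents.
latents = torch.randn(1, 4, 32, 32)
noise_pred = unet(latents, timestep=10, encoder_hidden_states=text_states).sample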
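Next, a rough sketch of the 95/5 caption blending from 05:25. The dataset wrapper and its field names are made up for illustration and are not taken from OpenAI's code:

```python
import random
from torch.utils.data import Dataset

class BlendedCaptionDataset(Dataset):
    """Hypothetical wrapper: pairs each image with its synthetic caption 95% of
    the time and with the original alt-text caption the remaining 5%."""

    def __init__(self, samples, synthetic_ratio=0.95):
        # `samples` is assumed to be a list of dicts with keys
        # "image", "alt_text_caption", and "synthetic_caption".
        self.samples = samples
        self.synthetic_ratio = synthetic_ratio

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        sample = self.samples[idx]
        use_synthetic = random.random() < self.synthetic_ratio
        caption = sample["synthetic_caption"] if use_synthetic else sample["alt_text_caption"]
        return sample["image"], caption
```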
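Finally, a sketch of the CLIP-score evaluation from 06:24, using a public CLIP checkpoint from Hugging Face transformers as a stand-in for whatever evaluation setup OpenAI actually used:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, caption: str) -> float:
    """Cosine similarity between CLIP's image embedding and text embedding."""
    inputs = processor(text=[caption], images=image, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    image_embeds = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
    text_embeds = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    return float((image_embeds * text_embeds).sum(dim=-1))

# Example: score a generated image against the prompt that produced it.
# score = clip_score(Image.open("generated.png"), "an oil painting of a corgi in a party hat")
```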

dameanvil

Thanks. Great video. Love how you covered the subject, but also your running commentary, for example about the lack of showing what happened at 98%, or the lack of disclosure of more details.

Interesting that both labs use stable diffusion for research.

Also, wondering why OAI moved from the transformer architecture to diffusion, as you pointed out? Didn't Google also release the Muse model based on a transformer? I understand that you have a day job, but may I still hint at a Muse video? 😊

mkamp

I really enjoy your videos. Very insightful and light. Keep 'em coming.

marklopez

What about partly training CLIP itself on synthetic data?
Then use the new CLIP embeddings in diffusion models for even better results.

Synthetic data all the way down.
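
For context on what that would involve: CLIP is trained with a symmetric contrastive (InfoNCE) loss over image-text pairs, so training it on synthetic data would mean feeding batches of (image, synthetic caption) embeddings through a loss like the sketch below. The function and variable names are illustrative only.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds: torch.Tensor,
                          text_embeds: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of matched image/text embeddings."""
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits = image_embeds @ text_embeds.t() / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)  # matched pairs lie on the diagonal
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```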

Xaelum

What about the fact that it's near perfect with text? It had to have been targeted fine-tuning, right, the same way Midjourney fixed hands?

donesitackacom

So, does anyone here know where to get some descriptive caption and image datasets?

Edit: it doesn't have to be from OpenAI; other alternatives or models are fine.

dinhanhx

Hey, which paper was it that shows the catastrophic degradation of models when errors are introduced by synthetic data?

mattk

How is this synthetic data generation done? What I would do is detect all the objects in the scene and where they are, and from that let an LLM describe the scene without seeing it. I'm not sure how GPT-Vision exactly works and how it understands images, and I also don't know if they used GPT-V to create these descriptions for the images. Does anyone know?
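
For what it's worth, OpenAI's DALL-E 3 report describes training a dedicated image captioner on descriptive captions rather than combining an object detector with an LLM. As a rough, hypothetical stand-in for producing descriptive synthetic captions, an off-the-shelf captioner such as BLIP can be used (the checkpoint and prompt length here are just examples, not what OpenAI used):

```python
from PIL import Image
from transformers import BlipForConditionalGeneration, BlipProcessor

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")

def synthetic_caption(image_path: str) -> str:
    """Generate a descriptive caption for one image."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=60)
    return processor.decode(output_ids[0], skip_special_tokens=True)

# Example: print(synthetic_caption("photo.jpg"))
```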

VR_Wizard

Your videos are rocking, as always. Hey, do you have any remote internship opportunities in your team or in your organisation? I would love to learn and work with you guys.

TemporaryForstudy