MIT 6.S191 (2023): Text-to-Image Generation

MIT Introduction to Deep Learning 6.S191: Lecture 8
Text-to-Image Generation
Lecturer: Dilip Krishnan
2023 Edition

Lecture Outline - coming soon!

Subscribe to stay up to date with new deep learning lectures at MIT, or follow us @MITDeepLearning on Twitter and Instagram to stay fully-connected!!
Comments

I appreciate this series, and am grateful for it. Therefore, some minor feedback on this lecture:
unlike the others, this one feels less like being taught
and more like he's presenting his own research/work.

MrMMF

I'm tired of image generators creating cat images (I have two, I don't need to see any more).

What I would really love to see are technically accurate diagrams, drawings, and other visualizations: 2D, 3D, and later video, at a user-specified level of detail.

It seems nobody is doing that.

Imagine asking an AI to visualize how a particular ML model processes data, and seeing a video of it in action.

tomski

This is an "intro" class; however, it is attended by experts 😅

mariuspy

I was looking forward to this lecture. It did not disappoint.

chyldstudios

🎯Course outline for quick navigation:

[00:09-05:11]1. Text-to-image generation with the Muse model
-[01:07-02:21]Text enables non-experts to create compelling images, leveraging large-scale data and pre-trained language models.
-[04:52-05:22]Muse is faster, generating a 512x512 image in 1.3 seconds, compared to 10 seconds per image for prior models and about 4 seconds for Stable Diffusion.

[05:11-17:23]2. Efficient image generation model and token-based super resolution with Muse
-[05:11-06:02]512x512 image generated in 1.3s, outperforming Stable Diffusion by 6.7s, with strong CLIP score and FID, indicating fast, high-quality performance.
-[06:28-07:01]Muse uses a transformer-based architecture for text and image, employing CNNs, vector quantization, and a GAN from the modern deep-network toolbox.
-[07:21-07:56]Two models: the base model generates 256x256 images, and super resolution upscales to 512x512. A T5-XXL model with 5B parameters is used.
-[10:47-11:12]A variable masking distribution drops 64% of tokens, strengthening the network for editing applications and allowing masks of different sizes.
-[16:33-16:58]Classifier-free guidance is used to generate scenes without specific objects (see the sketch below).
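
A quick note on the classifier-free guidance mentioned in the last bullet: the model is run twice per decoding step, once with the text conditioning and once with an empty ("null") prompt, and the conditional logits are extrapolated away from the unconditional ones. A minimal Python sketch, assuming a hypothetical Muse-like model(tokens, text_emb) that returns per-token logits; the names and the guidance scale are illustrative, not taken from the lecture:

```python
def guided_logits(model, image_tokens, text_emb, null_emb, guidance_scale=2.0):
    """Classifier-free guidance for a token-based generator (illustrative sketch).

    `model`, `text_emb`, and `null_emb` are hypothetical stand-ins: the model is
    evaluated once with the real prompt embedding and once with an empty-prompt
    embedding, and the two sets of logits are extrapolated.
    """
    cond = model(image_tokens, text_emb)     # logits conditioned on the prompt
    uncond = model(image_tokens, null_emb)   # logits with the prompt dropped
    # Push predictions away from the unconditional distribution.
    return uncond + guidance_scale * (cond - uncond)
```

One common variant for generating a scene without a specific object is to replace the null embedding with the embedding of the unwanted object (a negative prompt), which steers samples away from it.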

[17:24-22:48]3. Image generation and model evaluation
-[17:24-18:00]Iterative decoding improves image generation with up to 18 steps (see the decoding sketch after this section).
-[18:57-19:29]Raters preferred the Muse model 70% of the time over Stable Diffusion (25%).
-[21:21-22:48]Text-to-image model evaluation and the challenges of mapping concepts to pixels.
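
The "iterative decoding" mentioned above refers to MaskGIT-style parallel decoding: all image tokens start out masked, and at each step the model predicts every masked position in parallel, commits only the most confident predictions, and re-masks the rest on a shrinking schedule. A rough PyTorch sketch under those assumptions; the model, sequence length, and mask id are hypothetical placeholders, not values from the lecture:

```python
import math
import torch

def parallel_decode(model, text_emb, seq_len=256, steps=18, mask_id=8192):
    """Iterative parallel decoding sketch (MaskGIT/Muse-style); illustrative only."""
    tokens = torch.full((1, seq_len), mask_id)            # start fully masked
    for step in range(steps):
        logits = model(tokens, text_emb)                  # (1, seq_len, codebook_size)
        conf, pred = logits.softmax(-1).max(-1)           # confidence and argmax token
        still_masked = tokens.eq(mask_id)
        # Positions decoded in earlier steps are never reconsidered.
        conf = torch.where(still_masked, conf, torch.full_like(conf, float("inf")))
        # Cosine schedule: fewer positions remain masked as decoding progresses.
        keep_masked = int(seq_len * math.cos(math.pi / 2 * (step + 1) / steps))
        if keep_masked > 0:
            # Re-mask the `keep_masked` least-confident positions, commit the rest.
            cutoff = conf.topk(keep_masked, largest=False).values.max()
            commit = still_masked & (conf > cutoff)
        else:
            commit = still_masked                         # last step: commit everything left
        tokens = torch.where(commit, pred, tokens)
    return tokens
```

A real implementation would sample tokens rather than take the argmax and would perturb the confidences, but the overall structure, a fixed number of parallel steps instead of one token at a time, is what makes this kind of decoding fast.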

[22:48-28:40]4. Evaluating DALL-E 2 and Imagen models, text-guided image editing, and style transfer
-[22:48-25:10]The DALL-E 2 model excels in FID and CLIP scores; runtime of 1.3 seconds for super resolution (see the CLIP-score sketch after this section).
-[26:15-27:42]Style transfer and mask-free editing demonstrated with image examples, showcasing the model's ability to make varied changes based on text and on attention between text and image.
-[28:18-28:40]A cake and latte morph into a croissant, and the latte art changes from a heart to a flower.
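
FID and CLIP score, used in the evaluation above, are both automated metrics: FID compares feature statistics of generated versus real images, while CLIP score measures text-image alignment as the cosine similarity between CLIP embeddings of the prompt and the generated image. A small CLIP-score sketch using the Hugging Face transformers library; the checkpoint and prompt are illustrative choices, not necessarily what the lecture used:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, prompt: str) -> float:
    """Cosine similarity between CLIP image and text embeddings (higher = better alignment)."""
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img * txt).sum(dim=-1).item()

# Example: score a generated image against its prompt.
# print(clip_score(Image.open("generated.png"), "latte art shaped like a heart"))
```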

[28:43-36:08]5. Interactive editing and parallel decoding
-[28:43-30:46]The model enables real-time interactive editing, with a focus on improving resolution and text-image interaction (see the editing sketch after this section).
-[31:06-31:56]Research focuses on speeding up neural-network processing, aiming at parallel decoding and the use of high-confidence tokens.
-[33:27-34:07]Editing suggests smoothness, but the model fails with more than 6-7 of the same item.
-[35:26-35:55]The model generates random scenes dominated by backgrounds such as mountains and beaches when fed nonsense text prompts.
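
The interactive editing described in this section fits a token-based model naturally: the tokens for the parts of the image the user wants to keep are frozen, and only the edit region is masked and re-decoded under the new prompt. A rough sketch using the same hypothetical model(tokens, text_emb) interface as the decoding sketch above; all names are illustrative, not the lecture's code:

```python
import torch

def edit_region(model, image_tokens, edit_mask, text_emb, mask_id=8192):
    """Mask-based editing sketch: regenerate only the tokens under `edit_mask`.

    `image_tokens` are the VQ tokens of the existing image and `edit_mask` is a
    boolean tensor of the same shape marking the region to change.
    """
    # Mask out the edit region; every other token stays fixed throughout.
    tokens = torch.where(edit_mask, torch.full_like(image_tokens, mask_id), image_tokens)
    while bool(tokens.eq(mask_id).any()):
        logits = model(tokens, text_emb)                  # hypothetical Muse-style model
        conf, pred = logits.softmax(-1).max(-1)
        masked = tokens.eq(mask_id)
        conf = torch.where(masked, conf, torch.full_like(conf, -1.0))
        # Commit the most confident half of the still-masked positions each pass.
        n_commit = max(1, int(masked.sum()) // 2)
        idx = conf.flatten().topk(n_commit).indices
        flat = tokens.flatten().clone()
        flat[idx] = pred.flatten()[idx]
        tokens = flat.view_as(tokens)
    return tokens
```

The frozen tokens still provide context through attention, which is what lets the regenerated region blend with the rest of the image.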

[36:08-44:32]6. Image editing and AI generation
-[36:55-37:36]The editing process involves small backprop steps and may need faster progression.
-[38:31-38:57]Realistic pose changes are harder than global style changes due to increased token interaction.
-[39:36-40:10]Using random seeds, images are generated; 3-4 out of 8-16 are nice. There is no automated way to match an image to its prompt.
-[41:31-42:22]Data is biased towards famous artists, requiring fine-tuning for new styles.
-[44:06-44:32]Training on hundreds of millions of images to identify new combinations of concepts; likely not memorization.

offered by Coursnap

bohaning

I would love to see the MUSE model generating images of X-rays or CT scans of curable and non-curable brain tumours, and having the accuracy of those images vetted by at least 1000 brain cancer surgeons.

rahulsingh

Hello! Where are the slides for this lecture? There is no link on the website.

dianzhang

Thanks a lot ❤❤ I love watching your lectures.

phyrajkumarverma

It's very interesting! Thank you for that!

nataliameira

Hello! Where are the slides for this lecture? There is no link on the website 🥲

teron-

Is MUSE related to the Diffusion Transformer? Thanks.

AndyKong

Hello! Where are the slides for this lecture? There is no link on the website.

teron-

I expected a lecture on Diffusion Models too.

abhishek-tandon