MIT 6.S191 (2023): Text-to-Image Generation

MIT Introduction to Deep Learning 6.S191: Lecture 8
Text-to-Image Generation
Lecturer: Dilip Krishnan
2023 Edition

Lecture Outline - coming soon!

Subscribe to stay up to date with new deep learning lectures at MIT, or follow us @MITDeepLearning on Twitter and Instagram to stay fully-connected!!
Comments

I appreciate this series, and am grateful for it. Therefore, some minor feedback on this lecture:
unlike the others, this one feels less like being taught
and more like he's presenting his own research/work.

MrMMF

I'm tired of image generators creating cat images (I have two, I don't need to see any more).

What I would really love to see are technically accurate diagrams, drawings, and other visualizations: 2D, 3D, and later video, at a user-specified level of detail.

It seems nobody is doing that.

Imagine asking an AI to visualize how a particular ML model processes data, and seeing a video of it in action.

tomski

This is an "intro" class; however, it is attended by experts 😅

mariuspy

I was looking forward to this lecture. It did not disappoint.

chyldstudios

🎯Course outline for quick navigation:

[00:09-05:11]1. Text-to-image generation with the Muse model
-[01:07-02:21]Text enables non-experts to create compelling images, leveraging large-scale data and pre-trained language models.
-[04:52-05:22]Muse is faster, generating a 512x512 image in 1.3 seconds, compared to 10 seconds per image for prior models and about 4 seconds for Stable Diffusion.

[05:11-17:23]2. Efficient image generation model and token-based super resolution with Muse
-[05:11-06:02]512x512 image generated in 1.3s, outperforming Stable Diffusion by 6.7s, with strong CLIP score and FID, indicating fast, high-quality performance.
-[06:28-07:01]Muse uses a transformer-based architecture for text and image, employing CNNs, vector quantization, and a GAN from the modern deep-network toolbox.
-[07:21-07:56]Two models: the base model generates 256x256 images, and super resolution upscales to 512x512. A T5-XXL model with 5B parameters is used.
-[10:47-11:12]A variable masking distribution drops 64% of tokens, strengthening the network for editing applications and allowing masks of different sizes.
-[16:33-16:58]Classifier-free guidance is used to generate scenes without specific objects (see the sketch below).
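
A quick note on the classifier-free guidance mentioned in the last bullet: the model is run twice per decoding step, once with the text conditioning and once with an empty ("null") prompt, and the conditional logits are extrapolated away from the unconditional ones. A minimal Python sketch, assuming a hypothetical Muse-like model(tokens, text_emb) that returns per-token logits; the names and the guidance scale are illustrative, not taken from the lecture:

```python
def guided_logits(model, image_tokens, text_emb, null_emb, guidance_scale=2.0):
    """Classifier-free guidance for a token-based generator (illustrative sketch).

    `model`, `text_emb`, and `null_emb` are hypothetical stand-ins: the model is
    evaluated once with the real prompt embedding and once with an empty-prompt
    embedding, and the two sets of logits are extrapolated.
    """
    cond = model(image_tokens, text_emb)     # logits conditioned on the prompt
    uncond = model(image_tokens, null_emb)   # logits with the prompt dropped
    # Push predictions away from the unconditional distribution.
    return uncond + guidance_scale * (cond - uncond)
```

One common variant for generating a scene without a specific object is to replace the null embedding with the embedding of the unwanted object (a negative prompt), which steers samples away from it.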

[17:24-22:48]3. Image generation and model evaluation
-[17:24-18:00]Iterative decoding improves image generation with up to 18 steps (see the decoding sketch after this section).
-[18:57-19:29]Raters preferred the Muse model 70% of the time over Stable Diffusion (25%).
-[21:21-22:48]Text-to-image model evaluation and the challenges of mapping concepts to pixels.
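
The "iterative decoding" mentioned above refers to MaskGIT-style parallel decoding: all image tokens start out masked, and at each step the model predicts every masked position in parallel, commits only the most confident predictions, and re-masks the rest on a shrinking schedule. A rough PyTorch sketch under those assumptions; the model, sequence length, and mask id are hypothetical placeholders, not values from the lecture:

```python
import math
import torch

def parallel_decode(model, text_emb, seq_len=256, steps=18, mask_id=8192):
    """Iterative parallel decoding sketch (MaskGIT/Muse-style); illustrative only."""
    tokens = torch.full((1, seq_len), mask_id)            # start fully masked
    for step in range(steps):
        logits = model(tokens, text_emb)                  # (1, seq_len, codebook_size)
        conf, pred = logits.softmax(-1).max(-1)           # confidence and argmax token
        still_masked = tokens.eq(mask_id)
        # Positions decoded in earlier steps are never reconsidered.
        conf = torch.where(still_masked, conf, torch.full_like(conf, float("inf")))
        # Cosine schedule: fewer positions remain masked as decoding progresses.
        keep_masked = int(seq_len * math.cos(math.pi / 2 * (step + 1) / steps))
        if keep_masked > 0:
            # Re-mask the `keep_masked` least-confident positions, commit the rest.
            cutoff = conf.topk(keep_masked, largest=False).values.max()
            commit = still_masked & (conf > cutoff)
        else:
            commit = still_masked                         # last step: commit everything left
        tokens = torch.where(commit, pred, tokens)
    return tokens
```

A real implementation would sample tokens rather than take the argmax and would perturb the confidences, but the overall structure, a fixed number of parallel steps instead of one token at a time, is what makes this kind of decoding fast.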

[22:48-28:40]4. Evaluating DALL-E 2 and Imagen models, text-guided image editing, and style transfer
-[22:48-25:10]The DALL-E 2 model excels in FID and CLIP scores; runtime of 1.3 seconds for super resolution (see the CLIP-score sketch after this section).
-[26:15-27:42]Style transfer and mask-free editing demonstrated with image examples, showcasing the model's ability to make varied changes based on text and on attention between text and image.
-[28:18-28:40]A cake and latte morph into a croissant, and the latte art changes from a heart to a flower.
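
FID and CLIP score, used in the evaluation above, are both automated metrics: FID compares feature statistics of generated versus real images, while CLIP score measures text-image alignment as the cosine similarity between CLIP embeddings of the prompt and the generated image. A small CLIP-score sketch using the Hugging Face transformers library; the checkpoint and prompt are illustrative choices, not necessarily what the lecture used:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, prompt: str) -> float:
    """Cosine similarity between CLIP image and text embeddings (higher = better alignment)."""
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img * txt).sum(dim=-1).item()

# Example: score a generated image against its prompt.
# print(clip_score(Image.open("generated.png"), "latte art shaped like a heart"))
```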

[28:43-36:08]5. Interactive editing and parallel decoding
-[28:43-30:46]The model enables real-time interactive editing, with a focus on improving resolution and text-image interaction (see the editing sketch after this section).
-[31:06-31:56]Research focuses on speeding up neural-network processing, aiming at parallel decoding and the use of high-confidence tokens.
-[33:27-34:07]Editing suggests smoothness, but the model fails with more than 6-7 of the same item.
-[35:26-35:55]The model generates random scenes dominated by backgrounds such as mountains and beaches when fed nonsense text prompts.
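
The interactive editing described in this section fits a token-based model naturally: the tokens for the parts of the image the user wants to keep are frozen, and only the edit region is masked and re-decoded under the new prompt. A rough sketch using the same hypothetical model(tokens, text_emb) interface as the decoding sketch above; all names are illustrative, not the lecture's code:

```python
import torch

def edit_region(model, image_tokens, edit_mask, text_emb, mask_id=8192):
    """Mask-based editing sketch: regenerate only the tokens under `edit_mask`.

    `image_tokens` are the VQ tokens of the existing image and `edit_mask` is a
    boolean tensor of the same shape marking the region to change.
    """
    # Mask out the edit region; every other token stays fixed throughout.
    tokens = torch.where(edit_mask, torch.full_like(image_tokens, mask_id), image_tokens)
    while bool(tokens.eq(mask_id).any()):
        logits = model(tokens, text_emb)                  # hypothetical Muse-style model
        conf, pred = logits.softmax(-1).max(-1)
        masked = tokens.eq(mask_id)
        conf = torch.where(masked, conf, torch.full_like(conf, -1.0))
        # Commit the most confident half of the still-masked positions each pass.
        n_commit = max(1, int(masked.sum()) // 2)
        idx = conf.flatten().topk(n_commit).indices
        flat = tokens.flatten().clone()
        flat[idx] = pred.flatten()[idx]
        tokens = flat.view_as(tokens)
    return tokens
```

The frozen tokens still provide context through attention, which is what lets the regenerated region blend with the rest of the image.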

[36:08-44:32]6. Image editing and AI generation
-[36:55-37:36]The editing process involves small backprop steps and may need faster progression.
-[38:31-38:57]Realistic pose changes are harder than global style changes due to increased token interaction.
-[39:36-40:10]Using random seeds, images are generated; 3-4 out of 8-16 are nice. There is no automated way to match an image to its prompt.
-[41:31-42:22]Data is biased towards famous artists, requiring fine-tuning for new styles.
-[44:06-44:32]Training on hundreds of millions of images to identify new combinations of concepts; likely not memorization.

offered by Coursnap

bohaning

I would love to see the MUSE model generating images of X-rays or CT scans of curable and non-curable brain tumours, and having the accuracy of those images vetted by at least 1000 brain cancer surgeons.

rahulsingh

Hello! Where are the slides for this lecture? There is no link on the website.

dianzhang

Thanks a lot ❤❤ I love watching your lectures.

phyrajkumarverma

It's very interesting! Thank you for that!

nataliameira

Hello! Where are the slides for this lecture? There is no link on the website 🥲

teron-

Is MUSE related to the Diffusion Transformer? Thanks.

AndyKong

Hello! Where are the slides for this lecture? There is no link on the website.

teron-

I expected a lecture on Diffusion Models too.

abhishek-tandon