Imagen, the DALL-E 2 competitor from Google Brain, explained 🧠| Diffusion models illustrated

Imagen from Google Brain 🧠 is competing with DALL-E 2 when it comes to generating amazing images from just text! Here is an overview of Imagen, DALL-E 2, and GLIDE, which are all diffusion-based text-to-image generators.

Thanks to our Patrons who support us in Tier 2, 3, 4: 🙏

▶ Outline:
00:00 Generating images from text
00:40 Weights & Biases (Sponsor)
01:40 A brief history of text-to-image generators
04:44 How does Imagen work?
06:36 Classifier-free guidance
07:55 What is Imagen good at? & DrawBench

▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
🔥 Optionally, pay us a coffee to help with our Coffee Bean production! ☕
▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀

🔗 Links:

#AICoffeeBreak #MsCoffeeBean #MachineLearning #AI #research

Music 🎵 : Til I Hear'em Say (Instrumental) - NEFFEX
Comments

Video after video. I don't understand how "only" 16.2k people follow you. I'll share this diamond. Thank you so much.

mizupof

4:20 We don't want to minimize the work done here or the achievements, we just want to emphasize how much the naming of things and the right introduction to the public matter... !!!
I don't want to minimize the value of this review, but for me personally this observation is much more valuable than the technicalities of all of those papers (because I don't use them directly in my work).

harumambaru

After having played with DALL-E 2 for a few weeks, a few things strike me:
1) This is a huge advance in coherent image generation; most of the resulting images feel "good" in a way I couldn't say about previous models.
2) It is going to hit the creative industries hard.
3) Having a corporation gatekeep these models is hugely problematic: they don't allow you to prompt for whatever you like, they assert copyright, etc.
4) We need an open-source, freely available version, not least because being "allowed" access is too tenuous for artistic use, where you might break the terms and conditions.
5) We need better methods of guiding. DALL-E 2 has "variations", which helps, but I want to tell the model what I like in parts of generated images and have it understand the latents that produced those outputs specifically.

So overall, this is a really exciting new set of models. I agree with you about the photorealism part; I often switch between them for more creative outputs. Truly revolutionary!

zoombapup

Thanks for these videos; I'm watching the whole series.

samanthaqiu

Omg, FINALLY I understand "classifier-free guidance" after watching this. I really dislike whoever came up with this name; it should be called "classifier-overweighted guidance" or something, because it's not at all 'classifier-free', is it?

great vid as always!

liam
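
For readers puzzled by the same name: classifier-free guidance runs the same denoiser twice per sampling step, once with and once without the text condition, and over-weights the difference. Here is a minimal PyTorch sketch, assuming a generic conditional denoiser eps_model (a hypothetical name) and an illustrative guidance scale:

import torch

def cfg_noise_estimate(eps_model, x_t, t, text_emb, guidance_scale=7.5):
    # Two passes through the SAME model: one conditioned on the text,
    # one with the condition dropped (e.g. a zeroed "null" embedding,
    # as is done randomly during training).
    eps_cond = eps_model(x_t, t, text_emb)
    eps_uncond = eps_model(x_t, t, torch.zeros_like(text_emb))
    # No separate classifier network is involved (hence "classifier-free"),
    # but the conditional signal is over-weighted by extrapolating away
    # from the unconditional estimate, which is the commenter's point.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)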

Your channel is a hidden gem! Thanks for uploading DL content. 😁👍

bryanpiguave

Excellent video, with a rich and detailed discussion comparing DALL-E 2 and Imagen! I'm not sure whether Imagen's results are better than DALL-E 2's, although the authors of the former gave a decent argument using the DrawBench benchmark. I thought they avoided human faces due to copyright and legal issues, but the bias issues you mentioned seem to be the correct answer.

bpmsilva

Would you mind recommending the social media accounts that discuss the latest hot papers from Facebook and Google?

deeper_learning

Hi! I very much enjoy your videos, thank you. If you take requests, I would greatly appreciate an explanation of flow-based models, as I have been trying to wrap my head around them for a while and your educational abilities are excellent.

jessedeng

A video on LDM / Stable Diffusion would be great, especially since it's not as prominent in the public eye as DALL-E 2.

conan_der_barbar

From their paper: "On the set with no people, there is a boost in preference rate of Imagen to 43.6%, indicating Imagen’s limited ability to generate photorealistic people." So they don't show examples with people because they're probably not great and/or likely scary-looking :)

alexandrupapiu

T5 does not use a CLS token. What is the CLS token you refer to?

Skinishh
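
Indeed, T5 has no [CLS] token: its encoder simply returns one embedding per input token, so any single-vector summary has to be pooled. A minimal sketch with the Hugging Face transformers library (mean pooling is only one common choice; Imagen itself conditions on the full sequence of T5 embeddings rather than a single pooled vector):

import torch
from transformers import T5Tokenizer, T5EncoderModel

tokenizer = T5Tokenizer.from_pretrained("t5-small")
encoder = T5EncoderModel.from_pretrained("t5-small")

inputs = tokenizer("a corgi playing the piano", return_tensors="pt")
with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state  # shape: (1, seq_len, d_model)

# There is no CLS slot; pool over tokens if one vector is needed.
pooled = hidden.mean(dim=1)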

Is Ms. Coffee Bean controlled by you during editing? Or is it auto-selected based on emotion detection from a language model?

GRAMBOStudio

To allow more creativity, just apply a Unix approach: pipe in text from yet another program that allows a _laissez-faire_ attitude and can play _fast and loose_ with said text. Extending the avocado chair, we might input "a <fruit> chair", where we define the <fruit> set within the text program and the image generator pumps out a load of fruit chairs. Let's get Dr. Pârcălăbescu seated and comfortable; she is a VIP.

❤ we love you, Romania ❤

johnvonhorn
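
A toy Python sketch of the pipe idea from the comment above; the generate call is hypothetical and stands in for whatever text-to-image API is available:

# Expand a <fruit> placeholder into many prompts, Unix-style:
# each printed line could be piped into an image generator.
fruits = ["avocado", "pineapple", "dragonfruit", "kiwi"]

for fruit in fruits:
    prompt = f"a {fruit} chair"
    print(prompt)
    # image = generate(prompt)  # hypothetical text-to-image call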

People are more critical of LLMs than they are of other people.

It's absolutely normal for people to mis-hear or misinterpret something. People ask clarifying questions all the time.

But if a model misinterprets something, it's considered a limitation, a "lack of understanding", or a failure.

Don't forget that a model is required to produce an answer instantly; it's not allowed to take a pause and think, or ask a clarifying question, etc.

I'd argue that drawing "a horse riding an astronaut" as a horse-riding astronaut is actually correct in these circumstances. There is likely a non-zero amount of noise and non-grammatical language use in the training data. So if a person riding a horse is vastly more likely than a horse riding a person, it is more reasonable to interpret the prompt as a possible error than to draw what is requested literally.

In that case, it might be possible to solve it with a prompt like "Draw this precisely".

killers

I just realized you are the author of VALSE and SEE PAST WORDS. I am also interested in probing VLP models, and your papers inspire me a lot. But I have always wondered what the reason is for the phenomenon discovered in VALSE 🤣🤣🤣 Why can't these models distinguish those small but meaning-changing modifications in language? 🤪🤪🤪

mianzhipan

I'm so impressed that Imagen can produce coherent text. DALL-E 2's gibberish text is cringeworthy (and sometimes hilarious).

hayvenforpeace

The revolution will come when explicit context can be used to drive these models (just my humble opinion).

yolgezerisvicrede