Text-to-video models explained

When it comes to text-to-video models, the way they create clips is very similar to the way text-to-image models create images from simple text prompts. In this episode of Hidden Layers, we take a look at how these models operate under the hood, understanding how they use Temporal Super Resolution (TSR) and Spatial Super Resolution (SSR) models to create high-resolution videos from frames of images. Moreover, you'll learn how text-to-video models, like Imagen, are an orchestration of various models working together to produce high-resolution videos from a single image and text prompt.
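
As a rough illustration of that cascade, here is a toy Python sketch, assuming a hypothetical pipeline in which a base text-to-video model produces a short, low-resolution clip and alternating Temporal (TSR) and Spatial (SSR) super-resolution stages then raise its frame rate and resolution. The function names, stage order, and scale factors are assumptions chosen for illustration, and naive interpolation stands in for the learned diffusion models; this is not Imagen's actual code or API.

# A minimal sketch of the cascaded text-to-video idea described in the episode.
# Stage order and scale factors are illustrative; naive interpolation stands in
# for the learned SSR/TSR diffusion models -- this is NOT Imagen's real pipeline.
import numpy as np

def base_model(prompt: str, frames=16, height=24, width=48) -> np.ndarray:
    """Stand-in for the base text-to-video model: returns a low-resolution,
    low-frame-rate clip as a (frames, height, width, 3) array."""
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.random((frames, height, width, 3)).astype(np.float32)

def ssr(video: np.ndarray, scale: int = 2) -> np.ndarray:
    """Spatial Super Resolution stand-in: upsample each frame (nearest neighbour)."""
    return video.repeat(scale, axis=1).repeat(scale, axis=2)

def tsr(video: np.ndarray) -> np.ndarray:
    """Temporal Super Resolution stand-in: insert an interpolated frame
    between every pair of neighbouring frames to raise the frame rate."""
    out = [video[0]]
    for prev, nxt in zip(video[:-1], video[1:]):
        out.append(0.5 * (prev + nxt))  # midpoint frame
        out.append(nxt)
    return np.stack(out)

def text_to_video(prompt: str) -> np.ndarray:
    """Orchestrate the cascade: base clip, then alternating TSR/SSR stages."""
    video = base_model(prompt)
    for stage in (tsr, ssr, tsr, ssr):  # order and count chosen for illustration
        video = stage(video)
    return video

clip = text_to_video("a teddy bear washing dishes")
print(clip.shape)  # e.g. (61, 96, 192, 3): more frames, higher resolution

Running the sketch shows the essence of the orchestration described above: the base clip comes out small and choppy, and each TSR or SSR stage only has to solve one sub-problem (more frames or more pixels) before handing the result to the next model.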

Resources:

Chapters:
0:00 - Intro
0:16 - What are text-to-video models?
0:34 - How do text-to-video models create videos?
1:40 - What are the complexities of modeling video?
2:04 - How do we get high-resolution videos from text-to-video models like Imagen?
3:38 - Recap of how Imagen works
3:47 - Leave us questions in the comments!

Comments

Awesome explanation - just the right length and good abstraction. The host did a really great job. I am subscribed now!

stefan-bayer

Thank you Laurence for making this accessible to dummies like me

sabaokangan

I would love to dive deeper into this to learn how it works!

asatorftw

The second episode of Hidden Layers, “Text-to-video models explained,” maintains the same high standard as the first episode. Many thanks once again to Laurence Moroney and Google Research!

Any chance we could cover Google’s LaMDA next? Perhaps there is another breakthrough conversation model you might touch upon as well. The whole idea of RLHF (Reinforcement Learning from Human Feedback) would be a great topic to dive into.

kevinbuehler

Awesome, crisp explanation ... even suitable for high schoolers & AI beginners

neelfun

Wonderful video! But why is the orchestration order two spatial then two temporal stages, instead of one spatial then one temporal?

xidchen

Awesome! I'm reacting to this live. I feel that these two Hidden Layers videos now beg the question: have we tried the autoregressive approach for text-to-video?

sotasearcher

REALLY AWESOME
almost Unbelievable

💯💯💯

SMASH_REVIEWS

Cool! Are the super resolution models also trained with text, or just labels?

tomoki-vo

That's pretty cool, though the last few upscaling and time-lengthening models sound very inefficient.
Like, it would be much better to have a single model that upscales the video to resolution X×Y @ Z FPS.

avi

What's confusing is how you get a sensible image when you denoise it. That part I don't quite see.

wryltxw

Does no one know the difference between "amount" and "number" anymore?

scottmiller