AI Art Explained: How AI Generates Images (Stable Diffusion, Midjourney, and DALLE)

Learn how AI image generation works. This video goes over the AI components of AI image generation models like Stable Diffusion and explains how they work and how they're trained.

---


Introduction (0:00)
Text-to-image and image-to-image (1:32)
The components of Stable Diffusion - high-level overview (3:06)
The three models inside the AI Image Generator (5:48)
Generating images with reverse diffusion (8:36)
Images emerging from noise (11:09)
How the model is trained. 1 - Diffusion (12:46)
How the model is trained. 2 - Compression (17:44)
The importance of language models for image generation (20:43)
How CLIP is trained (training on both text and images) (22:55)
Guiding image generation with text prompts (25:57)
Conclusion (28:07)
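
The training step named in the chapter "How the model is trained. 1 - Diffusion" can be sketched roughly as follows. This is a minimal illustration in plain Python, not the actual Stable Diffusion code: the function names (make_training_example, recover_image), the toy 4-value "image", and the square-root mixing weights are assumptions chosen for this example. It shows how one training example might be made by mixing an image with a chosen amount of Gaussian noise, and why knowing the noise is enough to get the image back, which is what a noise predictor is trained to approximate.

```python
import math
import random


def make_training_example(image, noise_amount, rng):
    """Mix an image with Gaussian noise and return (noisy_image, noise).

    image: flat list of pixel values, assumed scaled to [-1, 1]
    noise_amount: float in [0, 1); 0 keeps the image unchanged
    rng: a random.Random instance, so results are reproducible
    """
    noise = [rng.gauss(0.0, 1.0) for _ in image]
    keep = math.sqrt(1.0 - noise_amount)  # weight on the original image
    add = math.sqrt(noise_amount)         # weight on the noise
    noisy = [keep * p + add * n for p, n in zip(image, noise)]
    return noisy, noise


def recover_image(noisy, noise, noise_amount):
    """Undo the mixing, given the noise that was added."""
    keep = math.sqrt(1.0 - noise_amount)
    add = math.sqrt(noise_amount)
    return [(x - add * n) / keep for x, n in zip(noisy, noise)]


rng = random.Random(0)
image = [0.5, -0.25, 1.0, 0.0]  # toy stand-in for a real image
noisy, noise = make_training_example(image, 0.3, rng)

# A trained noise predictor would estimate `noise` from `noisy` and the
# noise amount. Here the true noise is used only to show the arithmetic.
recovered = recover_image(noisy, noise, 0.3)
```

In a real system the noise is not known at generation time; a neural network predicts it, and the subtraction is repeated over many small steps starting from pure noise.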
Comments

You have a very unique way of explaining deep learning concepts. The illustrations are very concise and to the point which really helps focus on the core concepts and not get distracted by technical details. Thanks for making this great video!

omidsajedi

Thank you, Sir, for sharing. Your explanations are always different. I have been following you since the transformer architecture videos. Great!

anupamsaha

As a visual thinker, I find SD can be quite overwhelming under the hood. I have been using the graphical interface "ComfyUI" and it has taken me quite a distance in understanding the dynamics of SD. Your video and page helped me a lot in taking the next step to the more advanced features and expanding my options. Thanks Jay!

Tjeminee

Very practical and useful information. Thanks!

laostalk

Thanks Jay! Just like your NLP Transformer series, which still stands the test of time, this is one more added to my list of go-to references. You are indeed a master in the art of teaching!

trajesh

Thank you! I finally understand Stable Diffusion!

maxkhan

Thanks Jay for all your efforts to share a bit of your knowledge in AI.
I am not an expert, by far, but I have come to the conclusion that AI is mainly a construction of hundreds of Lego bricks, assembled into specific architectures and trained with the same gradient backpropagation algorithm. Some of them perform well, others don't.
Therefore, the only genuine piece of AI theory is the mathematical background of the training algorithm. The rest is pure heuristics, more or less well explained: a kind of AI cookbook with ad hoc recipes.
The training algorithm itself seems very limited (even if highly powerful), since it is applied in a centralized way to a predefined architecture and does not participate in defining the architecture's topology. In other words, the topology is defined before training while, intuitively, the training should probably define the topology.
Therefore incremental learning remains a big issue in most AI architectures, if not all.
This lack of a consistent and unified AI theory (to my limited knowledge, there are no AI theorems or proofs that some sort of optimum is reached using a given architecture) makes me believe that we are at the very beginning of a new science still to come.
Could you react to the above humble considerations and share your thoughts?
Kind regards,

d.p.

Thanks Jay for the video. I understand the concept of converting a noisy image into a clear image.

How does it create an image that doesn't exist in its training data?
I understand that the model doesn't understand the concepts in the image and only focuses on patterns.
But how are the operations below performed?

1. Creating a cartoon image of a cat based on a caption, e.g. "Place a hat on top of the cat"
How does it create a cartoon image of a cat?
How does it know the exact location of the cat's head?
How does it know to place the hat exactly on the head?

2. A close-up shot of a dog facing the sun
How does it know to create a close-up shot of a dog?
How does it know to place the sun in the background?
How does it make the object turn towards the sun?


No videos exist to explain this concept. It would be of great help if you could make a video on this.

karthik

It renders the image from text instead of a 3D model. It's like Maya, but with words, using 1B+ pre-trained models (images with their text descriptions) from the Internet wired up with plain English. You don't have to build the models in 3D; you can just type what you want to create in plain English, and the AI renders out the image.

JohnGilbertmoore

Thanks for the explanation. Can you please make a 1 hr or 2 hr video with a deeper dive into the internals? Maybe you already have it recorded, I guess. Thanks.

muhammedaneesk.a

👆just like the transformers series, excellent

treksis

Dear Sir, I am your Subscriber

I want to create a tool that finds text errors in the image.

For Example:
If I forget to write CONTACT US, BUY NOW, or a CONTACT NUMBER, or make a SPELLING MISTAKE, etc. in my social media post,
the tool should find the error and suggest what is missing or incorrect in the post.

🙏 Please guide me and suggest what course I need to buy or what I need to learn to create this tool
Thank you!

anilsharmag

Is this simplified explanation of the process of noise in Stable Diffusion true?

It's like teaching an artist about our visual world -- object definitions, shapes, dimensions, etc., and how they correspond to the person who commissioned the art (text prompts).

The artist then watches a mosaic - say of an ice cream - being inserted by hundreds of tesserae (rectangular slabs used to create a mosaic) and then removed to restore the original mosaic. During this, the artist learns how to understand, recreate, and reinterpret the ‘ice cream’ image in other mosaics. The artist goes through this with millions of other depictions in mosaics (objects, locations, etc.) so they can create entirely new mosaics based on the requests (or text prompts) of the person commissioning them.

Sampling steps are like commissioning an artist to interpret and construct a mosaic quickly or carefully. The more detail or accuracy you want, the more work and time have to go into it.

jamiewatts

Thanks Jay - I had been looking for something that does more than describe the denoising process, and the attention bit related to prompts is what I was missing. That said, I still can't quite understand how you get a completely new image. I can understand that you should be able to get back to an original image (say a dog, or a flower) via the noisification and reverse process, but how can it, say, create an image with a flower and the dog such that they are integrated in some way? Where does that data come from? A visual example of the earlier stages which shows this would be helpful. The examples you had jumped from basically noise to an image (albeit unrefined) in 3 steps - I'd like to see this broken down so I can "see" what is happening. It still requires a level of acceptance without evidence that I am not happy with....

daveonvr

Would you kindly tell me whether it is possible to sell artwork that I made with Stable Diffusion? Does the administration allow this, and how can I communicate with them (I mean the management or support for this program)? And where can the pictures be sold as pieces of art? I do not speak English, help me

FACTSABOUTGAMES

Simple question: does that mean it can't create a prompt (or specific word) that it hasn't been trained on? Thank you for your video!

mostlynotworking

*AI Pictures. Art means craftsmanship and personal expression

CptBlaueWolke

Not all photographs are art, but photography can be an art. The nonsense I draw in a game of Pictionary has no craftsmanship or personal expression. However, illustration is an important form of art.

Not only would the AI never produce art on its own, it would never produce anything. The amount of craftsmanship and personal expression being put into the image is dependent on the person using it. A low effort random prompt to the AI is arguably not art, but that's not really the point.

nerdfinite

Thanks, great video again, but your voice has a lot of sibilants, making the listening experience atrocious. If you make enough money making these videos, I suggest hiring a professional audio producer/mixing guy to clean up the audio. Email me, I'll suggest someone.

simawpalmer