Lesson 9A 2022 - Stable Diffusion deep dive



00:00 - Introduction
00:40 - Replicating the sampling loop
01:17 - The Auto-Encoder
03:55 - Adding Noise and image-to-image
08:43 - The Text Encoding Process
15:15 - Textual Inversion
18:36 - The UNet and classifier-free guidance
24:41 - Sampling explanation
36:30 - Additional guidance
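The classifier-free guidance step covered at 18:36 boils down to one line: blend the unconditional and text-conditioned noise predictions. A minimal sketch, with NumPy arrays standing in for torch tensors and placeholder values standing in for real UNet outputs:

```python
import numpy as np

def classifier_free_guidance(noise_uncond, noise_cond, guidance_scale):
    # Push the prediction away from the unconditional output and
    # toward the text-conditioned output.
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)

# Toy "UNet outputs" for one denoising step.
uncond = np.zeros((4, 8, 8))  # prediction with the empty prompt
cond = np.ones((4, 8, 8))     # prediction with the real prompt
guided = classifier_free_guidance(uncond, cond, guidance_scale=7.5)
```

With `guidance_scale=1.0` this reduces to the conditional prediction; larger scales exaggerate the direction the prompt pulls in.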

Comments

Love your simple explanation of a manifold Jonathan. It's the first time it's made sense to me. Looking forward to the coming lectures.

markhopkins

Appreciate this supplemental deep dive into the code of Stable Diffusion!

timandersen

Thank you for this deep dive. The sampling explanation especially was helpful to try to get an intuition for what the model does.

al

Also, are we treating the latents like trainable parameters here? We subtract gradients from the latents, which is what we would typically do to the weights in a conventional NN.
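That is essentially what happens in the extra-guidance trick (36:30): the latents take the role that weights play in ordinary training, and a loss gradient is subtracted from them directly. A minimal NumPy sketch, using a toy loss whose gradient can be written by hand (the notebook computes its gradient through autograd instead):

```python
import numpy as np

def loss_fn(latents, target):
    # Toy loss: mean squared distance to a target latent.
    return ((latents - target) ** 2).mean()

def loss_grad(latents, target):
    # Analytic gradient of the toy loss w.r.t. the latents.
    return 2.0 * (latents - target) / latents.size

rng = np.random.default_rng(0)
latents = rng.standard_normal((4, 8, 8))
target = np.zeros_like(latents)

before = loss_fn(latents, target)
# Gradient step on the *latents* themselves, not on any model weights.
latents = latents - 100.0 * loss_grad(latents, target)
after = loss_fn(latents, target)
```

The model stays frozen; only the current latent state moves downhill on the guidance loss.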

adityagupta-hmvs

Please explain the ancestral samplers.
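Briefly, what makes a sampler "ancestral" is that it re-injects fresh noise on every step: the jump from the current noise level `sigma` to the next one `sigma_next` is split into a deterministic step down to `sigma_down` plus random noise of scale `sigma_up`. The sketch below follows the split used in k-diffusion's Euler-ancestral sampler, with NumPy standing in for torch:

```python
import numpy as np

def ancestral_step_sigmas(sigma, sigma_next):
    # Split the move from sigma to sigma_next into a deterministic
    # part (down to sigma_down) and fresh noise (scale sigma_up),
    # chosen so that sigma_down**2 + sigma_up**2 == sigma_next**2.
    sigma_up = min(sigma_next,
                   np.sqrt(sigma_next**2 * (sigma**2 - sigma_next**2) / sigma**2))
    sigma_down = np.sqrt(sigma_next**2 - sigma_up**2)
    return sigma_down, sigma_up

sigma_down, sigma_up = ancestral_step_sigmas(10.0, 7.0)
# The two parts recombine to the target noise level.
total = np.sqrt(sigma_down**2 + sigma_up**2)
```

Because fresh noise enters at every step, ancestral samplers give different images on different runs even from the same starting latents.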

alexrichmonkey

How do we decide the scaling factor in the VAE part, i.e. 0.18215? Any hint on how it was chosen? I tried changing it and could see different outputs, but what's a good way to choose it?
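For what it's worth, 0.18215 wasn't tuned per image: it was computed once so that latents from the training data end up with roughly unit standard deviation, the scale the UNet and noise schedule expect. A sketch of that calibration idea (the toy data here is synthetic; the real constant comes from the SD training set):

```python
import numpy as np

rng = np.random.default_rng(0)
# Pretend these are raw VAE latents for a batch of training images;
# their natural spread is around 1 / 0.18215, roughly 5.49.
raw_latents = rng.standard_normal((16, 4, 64, 64)) / 0.18215

# Calibrate one global factor so scaled latents have std of about 1.
scale = 1.0 / raw_latents.std()
scaled = raw_latents * scale
```

Changing the factor at inference time just feeds the UNet latents at the wrong scale, which is why the outputs degrade.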

adityagupta-hmvs

Pretty good video to further understand SD!

saidmoglu

I finally understand the schedulers! Thank you!

spider

This is useful, but I wish you went into more detail here and there. Is some CLIP or similar model included in the Stable Diffusion implementation? If so, are precomputed weights of the CLIP model used to calculate the noise prediction at each step? I.e. do we pass the current noisy image (in latent space) and the text embedding to CLIP and then calculate a gradient for each element of the image so that something (semantic similarity?) is maximized? I wish you would say what happens during training of the model and what then happens during inference :).
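For context: Stable Diffusion uses only CLIP's frozen *text* encoder, whose embedding conditions the UNet through cross-attention; no CLIP gradient is computed per pixel during standard inference (that gradient trick is a separate technique, CLIP guidance). The inference loop then looks roughly like this sketch, with NumPy placeholders standing in for the real UNet and scheduler:

```python
import numpy as np

rng = np.random.default_rng(0)

def fake_unet(latents, t, text_emb):
    # Placeholder for the UNet: predicts the noise in the latents,
    # conditioned on timestep t and the CLIP text embedding.
    return 0.1 * latents + 0.01 * text_emb.mean()

def scheduler_step(latents, noise_pred, t):
    # Placeholder scheduler update: remove part of the predicted noise.
    return latents - 0.5 * noise_pred

text_emb = rng.standard_normal((77, 768))  # frozen CLIP text encoding
latents = rng.standard_normal((4, 8, 8))   # start from pure noise

for t in [999, 749, 499, 249, 0]:          # toy timestep schedule
    noise_pred = fake_unet(latents, t, text_emb)
    latents = scheduler_step(latents, noise_pred, t)
```

During training, the UNet is taught to predict the noise that was added to a real (encoded) image at a random timestep; inference just runs that predictor repeatedly, as above.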

climez

I'm surprised how quickly the complexity ramps up. It's my second day and I'm only at the 4th minute; I spent 30 minutes debugging my coding-along session (I wrote rand_like instead of randn_like and my parrot photo went green instead of garbled).
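That bug is a nice illustration: `rand` samples uniformly from [0, 1) (mean about 0.5, every value positive), while `randn` samples from a standard normal (mean about 0), so "noise" from `rand` adds a constant positive offset to every channel instead of zero-mean static. In NumPy terms:

```python
import numpy as np

rng = np.random.default_rng(0)
uniform_noise = rng.random((3, 256, 256))            # like torch.rand_like
gaussian_noise = rng.standard_normal((3, 256, 256))  # like torch.randn_like

# Uniform "noise" is biased: every sample is positive.
uniform_mean = uniform_noise.mean()
gaussian_mean = gaussian_noise.mean()
```

Since the noising formulas assume zero-mean Gaussian noise, the uniform version systematically shifts the image, hence the color cast.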

AM-ykyd

Why do you "sample()" from the latents? Does this mean the latents are not the same between runs?
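Right: the encoder outputs a distribution (a mean and log-variance per latent element), and `.sample()` draws from it with the reparameterization trick, so the latents differ between runs unless you fix the seed or take the mean instead. A NumPy sketch of that sampling step:

```python
import numpy as np

def sample_latents(mean, logvar, rng):
    # Reparameterization: mean + std * eps, with eps ~ N(0, 1).
    std = np.exp(0.5 * logvar)
    return mean + std * rng.standard_normal(mean.shape)

mean = np.zeros((4, 8, 8))
logvar = np.full((4, 8, 8), -2.0)  # a fairly tight posterior

rng = np.random.default_rng()
draw_a = sample_latents(mean, logvar, rng)
draw_b = sample_latents(mean, logvar, rng)
```

In practice the posterior is narrow enough that both draws decode to nearly the same image.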

JohnSmith-hexg

How can it perform a custom action? Basically, how can we fine-tune it on our own input and target images so that it follows our text instructions?

jaivalani

Don't like all this jumping around. Would be much easier to simply go through it, in a linear fashion, explaining as you go. Disappointing

DinoFancellu