ControlNet: Adding Conditional Control to Text-to-Image Diffusion Models

ControlNet is a neural network architecture that adds spatial conditioning controls to large, pretrained text-to-image diffusion models. ControlNet locks the production-ready large diffusion models and reuses their deep and robust encoding layers, pretrained with billions of images, as a strong backbone for learning a diverse set of conditional controls. The neural architecture is connected with "zero convolutions" (zero-initialized convolution layers) that progressively grow the parameters from zero and ensure that no harmful noise affects the finetuning. ControlNet supports various conditioning controls, e.g., edges, depth, segmentation, human pose, etc., with Stable Diffusion, using single or multiple conditions, with or without prompts. Training of ControlNets is robust with both small (less than 50k) and large (more than 1M) datasets. Extensive results show that ControlNet may facilitate wider applications to control image diffusion models.
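As a concrete illustration of the zero-convolution idea, here is a minimal PyTorch sketch (my own, not the authors' released code; the wrapper and names like `ControlledBlock` are hypothetical) showing how a trainable copy of a backbone block can be attached through zero-initialized 1x1 convolutions:

```python
import torch
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    # "Zero convolution": a 1x1 conv whose weight and bias start at zero,
    # so the ControlNet branch initially contributes nothing.
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

class ControlledBlock(nn.Module):
    # Hypothetical wrapper: a frozen backbone block plus a trainable copy
    # whose input and output both pass through zero convolutions.
    def __init__(self, frozen_block: nn.Module, trainable_copy: nn.Module,
                 channels: int):
        super().__init__()
        self.frozen = frozen_block
        for p in self.frozen.parameters():
            p.requires_grad_(False)          # lock the pretrained backbone
        self.copy = trainable_copy           # trainable clone of the block
        self.zero_in = zero_conv(channels)   # injects the condition
        self.zero_out = zero_conv(channels)  # injects the copy's output

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # At initialization both zero convolutions output zeros, so this
        # returns exactly the frozen backbone's output: no harmful noise
        # is added at the start of finetuning.
        return self.frozen(x) + self.zero_out(self.copy(x + self.zero_in(cond)))
```

Although the zero convolutions output zeros at the first step, their weights still receive nonzero gradients (the gradient with respect to a conv weight depends on the input feature map, not on the weight itself), so the branch can grow from zero during training.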

In this video, I will talk about the following: What is conditional control, and why is it difficult? What is the ControlNet architecture? How do training and inference work for ControlNet? How does ControlNet perform?

Zhang, Lvmin, Anyi Rao, and Maneesh Agrawala. "Adding Conditional Control to Text-to-Image Diffusion Models." In ICCV, pp. 3836–3847, 2023.
Comments

anig: Respected Sir, as per my survey, only the MSE loss is used during the training of diffusion models. Is it possible to add other kinds of losses, such as a classification loss similar to GANs?
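For context, the objective the comment refers to is the standard noise-prediction loss used to train Stable Diffusion, which the paper also uses when training ControlNet. In the paper's notation (z_0 is the latent image, t the timestep, c_t the text prompt, c_f the task-specific condition, and ε_θ the denoising network), it is a plain MSE on the predicted noise:

```latex
\[
  \mathcal{L} =
  \mathbb{E}_{z_0,\, t,\, c_t,\, c_f,\, \epsilon \sim \mathcal{N}(0,1)}
  \Bigl[ \bigl\lVert \epsilon - \epsilon_\theta(z_t,\, t,\, c_t,\, c_f) \bigr\rVert_2^2 \Bigr]
\]
```

Auxiliary losses (e.g., adversarial or classification terms) have been explored elsewhere in the diffusion literature, but the paper itself trains ControlNet with this MSE objective alone.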