Double Your Stable Diffusion Inference Speed with RTX Acceleration and TensorRT: A Comprehensive Guide

Stable Diffusion gets a major boost with RTX acceleration. One of the most common ways to use Stable Diffusion, the popular generative AI tool that produces images from simple text descriptions, is through the Stable Diffusion Web UI by Automatic1111. In today's Game Ready Driver, NVIDIA added TensorRT acceleration for the Stable Diffusion Web UI, which boosts GeForce RTX performance by up to 2x. In this tutorial video I will show you everything about this new speed-up, from installing the extension to generating and using a TensorRT SD UNet.

#TensorRT #StableDiffusion #NVIDIA

Automatic Installer Of Tutorial ⤵️

Tutorial GitHub Readme File ⤵️

0:00 Introduction: how to utilize RTX Acceleration / TensorRT for 2x inference speed
2:15 How to do a fresh installation of Automatic1111 SD Web UI
3:32 How to enable quick SD VAE and SD UNET selection in the settings of Automatic1111 SD Web UI
4:38 How to install TensorRT extension to hugely speed up Stable Diffusion image generation
6:35 How to start / run Automatic1111 SD Web UI
7:19 How to install the TensorRT extension manually via URL install
7:58 How to install the TensorRT extension via the git clone method (see the command sketch after this chapter list)
8:57 How to download and upgrade cuDNN files
11:23 Speed test of SD 1.5 model without TensorRT
11:56 How to generate a TensorRT for a model
12:47 Explanation of min, optimal, max settings when generating a TensorRT model
14:00 Where the ONNX file is exported
15:48 How to set command line arguments to avoid errors during TensorRT generation
16:55 How to get maximum performance when generating and using TensorRT
17:41 How to start using generated TensorRT for almost double speed
18:08 How to switch to dev branch of Automatic1111 SD Web UI for SDXL TensorRT usage
20:33 Comparison of image differences with TensorRT on and off
20:45 Speed test of TensorRT with multiple resolutions
21:32 Generating a TensorRT for Stable Diffusion XL (SDXL)
23:24 How to verify you have switched to dev branch of Automatic1111 Web UI to make SDXL TensorRT work
24:32 Generating images with SDXL TensorRT
25:00 How to generate TensorRT for your DreamBooth trained model
25:49 How to install the After Detailer (ADetailer) extension and an explanation of what it does
27:23 Starting generation of TensorRT for SDXL
28:06 Batch size vs batch count difference
29:00 How to train an amazing SDXL DreamBooth model
29:10 How to get an amazing prompt list for DreamBooth models and use it
30:25 The dataset I used for DreamBooth training myself and why it is deliberately low quality
30:46 How to generate TensorRT for LoRA models
33:30 Where and how to see TensorRT profiles you have for each model
36:57 Generating LoRA TensorRT for SD 1.5 and testing it
39:54 How to fix the bug where a TensorRT LoRA has no effect
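
For convenience, here is a minimal command sketch of the installation flow covered in the chapters above, assuming a standard Windows setup with Git and Python on PATH. The repository URLs are the public Automatic1111 and NVIDIA repos; everything else follows the defaults shown in the video.

    :: Fresh Automatic1111 install (chapter 2:15)
    git clone https://github.com/AUTOMATIC1111/stable-diffusion-webui
    cd stable-diffusion-webui

    :: TensorRT extension via the git clone method (chapter 7:58);
    :: the URL-install method (chapter 7:19) uses the same repo URL
    :: pasted into Extensions > Install from URL in the Web UI
    git clone https://github.com/NVIDIA/Stable-Diffusion-WebUI-TensorRT extensions/Stable-Diffusion-WebUI-TensorRT

    :: Switch to the dev branch for SDXL TensorRT (chapter 18:08)
    git checkout dev
    git pull

    :: Launch the Web UI (chapter 6:35)
    webui-user.bat

After the first launch, add sd_vae and sd_unet to the quicksettings list under Settings > User Interface (chapter 3:32) so you can switch to the generated TensorRT UNet from the main screen.
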
Comments

If I have been of assistance to you and you would like to show your support for my work, please consider becoming a patron on 🥰 ⤵

Technology & Science: News, Tips, Tutorials, Tricks, Best Applications, Guides, Reviews ⤵

Playlist of StableDiffusion Tutorials, Automatic1111 and Google Colab Guides, DreamBooth, Textual Inversion / Embedding, LoRA, AI Upscaling, Pix2Pix, Img2Img ⤵

SECourses

Holy smokes! I would have never thought to look into this without your tutorials. I cannot believe I have had this RTX for months and I did not think to do this. Maestro!! ('.')7

reapicus

So much value in this video, thank you for sharing this for free!
And it's amazing that NVIDIA made this repo. In a couple of years Auto1111 will probably be considered like Photoshop, and Stable Diffusion skills will be valuable.

ArtificialBeauties

Very interesting, thanks! It works well on the non-dev Automatic1111 version.

VooDooEf

Thank you! I hope there will be something like this for ComfyUI.

captainoctonion

RTX 3060 Ti: 4.28 it/s > 6.63 it/s. New sub, thanks. (I use the default engine)

RANDOM-ixpn

Performance is quite promising. I installed it on A1111 v1.6 without any problem, and generation is quite fast: 4 seconds compared to 6 seconds at 768x768 resolution.
However, the time it takes to export an engine each time you switch resolution, checkpoint, or LoRA is very long; sometimes more than 30 minutes for higher resolutions on my RTX 3060 with 12 GB VRAM.

kenrock

Now I guess you can install it and get to work. The initial errors are gone, but there will be others :)
Thanks, Furkan

michail_

Very informative video! Much appreciated.

covninja

Quite some huge packages to download :))

JackTorcello

If you've done a recent installation, the CUDA DLL files will already be up to date and TensorRT will work right away.

絵空事-oe

In img2img at 1152x1152 with 0.55 denoising, a render that took 56 s on my RTX 3080 Ti takes 15 s with TensorRT. Thanks for the walkthrough. In addition, the min and optimal prompt tokens should be kept at 75 and only the max tokens should be raised (whatever you set, min and optimal get equalized and it bugs out). Also, even if you delete TensorRT profile models, they still show as present in the system and cause a not-working bug; to remove them you have to edit models.json manually.

Imquorra
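
If you run into the stale-profile bug described in the comment above, a rough cleanup sketch follows. The folder and file names here are assumptions based on the extension's default layout and may differ between versions, so verify them in your own install before deleting anything.

    :: Run from the stable-diffusion-webui folder
    dir models\Unet-trt       :: compiled TensorRT engine files
    dir models\Unet-onnx      :: exported ONNX files (chapter 14:00)
    :: Open the metadata file the commenter mentions (named model.json in some
    :: extension versions), remove the entries for engines you have already
    :: deleted, then restart the Web UI
    notepad models\Unet-trt\models.json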

Thanks for the video. How do I convert the newly created JSON file to a TRT file?

erickevinz

I don't have an nvidia folder in venv\Lib\site-packages, although I did install the TensorRT extension from the Extensions tab.

vfbotgl

Like someone mentioned, it's not worth it for a 4090. And having to generate an engine every time makes no sense..

pastuh

Right now this is more of a proof of concept. It has some uses when running fixed pipelines with larger volumes of images, but there is not much benefit for the average A1111 user. In fact, most of the time it will just mess up your workflow due to all the limitations.

lennylein

Hey! Found you on GitHub. A question about the min/max prompt token count in TensorRT: did you try >75? There is a 0.3 beta on GitHub, but it looks like there is no fix for that problem; the issue is still open.

chf

This is really useful for speeding up SDXL image generation. However, it requires much more VRAM: you need at least an NVIDIA GPU with 12 GB VRAM and Sysmem Fallback enabled. During the process, you should not do anything else (e.g. browsing the internet) to avoid the process being interrupted abnormally.

Also, TensorRT will not work when the --medvram or --lowvram flags are enabled.

AmirZaimMohdZaini
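
Consistent with the warning above (and with chapter 15:48), here is a minimal webui-user.bat sketch that avoids the incompatible flags. The --xformers flag is optional and shown only as a commonly used companion flag, not something the extension requires.

    @echo off
    :: webui-user.bat -- make sure --medvram and --lowvram are NOT present,
    :: since TensorRT does not work with them
    set COMMANDLINE_ARGS=--xformers
    call webui.bat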

I keep getting ERROR:root:Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat1 in method wrapper_CUDA_addmm)
No clue how to solve this; I've been looking everywhere! I only have one RTX 4090, I don't have more than one GPU, and I'm on the dev branch too. I came to your video hoping to find a solution.

TechMDYoutube

Hey!! At minute 32:32 you skipped over the issue with the LoRAs not appearing in the list. I have this issue as well, and they're not showing up after a restart. Any solution? Thank you, amazing video!

pablo.montero