CUDA Crash Course: GPU Performance Optimizations Part 1

In this video we look at a step-by-step performance optimization of matrix multiplication in CUDA!

Comments

We need more videos like this (in-depth performance tuning, with profiling and analysis). Good work, man.

syfaiz

Great material, it's really difficult to find tips for beginning CUDA programmers. I come from PyTorch, and thanks to you I successfully implemented multiplication of a dense vector by a sparse binary matrix. My use case is very specific and very demanding in terms of performance, so your videos really helped a lot. Thanks!

mhnatiuk

Another optimization strategy is to make the block size a multiple of the warp size and the grid size a multiple of the number of SMs in the GPU. Well, this may not apply in your example, as your block size is already a multiple of the warp size. Really enjoyed your explanation.

muneshchauhan
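The suggestion above can be sketched as a launch configuration that queries the device at runtime. This is an illustrative sketch, not the video's code; `myKernel` and `n` are hypothetical placeholders:

```cuda
#include <cuda_runtime.h>

// Hypothetical 1-D kernel; stands in for whatever is being launched.
__global__ void myKernel(float *data, int n) { /* ... */ }

void launchRounded(float *d_data, int n) {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);

  // Block size: a multiple of the warp size (32 on current GPUs).
  int threads = 4 * prop.warpSize;  // 128 threads per block

  // Grid size: enough blocks to cover n, rounded up to a
  // multiple of the SM count so work spreads evenly over SMs.
  int sms = prop.multiProcessorCount;
  int blocks = (n + threads - 1) / threads;
  blocks = ((blocks + sms - 1) / sms) * sms;

  myKernel<<<blocks, threads>>>(d_data, n);
}
```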
Автор

Hey Nick - you've been tremendously helpful. Thanks for your insights!

AndrewCodeDev

Hey Nick, your videos are truly a life saver, you covered so many important topics and are never afraid to get your hands dirty :).
Can we expect a part 2 of this video, or is it already somewhere outside of this playlist?

aleksandarcvetkovic

Great content Nick. Helped me a lot with understanding things better :)

rahulramesh

Love it, just wanted to add my two cents: when you introduced coalescing, it was a bit confusing to me what exactly you meant. But the improvements from your change should be exactly:
a) only one 32 byte (minimum size) read transaction from A per warp per iteration instead of 2
b) coalesced 128 byte read transaction from B per warp per iteration
c) coalesced 128 byte write transaction at the very end (likely the least significant)

SnoSixtyTwo
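For readers following along, the access pattern behind points (a)–(c) can be sketched as below. The names are illustrative, not the video's exact code; the per-warp transaction counts assume 4-byte floats and a warp of 32 threads mapped along x:

```cuda
// Coalesced naive matmul: one thread per output element of C.
__global__ void matMul(const float *A, const float *B, float *C, int N) {
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  int col = blockIdx.x * blockDim.x + threadIdx.x;

  float acc = 0.0f;
  for (int k = 0; k < N; k++) {
    // (a) A[row * N + k]: all 32 threads in a warp share the same row,
    //     so they read the SAME address -> one 32-byte (minimum-size)
    //     transaction, broadcast to the warp.
    // (b) B[k * N + col]: consecutive threads read consecutive floats
    //     -> one coalesced 128-byte transaction per warp per iteration.
    acc += A[row * N + k] * B[k * N + col];
  }
  // (c) Consecutive threads write consecutive floats
  //     -> one coalesced 128-byte write per warp at the very end.
  C[row * N + col] = acc;
}
```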

Cool tutorial. You explain everything so clearly.

jianxiang

Hey Nick, awesome video! Curious when part 2 is going to come out

jerrickhoang

Taught me a lot, man! Will have to go through it a couple of times to get the full extent of it though 😅

eladon

Where is the next video, or part 2? I heard you say that the next video would be more about optimizing tiled matrix multiplication, but I can't find anything like that on your channel. I'm dying to watch it.

jfd

Amazing, thank you very much for this video!

eaemmm

How did you compute `int row` and `int col`? Is there another guide that I can follow?

tushargarg
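For context, assuming the video's naive kernel uses the standard 2-D launch, each thread computes one element of C, and its global row/column come from the block and thread indices. A minimal sketch:

```cuda
// One thread per output element of an N x N matrix C = A * B.
__global__ void matMulNaive(const float *A, const float *B, float *C, int N) {
  // blockIdx  = which block this is within the grid
  // blockDim  = how many threads each block has
  // threadIdx = which thread this is within its block
  int row = blockIdx.y * blockDim.y + threadIdx.y;  // global row of C
  int col = blockIdx.x * blockDim.x + threadIdx.x;  // global column of C

  if (row < N && col < N) {
    float acc = 0.0f;
    for (int k = 0; k < N; k++)
      acc += A[row * N + k] * B[k * N + col];
    C[row * N + col] = acc;
  }
}
```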

Hi Nick, great content, but the #pragma unroll actually made the performance worse on the system I'm running it on (an Nvidia V100).
For a 10k × 10k matrix, I'm going from 720 ms to 750 ms. Any idea why that is?

thibautmodrzyk
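One possible explanation (a guess, worth verifying with a profiler): full unrolling can raise register pressure and lower occupancy, and the trade-off differs across architectures such as Volta. A bounded unroll factor is sometimes the better compromise:

```cuda
__global__ void matMul(const float *A, const float *B, float *C, int N) {
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  int col = blockIdx.x * blockDim.x + threadIdx.x;

  float acc = 0.0f;
  // A bare `#pragma unroll` asks nvcc to unroll the loop as far as it can,
  // which can increase register usage and code size. Giving an explicit
  // factor bounds both; benchmark a few factors on the target GPU.
  #pragma unroll 4
  for (int k = 0; k < N; k++)
    acc += A[row * N + k] * B[k * N + col];
  C[row * N + col] = acc;
}
```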

Hi Nick, I'm trying to profile the matrix multiplication CUDA code (the same as your naive matrix multiplication code) with Nvidia Nsight. I tried with 1<<10 and it worked, but with 1<<11 the profiler didn't catch the kernel launch. I have an Nvidia GTX 960M GPU.
Is this a problem with my GPU's capability, or is something else wrong?
Thanks in advance.

_lilkm
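One thing worth ruling out (an assumption, not a diagnosis): if the kernel at the larger size fails to launch, or runs long enough to hit the display-driver watchdog on a laptop GPU, the profiler will show no kernel. Checking the runtime's error codes after the launch will tell you. `matMul`, `grid`, `block`, and the `d_*` pointers below are placeholders for the actual code being profiled:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder declaration for the naive matmul kernel under test.
__global__ void matMul(const float *A, const float *B, float *C, int N);

void checkedLaunch(dim3 grid, dim3 block,
                   const float *d_A, const float *d_B, float *d_C, int N) {
  matMul<<<grid, block>>>(d_A, d_B, d_C, N);

  cudaError_t err = cudaGetLastError();   // launch-time errors (bad config, ...)
  if (err != cudaSuccess)
    printf("launch failed: %s\n", cudaGetErrorString(err));

  err = cudaDeviceSynchronize();          // errors while the kernel runs
  if (err != cudaSuccess)                 // (e.g. watchdog timeout, bad access)
    printf("kernel failed: %s\n", cudaGetErrorString(err));
}
```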