From Scratch: Cache Tiled Matrix Multiplication in CUDA

In this video we look at implementing cache tiled matrix multiplication from scratch in CUDA!
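A kernel along these lines can be sketched as follows. This is a minimal version, not the video's exact code: it assumes square N x N matrices with N a multiple of the 16 x 16 tile edge, and the names (matrixMul, s_a, s_b, TILE) are illustrative.

```cuda
#define TILE 16

__global__ void matrixMul(const int *a, const int *b, int *c, int N) {
    // Statically allocated shared-memory tiles (the "cache" in cache tiling)
    __shared__ int s_a[TILE][TILE];
    __shared__ int s_b[TILE][TILE];

    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    int tmp = 0;
    // Sweep tiles across the shared (inner) dimension
    for (int i = 0; i < N; i += TILE) {
        // Each thread stages one element of each tile into shared memory
        s_a[threadIdx.y][threadIdx.x] = a[row * N + i + threadIdx.x];
        s_b[threadIdx.y][threadIdx.x] = b[(i + threadIdx.y) * N + col];
        __syncthreads();

        // Partial dot product using only the fast shared-memory tiles
        for (int j = 0; j < TILE; j++) {
            tmp += s_a[threadIdx.y][j] * s_b[j][threadIdx.x];
        }
        // Wait before overwriting the tiles in the next iteration
        __syncthreads();
    }
    c[row * N + col] = tmp;
}
```

Each global-memory element is loaded once per tile sweep instead of once per multiply-add, which is the point of the tiling.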

Comments

Genius and cool. I'm a C++ dev, and now I'm able to use CUDA to play around with Mandelbrot, NVH, FFT, and things like that.

TheNaso

Sorry, I have a stupid question: does dim3 threads(THREADS, THREADS); mean that each block has 256 threads, because 16 * 16 = 256?

ALIENrobot
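Yes: with THREADS = 16 (the value the video appears to use), a block is a 16 x 16 grid of threads, i.e. 256 threads per block. A hedged host-side sketch of that launch configuration (d_a, d_b, d_c and N are assumed):

```cuda
const int THREADS = 16;   // tile edge; one thread per tile element
const int N = 1 << 10;    // assumed matrix dimension

// 16 * 16 = 256 threads per block; blocks cover the N x N output
dim3 threads(THREADS, THREADS);
dim3 blocks(N / THREADS, N / THREADS);

matrixMul<<<blocks, threads>>>(d_a, d_b, d_c, N);
```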

Hey Nick, great tutorial! I don't know if you're still active on this. I understand from reading the NVIDIA dev docs that you have to pad non-square matrices. But is it possible to compute C = A^T * B by directly accessing memory in the matrixMul kernel? What I mean is loading A as A^T directly into shared memory.

Mrfaces
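Computing C = A^T * B without materializing A^T in global memory is possible by transposing the A tile as it is staged into shared memory. A hedged sketch, reusing the tile names from the tutorial-style kernel (TILE, s_a, s_b are assumptions); reads stay coalesced because consecutive threadIdx.x values still read consecutive global addresses:

```cuda
#define TILE 16

__global__ void matrixMulAT(const int *a, const int *b, int *c, int N) {
    __shared__ int s_a[TILE][TILE];
    __shared__ int s_b[TILE][TILE];

    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    int tmp = 0;
    for (int i = 0; i < N; i += TILE) {
        // Coalesced read of an A tile, stored transposed: after this,
        // s_a[ty][j] holds A[i + j][row], i.e. (A^T)[row][i + j]
        s_a[threadIdx.x][threadIdx.y] =
            a[(i + threadIdx.y) * N + blockIdx.y * TILE + threadIdx.x];
        s_b[threadIdx.y][threadIdx.x] = b[(i + threadIdx.y) * N + col];
        __syncthreads();

        // Same inner loop as the plain kernel
        for (int j = 0; j < TILE; j++)
            tmp += s_a[threadIdx.y][j] * s_b[j][threadIdx.x];
        __syncthreads();
    }
    c[row * N + col] = tmp;  // C = A^T * B
}
```

The swapped indices on the s_a store do the transpose for free; the trade-off is potential shared-memory bank conflicts on that store, often mitigated by padding the tile (e.g. s_a[TILE][TILE + 1]).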

Hey @Nick,
How can we change the tile size from 16 to 8 or 2? I changed the 16 x 16 x 4 in the SHMEM calculation to 2 x 2 x 4, but verification fails.

jugalgore
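A likely cause of the verification failure: the shared-memory tile edge must match the block edge. If only the SHMEM size is changed to 2 x 2 x 4 while the block stays 16 x 16, threads index past the 2 x 2 tile. A hedged sketch of driving everything from one constant (names assumed):

```cuda
// Change the tile size in ONE place; any value works while N % TILE == 0
#define TILE 8   // was 16; try 8 or 2

__global__ void matrixMul(const int *a, const int *b, int *c, int N) {
    // TILE * TILE * sizeof(int) bytes each: the "16 x 16 x 4" figure
    __shared__ int s_a[TILE][TILE];
    __shared__ int s_b[TILE][TILE];
    // ... same body as before, looping j over TILE ...
}

// Host side: the launch configuration must use the same constant,
// so the block shape matches the tile shape
dim3 threads(TILE, TILE);
dim3 blocks(N / TILE, N / TILE);
```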

Thanks. Because I'm lazy I'll ask: inside the kernel, could the elements be accessed with two indices, like i[row][col], or is computing a single flat index faster?

allankiipli
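Two-index access is possible inside a kernel when the row width is known at compile time, and it compiles to the same row * N + col address arithmetic, so there is no speed difference; on a raw pointer to a dynamically sized matrix, only the flat index works. A hedged sketch (N and the cast are assumptions):

```cuda
#define N 1024

// Taking a pointer-to-row lets the kernel use m[row][col] directly
__global__ void addOne(int (*m)[N]) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    // Same address as a flat m_flat[row * N + col]
    m[row][col] += 1;
}

// Host side: reinterpret the flat device allocation as rows of N ints
// addOne<<<blocks, threads>>>(reinterpret_cast<int (*)[N]>(d_m));
```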

You didn't cudaFree the memory; is this not needed?

alexzan
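Strictly speaking the program still works without it, because the CUDA context (and all device allocations) is torn down at process exit; explicit cleanup is still good practice in longer-running programs. A hedged sketch of the end of main, assuming the usual d_a/d_b/d_c device pointers:

```cuda
// Release device allocations once results are copied back
cudaFree(d_a);
cudaFree(d_b);
cudaFree(d_c);

// Heap-allocated host buffers (if any) are freed as in plain C/C++
free(h_a);
free(h_b);
free(h_c);
```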