From Scratch: Cache Tiled Matrix Multiplication in CUDA

In this video we look at implementing cache tiled matrix multiplication from scratch in CUDA!
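A kernel along these lines can be sketched as follows. This is a minimal version, not the video's exact code: it assumes square N x N matrices with N a multiple of the 16 x 16 tile edge, and the names (matrixMul, s_a, s_b, TILE) are illustrative.

```cuda
#define TILE 16

__global__ void matrixMul(const int *a, const int *b, int *c, int N) {
    // Statically allocated shared-memory tiles (the "cache" in cache tiling)
    __shared__ int s_a[TILE][TILE];
    __shared__ int s_b[TILE][TILE];

    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    int tmp = 0;
    // Sweep tiles across the shared (inner) dimension
    for (int i = 0; i < N; i += TILE) {
        // Each thread stages one element of each tile into shared memory
        s_a[threadIdx.y][threadIdx.x] = a[row * N + i + threadIdx.x];
        s_b[threadIdx.y][threadIdx.x] = b[(i + threadIdx.y) * N + col];
        __syncthreads();

        // Partial dot product using only the fast shared-memory tiles
        for (int j = 0; j < TILE; j++) {
            tmp += s_a[threadIdx.y][j] * s_b[j][threadIdx.x];
        }
        // Wait before overwriting the tiles in the next iteration
        __syncthreads();
    }
    c[row * N + col] = tmp;
}
```

Each global-memory element is loaded once per tile sweep instead of once per multiply-add, which is the point of the tiling.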

Comments

Genius and cool. I'm a C++ dev, and now I'm able to use CUDA to play around with Mandelbrot, NVH, FFT, and things like that.

TheNaso

Sorry, I have a stupid question: does dim3 threads(THREADS, THREADS); mean that each block has 256 threads, because 16 * 16 = 256?

ALIENrobot
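Yes: with THREADS = 16 (the value the video appears to use), a block is a 16 x 16 grid of threads, i.e. 256 threads per block. A hedged host-side sketch of that launch configuration (d_a, d_b, d_c and N are assumed):

```cuda
const int THREADS = 16;   // tile edge; one thread per tile element
const int N = 1 << 10;    // assumed matrix dimension

// 16 * 16 = 256 threads per block; blocks cover the N x N output
dim3 threads(THREADS, THREADS);
dim3 blocks(N / THREADS, N / THREADS);

matrixMul<<<blocks, threads>>>(d_a, d_b, d_c, N);
```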

Hey Nick, great tutorial! I don't know if you're still active on this. I understand from reading the NVIDIA dev docs that you have to pad non-square matrices. But is it possible to compute C = A^T * B by directly accessing memory in the matrixMul kernel? What I mean is loading A as A^T directly into shared memory.

Mrfaces
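Computing C = A^T * B without materializing A^T in global memory is possible by transposing the A tile as it is staged into shared memory. A hedged sketch, reusing the tile names from the tutorial-style kernel (TILE, s_a, s_b are assumptions); reads stay coalesced because consecutive threadIdx.x values still read consecutive global addresses:

```cuda
#define TILE 16

__global__ void matrixMulAT(const int *a, const int *b, int *c, int N) {
    __shared__ int s_a[TILE][TILE];
    __shared__ int s_b[TILE][TILE];

    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    int tmp = 0;
    for (int i = 0; i < N; i += TILE) {
        // Coalesced read of an A tile, stored transposed: after this,
        // s_a[ty][j] holds A[i + j][row], i.e. (A^T)[row][i + j]
        s_a[threadIdx.x][threadIdx.y] =
            a[(i + threadIdx.y) * N + blockIdx.y * TILE + threadIdx.x];
        s_b[threadIdx.y][threadIdx.x] = b[(i + threadIdx.y) * N + col];
        __syncthreads();

        // Same inner loop as the plain kernel
        for (int j = 0; j < TILE; j++)
            tmp += s_a[threadIdx.y][j] * s_b[j][threadIdx.x];
        __syncthreads();
    }
    c[row * N + col] = tmp;  // C = A^T * B
}
```

The swapped indices on the s_a store do the transpose for free; the trade-off is potential shared-memory bank conflicts on that store, often mitigated by padding the tile (e.g. s_a[TILE][TILE + 1]).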

Hey @Nick,
How can we change the tile size from 16 to 8 or 2? I changed the 16 x 16 x 4 in the SHMEM calculation to 2 x 2 x 4, but verification fails.

jugalgore
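A likely cause of the verification failure: the shared-memory tile edge must match the block edge. If only the SHMEM size is changed to 2 x 2 x 4 while the block stays 16 x 16, threads index past the 2 x 2 tile. A hedged sketch of driving everything from one constant (names assumed):

```cuda
// Change the tile size in ONE place; any value works while N % TILE == 0
#define TILE 8   // was 16; try 8 or 2

__global__ void matrixMul(const int *a, const int *b, int *c, int N) {
    // TILE * TILE * sizeof(int) bytes each: the "16 x 16 x 4" figure
    __shared__ int s_a[TILE][TILE];
    __shared__ int s_b[TILE][TILE];
    // ... same body as before, looping j over TILE ...
}

// Host side: the launch configuration must use the same constant,
// so the block shape matches the tile shape
dim3 threads(TILE, TILE);
dim3 blocks(N / TILE, N / TILE);
```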

Thanks. Because I'm lazy I'll ask: inside the kernel, could the elements be accessed with two indices, like i[row][col], or is computing a single flat index faster?

allankiipli
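Two-index access is possible inside a kernel when the row width is known at compile time, and it compiles to the same row * N + col address arithmetic, so there is no speed difference; on a raw pointer to a dynamically sized matrix, only the flat index works. A hedged sketch (N and the cast are assumptions):

```cuda
#define N 1024

// Taking a pointer-to-row lets the kernel use m[row][col] directly
__global__ void addOne(int (*m)[N]) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    // Same address as a flat m_flat[row * N + col]
    m[row][col] += 1;
}

// Host side: reinterpret the flat device allocation as rows of N ints
// addOne<<<blocks, threads>>>(reinterpret_cast<int (*)[N]>(d_m));
```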

You didn't cudaFree the memory; is this not needed?

alexzan
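Strictly speaking the program still works without it, because the CUDA context (and all device allocations) is torn down at process exit; explicit cleanup is still good practice in longer-running programs. A hedged sketch of the end of main, assuming the usual d_a/d_b/d_c device pointers:

```cuda
// Release device allocations once results are copied back
cudaFree(d_a);
cudaFree(d_b);
cudaFree(d_c);

// Heap-allocated host buffers (if any) are freed as in plain C/C++
free(h_a);
free(h_b);
free(h_c);
```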