CUDA Crash Course: GPU Performance Optimizations Part 1

In this video we look at a step-by-step performance optimization of matrix multiplication in CUDA!

Comments

We need more videos like this (in-depth performance tuning, with profiling and analysis). Good work, man.

syfaiz

Great material, it's really difficult to find tips for beginning CUDA programmers. I come from PyTorch, and thanks to you I successfully implemented multiplication of a dense vector by a sparse binary matrix. My use case is very specific and very demanding in terms of performance, so your videos really helped a lot. Thanks!

mhnatiuk

Another optimization strategy is to make the block size a multiple of the warp size and the grid size a multiple of the number of SMs in the GPU. Well, this may not apply in your example, as your block size is already a multiple of the warp size. Really enjoyed your explanation.

muneshchauhan
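The suggestion above can be sketched as a launch configuration that queries the device at runtime. This is an illustrative sketch, not the video's code; `myKernel` and `n` are hypothetical placeholders:

```cuda
#include <cuda_runtime.h>

// Hypothetical 1-D kernel; stands in for whatever is being launched.
__global__ void myKernel(float *data, int n) { /* ... */ }

void launchRounded(float *d_data, int n) {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);

  // Block size: a multiple of the warp size (32 on current GPUs).
  int threads = 4 * prop.warpSize;  // 128 threads per block

  // Grid size: enough blocks to cover n, rounded up to a
  // multiple of the SM count so work spreads evenly over SMs.
  int sms = prop.multiProcessorCount;
  int blocks = (n + threads - 1) / threads;
  blocks = ((blocks + sms - 1) / sms) * sms;

  myKernel<<<blocks, threads>>>(d_data, n);
}
```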
Автор

Hey Nick - you've been tremendously helpful. Thanks for your insights!

AndrewCodeDev

Hey Nick, your videos are truly a life saver, you covered so many important topics and are never afraid to get your hands dirty :).
Can we expect a part 2 of this video, or is it already somewhere outside of this playlist?

aleksandarcvetkovic

Great content Nick. Helped me a lot with understanding things better :)

rahulramesh

Love it, just wanted to add my two cents: when you introduced coalescing, it was a bit confusing to me what exactly you meant. But the improvements from your change should be exactly:
a) only one 32 byte (minimum size) read transaction from A per warp per iteration instead of 2
b) coalesced 128 byte read transaction from B per warp per iteration
c) coalesced 128 byte write transaction at the very end (likely the least significant)

SnoSixtyTwo
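For readers following along, the access pattern behind points (a)–(c) can be sketched as below. The names are illustrative, not the video's exact code; the per-warp transaction counts assume 4-byte floats and a warp of 32 threads mapped along x:

```cuda
// Coalesced naive matmul: one thread per output element of C.
__global__ void matMul(const float *A, const float *B, float *C, int N) {
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  int col = blockIdx.x * blockDim.x + threadIdx.x;

  float acc = 0.0f;
  for (int k = 0; k < N; k++) {
    // (a) A[row * N + k]: all 32 threads in a warp share the same row,
    //     so they read the SAME address -> one 32-byte (minimum-size)
    //     transaction, broadcast to the warp.
    // (b) B[k * N + col]: consecutive threads read consecutive floats
    //     -> one coalesced 128-byte transaction per warp per iteration.
    acc += A[row * N + k] * B[k * N + col];
  }
  // (c) Consecutive threads write consecutive floats
  //     -> one coalesced 128-byte write per warp at the very end.
  C[row * N + col] = acc;
}
```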

Cool tutorial. You explain everything so clearly.

jianxiang

Hey Nick, awesome video! Curious when part 2 is going to come out

jerrickhoang

Taught me a lot, man! Will have to go through it a couple of times to get the full extent of it though 😅

eladon

Where is the next video, or part 2? I heard you say that the next video would be more about optimizing tiled matrix multiplication, but I can't find anything like that on your channel. I'm dying to watch it.

jfd

Amazing, thank you very much for this video!

eaemmm

How did you compute `int row` and `int col`? Is there another guide that I can follow?

tushargarg
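For context, assuming the video's naive kernel uses the standard 2-D launch, each thread computes one element of C, and its global row/column come from the block and thread indices. A minimal sketch:

```cuda
// One thread per output element of an N x N matrix C = A * B.
__global__ void matMulNaive(const float *A, const float *B, float *C, int N) {
  // blockIdx  = which block this is within the grid
  // blockDim  = how many threads each block has
  // threadIdx = which thread this is within its block
  int row = blockIdx.y * blockDim.y + threadIdx.y;  // global row of C
  int col = blockIdx.x * blockDim.x + threadIdx.x;  // global column of C

  if (row < N && col < N) {
    float acc = 0.0f;
    for (int k = 0; k < N; k++)
      acc += A[row * N + k] * B[k * N + col];
    C[row * N + col] = acc;
  }
}
```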

Hi Nick, great content, but the #pragma unroll actually made the performance worse on the system I'm running it on (an Nvidia V100).
For a 10k × 10k matrix, I'm going from 720 ms to 750 ms. Any idea why that is?

thibautmodrzyk
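One possible explanation (a guess, worth verifying with a profiler): full unrolling can raise register pressure and lower occupancy, and the trade-off differs across architectures such as Volta. A bounded unroll factor is sometimes the better compromise:

```cuda
__global__ void matMul(const float *A, const float *B, float *C, int N) {
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  int col = blockIdx.x * blockDim.x + threadIdx.x;

  float acc = 0.0f;
  // A bare `#pragma unroll` asks nvcc to unroll the loop as far as it can,
  // which can increase register usage and code size. Giving an explicit
  // factor bounds both; benchmark a few factors on the target GPU.
  #pragma unroll 4
  for (int k = 0; k < N; k++)
    acc += A[row * N + k] * B[k * N + col];
  C[row * N + col] = acc;
}
```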

Hi Nick, I'm trying to profile the matrix multiplication CUDA code (the same as your naive matrix multiplication code) with Nvidia Nsight. I tried with 1<<10 and it worked, but with 1<<11 the profiler didn't catch the kernel launch. I have an Nvidia GTX 960M GPU.
Is this a problem with my GPU's capability, or is something else wrong?
Thanks in advance.

_lilkm
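One thing worth ruling out (an assumption, not a diagnosis): if the kernel at the larger size fails to launch, or runs long enough to hit the display-driver watchdog on a laptop GPU, the profiler will show no kernel. Checking the runtime's error codes after the launch will tell you. `matMul`, `grid`, `block`, and the `d_*` pointers below are placeholders for the actual code being profiled:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder declaration for the naive matmul kernel under test.
__global__ void matMul(const float *A, const float *B, float *C, int N);

void checkedLaunch(dim3 grid, dim3 block,
                   const float *d_A, const float *d_B, float *d_C, int N) {
  matMul<<<grid, block>>>(d_A, d_B, d_C, N);

  cudaError_t err = cudaGetLastError();   // launch-time errors (bad config, ...)
  if (err != cudaSuccess)
    printf("launch failed: %s\n", cudaGetErrorString(err));

  err = cudaDeviceSynchronize();          // errors while the kernel runs
  if (err != cudaSuccess)                 // (e.g. watchdog timeout, bad access)
    printf("kernel failed: %s\n", cudaGetErrorString(err));
}
```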