Lecture 1 How to profile CUDA kernels in PyTorch

Comments

'I believe what I see'

I'm in the right place. Thanks!!

mlock

Thanks for this course. It's very useful to me and my team.

burnessduan

At 30:40, where you change BLOCK_SIZE to 1024: how is it possible to reach 8000 GB/s when the max memory bandwidth of an A10G is only 600 GB/s? I think setting BLOCK_SIZE = 1024 makes Triton compute only the first 1024 columns of the matrix while ignoring the rest, so when you compute the GB/s, the "seconds" part is fixed while the "GB" grows linearly (128 * i) — that's why you're seeing the perf grow linearly. Also, the reason the little `torch.allclose` test didn't complain is that you're only testing a small matrix (1823, 781), whose n_cols <= 1024.
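The inflation effect this comment describes can be sketched in plain Python. The shapes and the 0.05 ms runtime below are made-up illustrative numbers, not measurements from the lecture: if the kernel's runtime stays fixed because it only ever touches the first 1024 columns, but the bandwidth formula credits it with the full matrix, the reported GB/s grows linearly with n_cols.

```python
def reported_gbps(n_rows, n_cols, ms, dtype_bytes=4):
    """Bandwidth as typically reported: one read + one write of the
    FULL matrix, divided by the measured runtime in milliseconds."""
    gb = 2 * n_rows * n_cols * dtype_bytes / 1e9
    return gb / (ms / 1e3)

# Hypothetical scenario: the kernel really only processes 1024 columns,
# so its runtime is ~constant (say 0.05 ms) no matter how wide the matrix is.
for n_cols in (1024, 4096, 16384):
    print(n_cols, round(reported_gbps(4096, n_cols, ms=0.05)))
```

With a fixed runtime, quadrupling n_cols quadruples the "bandwidth", which is how a 600 GB/s card can appear to hit 8000 GB/s. The fix in the Triton kernel is to either loop over column tiles or mask loads with `mask=offsets < n_cols`, so every column is actually read.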

loabrasumente

Nice walk-through, Mark!

So in practice, at a high level, one would profile the code, identify the perf bottlenecks, and then replace some of the functions associated with those bottlenecks with a direct CUDA/Triton implementation?

TheAIEpiphany

Do you have any suggestions for comprehensive resources or study materials that can help a beginner learn about CPUs and GPUs, particularly focusing on their roles and functions in Machine Learning and Deep Learning? I'm looking for in-depth yet accessible information to build a strong foundation in this area, which will enable me to understand the technical aspects discussed in certain videos related to ML/DL, especially this one :).

elliot

Oh no, now I have no excuse to be a productive member of my village. Oh, I accidentally subscribed — the terror.

zerotwo

I don't have a GPU at home. Where can I find the best access to a GPU that also gives access to NCU (Nsight Compute)? Getting the environment set up seems key.

vivekkaul