GTC 2022 - How CUDA Programming Works - Stephen Jones, CUDA Architect, NVIDIA

Come for an introduction to programming the GPU by the lead architect of CUDA. CUDA is unique in being a programming language designed and built hand-in-hand with the hardware it runs on. Stepping up from last year's "How GPU Computing Works" deep dive into the architecture of the GPU, we'll look at how hardware design motivates the CUDA language and how the CUDA language motivates the hardware design. This is not a course on CUDA programming. It's a foundation on what works, what doesn't work, and why. We'll tell you how to think about a problem in a way that will run well on the GPU, and you'll see how the CUDA programming model is built to run that way. If you're new to CUDA, we'll give you the core background knowledge you need — getting started begins with understanding. If you're an expert, hopefully you'll face your next optimization problem with a new perspective on what might work, and why.
Comments

This video is pure gold: thanks so much for uploading I've learnt so much from it. I may have to watch it several times though!!! A great overview and introduction to so many areas for further study.

citizensmith

1) One thing that's confusing: if reading from a memory location in a different row is 3x slower than reading from one in the same row, how come we get a 13x slowdown? Worst case (if you're deliberately reading from a different row each time), one would expect only a 3x slowdown?

What am I missing? Is it the burst mode?

2) You're using the float2 type, so that means your thread is loading 4 bytes (for 2 points), not 8 bytes? Which would put the 4 warps into 512B loading territory instead of the optimal 1024B? -> EDIT: ok, I just saw that p1 & p2 are actually float pointers, so that does make sense.

3) How can we guarantee that p1 & p2 arrays (holding the points) are adjacent, i.e. in the same physical row in memory?

Great video! The sound quality is a bit off though.

TheAIEpiphany

Excellent. For the matrix multiply, you're reusing the same row multiple times, but the columns would have to be loaded in every time. So how do you increase the compute intensity of the columns?

steveHoweisno

Christopher, do you think the long time it takes for RAM to be accessed could be decreased by embedding a basic CPU in those RAM modules?

webgpu

33:10 FlashAttention proved this wrong

codingmachine

Looks like Intel is out of the question here.

dGooddBaddUgly

Is there a chance you can do a video about why AMD's version isn't as good as NVIDIA's?

GeorgePaul