Give compute shader raytracing a significant speedup with this one trick.

#gamedev #gamedevelopment #programming

Comments

I remember making a CUDA pathtracer and I found a group size of 8x8 or 16x16 was really good and increased my performance just like you have discovered.

awiseseal

I think this translates well to CUDA, where you don't launch <<<1, 1>>>, but rather something like <<<64, 512>>> kernels, depending on what's happening. I assume the same thing is happening here. Each compute unit (CU) runs its threads in hardware-implemented lockstep batches (like AVX lanes, for example), so 1, 1 will just use 1 lane of one batch on 1 CU. Nvidia GPUs run 32 "threads" per batch (a warp), AMD GPUs run 64 (a wavefront; newer RDNA parts can also run 32).

By making the launch 8, 8, that's 64 threads per workgroup, so you might ask, aren't the next 32 gonna wait, since a warp only fits 32? Yep, they become a second warp! But the scheduler can interleave the two, so when one batch of 32 threads is waiting on a memory request, the other 32 can sneak in and run. That's why you still see a gain.

Going from 1, 1 -> 2, 2, that's a 4x improvement to "occupancy" (1 thread to 4 threads per workgroup), so we may expect to see a ~4x speedup. The FPS went from 50 -> 180, which is 3.6x. That's pretty close!
And then, going from 2, 2 -> 4, 4, that's another 4x expected, but this time we only see ~2.33x. Hmm, not as much, but hey, still good! Remember though, there are other parameters at play too: memory reads/writes, OpenGL overhead, etc.
Now going from 4, 4 -> 8, 8 doesn't give us all that much. Here, we're probably memory bound, rather than being limited by compute capacity of a CU.

But, the important takeaway is, make sure you fill your CU! Know your hardware! Software abstractions are fine, but at the end of the day, you're juggling electrons, and pushing them down the pathways purpose built for the task is the way to achieve the highest performance!

(I'm either speaking out of my ass here or the concept still applies haha)

dexterman
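
The occupancy arithmetic in the comment above can be checked with a few lines. This is only a rough sketch: `lane_utilization` is a hypothetical helper, and the batch size of 32 is an assumption (an Nvidia warp); pass 64 for an AMD wavefront. It ignores memory bandwidth, caches, and how many batches a CU can keep in flight, so it gives an upper bound on the expected gain, not a prediction.

```python
def lane_utilization(group_x, group_y, wave_size=32):
    """Fraction of lockstep lanes doing useful work for one workgroup.

    wave_size=32 is an assumption (Nvidia warp); use 64 for AMD wavefronts.
    """
    threads = group_x * group_y
    # The hardware allocates whole batches, so round up to the next multiple.
    waves = (threads + wave_size - 1) // wave_size
    return threads / (waves * wave_size)


for size in (1, 2, 4, 8):
    threads = size * size
    print(f"{size}x{size}: {threads} threads, "
          f"{lane_utilization(size, size):.1%} of lanes busy")
```

Under that assumption, 1x1 keeps 1 of 32 lanes busy, 2x2 keeps 4, 4x4 keeps 16, and 8x8 fills two whole batches, so each of the first two steps can give at most 4x and the last step at most 2x.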

That all works because it's easier for the GPU to compute all the pixels in the buffer in big batches than to work through thousands of tiny subgroups...

davnoa

It is wild to me that your workgroup size was 1x1. I have never seen anyone go below 8x8.

SilverXenolupus

This is an incredible trick!! The only downside I've encountered is that sometimes it doesn't render the pixels on the far right and bottom of the window.

christophercoronaios
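
A likely cause of the missing right and bottom pixels described above (an assumption, since the dispatch code isn't shown) is computing the number of workgroups with truncating integer division when the window size isn't a multiple of the group size. The sketch below uses hypothetical helper names and a hypothetical 1366x768 window with 8x8 groups to show the difference between truncating and rounding up.

```python
def groups_truncated(pixels, group_size):
    """Workgroup count with plain integer division (drops the remainder)."""
    return pixels // group_size


def groups_rounded_up(pixels, group_size):
    """Workgroup count rounded up so the last partial tile is still covered."""
    return (pixels + group_size - 1) // group_size


def covered_pixels(pixels, group_size, num_groups):
    """How many pixels along one axis actually get a shader invocation."""
    return min(pixels, num_groups * group_size)


width, height, group = 1366, 768, 8  # hypothetical window and group size

for name, count_fn in (("truncated", groups_truncated),
                       ("rounded up", groups_rounded_up)):
    gx, gy = count_fn(width, group), count_fn(height, group)
    print(f"{name}: dispatch {gx}x{gy} groups, covers "
          f"{covered_pixels(width, group, gx)}x"
          f"{covered_pixels(height, group, gy)} of {width}x{height}")
```

When rounding up, some invocations land outside the image, so the shader should return early for those, for example by comparing `gl_GlobalInvocationID.xy` against the image size before writing.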

Just wow!! In less than 5, what some videos take decades to explain!! #ExponentialDelivery *whoa*!!!

t.e.k.profitstraders

Can someone explain the reason behind this speedup?

alexlau