FASTER Ray Tracing with Multithreading // Ray Tracing series



🧭 FOLLOW ME

📚 RESOURCES (in order of complexity)

💾 SOFTWARE you'll need installed to follow this series

CHAPTERS
0:00 - Optimization
1:25 - Multithreading and Rendering
7:05 - Making our code multithreaded

Welcome to the exciting new Ray Tracing Series! Ray tracing is a very common technique for generating photo-realistic digital imagery, which is exactly what we'll be doing in this series. Aside from learning all about ray tracing and the math that goes into it, as well as how to implement it, we'll also be focusing on performance and optimization in C++ to make our renderer as efficient as possible. We'll eventually switch to using the GPU instead of the CPU (using Vulkan) to run our ray tracing algorithms, as this will be much faster than using the CPU. This will also be a great introduction to leveraging the power of the GPU in the software you write. All of the code will be released episode by episode, and if you need help, check out the raytracing-series channel on my Discord server. I'm really looking forward to this series and I hope you are too! ❤️

This video is sponsored by Brilliant.

#RayTracing
Comments

Maybe a Matt Parker phenomenon erupts from the internets and we get a 40,832,277,770% improvement. Or maybe not, because we're not starting with Python.

FabricioSTH

This video seems made for me. I was on a huge time crunch, so I had to implement a ray tracer with reflections, BVH, etc. in about 36 hours total. It took a lot of coffee, but I got it done. It's reasonably performant, but I rendered a similar scene using Cycles in Blender and it is simply so much faster. What takes Blender seconds takes me minutes, even with multithreading, and I don't have "fancy" features such as texture mapping running yet.

blackbriarmead

4:10 That's actually wrong: GPUs don't have thousands of cores; what they have is bigger SIMD widths, usually 64-256. CPUs also have SIMD widths of 8-16, so you can actually turn your CPU into a GPU if you're willing to vectorize or use intrinsics.

Theawesomeking

In my own version of this, I initially tried grouping a chunk of rows per thread and got good improvements. But then I noticed that certain blocks would take longer to run if there was a lot going on in that part of the image, so you'd have one thread working alone when all the others were finished. I ended up using a thread pool and allocating each thread in the pool to work on one pixel; once that pixel was calculated, the thread would go back in the pool and pick up the next pixel to work on. This worked very well and keeps the CPU maxed out until there are fewer pixels left to calculate than cores available to work on them.

I'd love to change the code to work on GPU, and I did try for a while to get Metal to work but just couldn't work it out…

bishboria
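The per-pixel approach described above can be sketched without a full thread pool by handing out pixel indices from a shared atomic counter, so faster threads naturally pick up more work. This is a minimal sketch, not the video's code; `RenderParallel` and the `shade` placeholder are hypothetical names.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// Each worker pulls the next pixel index from a shared atomic counter,
// keeping all cores busy until the work runs out.
std::vector<uint32_t> RenderParallel(uint32_t width, uint32_t height)
{
    std::vector<uint32_t> image(width * height, 0);
    std::atomic<uint32_t> next{0};

    // Placeholder "shading"; a real renderer would trace a ray here.
    auto shade = [](uint32_t x, uint32_t y) { return x + y; };

    unsigned threadCount = std::thread::hardware_concurrency();
    if (threadCount == 0)
        threadCount = 4; // hardware_concurrency() may return 0

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < threadCount; ++t)
    {
        workers.emplace_back([&] {
            uint32_t i;
            while ((i = next.fetch_add(1)) < width * height)
                image[i] = shade(i % width, i / width);
        });
    }
    for (std::thread& w : workers)
        w.join();
    return image;
}
```

The `fetch_add` hands each pixel to exactly one thread with no locking, which is what keeps the load balanced even when some pixels are much more expensive than others.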

Isn't `std::execution::par` enforcing sequential execution within each invocation, which is not required here? I believe simply switching to `std::execution::par_unseq` would be an instant speedup.

But ultimately, thread creation has overhead, and creating exactly as many threads as there are logical cores and distributing the work among them would be faster.

But then again, not all threads would have the same amount of work, since some pixels take longer than others, so to fully saturate all threads for the whole frame it would be better to use a work-stealing thread pool.

However, exactly N threads (N = the number of logical cores) might still be faster even if not perfectly balanced, if each one gets distinct tiles with thread-local data for better cache locality...

Kazyek
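For reference, a minimal sketch of the two policies discussed above (assuming a C++17 toolchain with a working parallel backend): `par` permits multi-threaded execution, while `par_unseq` additionally permits vectorized/interleaved execution, which is why the callable passed to it must avoid locks. The `ProcessRows` name is illustrative.

```cpp
#include <algorithm>
#include <atomic>
#include <execution>
#include <numeric>
#include <vector>

// "Renders" height rows under std::execution::par: the implementation may
// run the callable on multiple threads. Under par_unseq it could also
// vectorize/interleave invocations, so the body must be lock-free.
int ProcessRows(int height)
{
    std::vector<int> rows(height);
    std::iota(rows.begin(), rows.end(), 0);

    std::atomic<int> processed{0};
    std::for_each(std::execution::par, rows.begin(), rows.end(),
                  [&](int /*y*/) { processed.fetch_add(1); }); // per-row work here
    return processed.load();
}
```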

Excellent challenge, cheers Cherno! Loving this series.
There's a lot of optimisation possible here: 2x faster (around 60ms/16.6fps to 30ms/33.3fps) is some way below what we'd expect from fully independent worker units (check: are they? add a per-worker timer and look for a normal/abnormal timing distribution), all this assuming the maximum thread count isn't set to 2, of course ;)
I'd also be checking the thread allocation process (hint: is there another, more 'direct' way?), and making sure the work is split up and allocated 100% optimally across the maximum threads returned from hardware_concurrency() (though historically that wasn't 100% guaranteed to work, since it can return 0; I don't know if it's since been fixed...been a while for me).

ChrisM
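On the hardware_concurrency() point above: the standard still allows it to return 0 when the value is "not computable", so this is not something that has since been fixed; the usual workaround is a guard like the sketch below (the `WorkerCount` name and the fallback of 4 are arbitrary).

```cpp
#include <thread>

// hardware_concurrency() is only a hint and may legally return 0,
// so clamp it before using it to size a thread pool.
unsigned WorkerCount()
{
    unsigned n = std::thread::hardware_concurrency();
    return n == 0 ? 4u : n; // 4 is an arbitrary fallback
}
```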

YO BELATED HAPPY BDAY MAN! Wish I came sooner, would have wished you on the day :( HAVE A GOOD ONE EVERY DAY

srisairayapudi

Superior performance can also be achieved with techniques other than multi-threading. In fact, threading can actually be slower when the synchronization effort outweighs the performance gains (see Amdahl's law).

First, notice that CPUs already have parallelism built into their instruction pipeline. Fetching / decoding / executing / writing back results can be performed in parallel for successive instructions if they don't depend on each other's results. Rearranging instructions in assembly can yield crazy gains (but with C/C++ we usually don't dig that deep).

Second, there are dedicated SIMD instruction sets on modern CPUs that can perform the same operations for multiple inputs (256 / 512 bit wide registers) at once to increase data throughput (e.g. 8 or 16 float ops at once).

Third, avoiding allocation can save lots of compute, too. Preprocessing data only once upfront is very nice, and having smaller stack frames to allocate / destroy is also important. Using some static, rewritable cache memory that's owned by one thread can really help performance by keeping stack frames small (at the downside of non-threadsafe code).

And last, there are different CPU cache layers whose access latencies differ by orders of magnitude. So fitting all the memory in a faster cache and constantly reusing it will skyrocket the performance. CPUs have great latency once the data is loaded into a register. Small and simple is fast.

Maybe this inspires some devs here to write faster programs. Cheers, have fun at optimizing 🤓👨🏻‍💻🏎️

marcotroster
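To make the SIMD point above concrete, this is the kind of loop compilers auto-vectorize into the 8- or 16-wide float operations mentioned: contiguous data, no cross-iteration dependencies. A sketch only; it assumes optimization flags like `-O2`/`-O3` (and e.g. `-march=native`) are enabled.

```cpp
#include <cstddef>
#include <vector>

// A dependency-free loop over contiguous floats: with optimizations on,
// compilers typically emit SIMD multiplies that process 8 or 16 floats
// per instruction instead of one scalar multiply per iteration.
void Scale(std::vector<float>& values, float factor)
{
    for (std::size_t i = 0; i < values.size(); ++i)
        values[i] *= factor;
}
```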

Messing around trying to optimise the code a bit, I noticed that your implementation of Random::InUnitSphere() is wrong. It's biased towards values in the directions of the corners of the unit box surrounding the sphere (because it draws a sample from the unit box and then normalizes that sample to fit on the surface of a sphere).

Alkanen
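One standard unbiased alternative to normalizing a cube sample is rejection sampling: keep drawing from the surrounding cube and discard points that fall outside the sphere. This is a generic sketch, not the series' code; the `Vec3` struct stands in for whatever vector class the renderer uses.

```cpp
#include <random>

struct Vec3 { float x, y, z; };

// Draw uniformly from the [-1,1]^3 cube and keep only points inside the
// unit sphere. Unlike normalizing a cube sample, the accepted points are
// uniform over the ball (and can be normalized for a uniform direction).
Vec3 RandomInUnitSphere(std::mt19937& rng)
{
    std::uniform_real_distribution<float> dist(-1.0f, 1.0f);
    while (true)
    {
        Vec3 p{dist(rng), dist(rng), dist(rng)};
        if (p.x * p.x + p.y * p.y + p.z * p.z < 1.0f)
            return p;
    }
}
```

The loop accepts roughly pi/6 ≈ 52% of draws, so on average it needs about two iterations per sample.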

Hey, thanks for the series. The first video kick-started my learning process about path tracing. In my opinion the series was a little slow, and I was eager to outpace it, so I wrote a Vulkan path tracer in Rust and learned most things by doing them. Now I'm writing my Bachelor's thesis about differentiable path tracing. Btw, Mitsuba 3 is a great tool for learning about path tracing as well, especially if you don't want to deal with C++. Anyway, thanks for the inspiration.

theonetribble

I really appreciate this! I've always wondered how multithreading is implemented but always got stuck on the syntax. Are there any plans to show how to set up rendering on a graphics card?

jfgh

Thank you very much for the effort you put into this video. I've learnt a lot from your tips.
Could you please make some videos about how to optimize the case where the computation for one pixel depends on the surrounding pixels (for example: convolution, Gaussian filtering...)?
Once again, thank you and have a nice year!

manuntn

Can you take a look at the issues and PRs on the Walnut repo?
It has some serious problems right now.

nathans_codes

I am not a programmer or a developer, but I am genuinely curious.
Would it be possible to say to the hardware:

"Hey, can you run ray tracing in parallel on 3 cores but only use 60% of the cores, and assign the remaining 40% to enemy AI calculations?"

jeofthevirtuoussand

I have been following along with this series while writing in Rust instead of C++ to see how the two compare. Until this episode, everything on the Rust side has been matching the C++ performance, if not somewhat better. (For comparison with the laptop: my desktop PC with an i9-9900K gets about 15ms where the laptop gets about 60ms single-threaded.)

One thing Rust suffers from here is mutating simple structures in an async context. A mutex or RwLock is required to do what the multithreading asks for, unless you allocate temporary buffers (one for both the image data and accumulation data). In an unsafe context it would be a lot easier, but unfortunately Rust lacks a lot of things for async, including some unsafe items; SyncUnsafeCell has yet to be stabilized.

So from here on out I guess I'll stick with single-threaded and see how the performance goes. I'd rather do that than clone two large vectors on every iteration. Just my two cents from outside of C++ :)

ezpzgamez

std::iota is the "fancy function" you were avoiding for generating sequences, FYI ;)

lithium
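For anyone who hasn't seen std::iota, it fills a range with sequentially increasing values, which replaces the handwritten fill loop for the pixel-index vectors. A minimal sketch; the `MakeIndices` name is illustrative.

```cpp
#include <numeric>
#include <vector>

// Builds the index vector 0, 1, ..., count-1 without a manual loop.
std::vector<unsigned> MakeIndices(unsigned count)
{
    std::vector<unsigned> indices(count);
    std::iota(indices.begin(), indices.end(), 0u); // 0, 1, 2, ...
    return indices;
}
```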

Allocating a vector of numbers going from 0 to width and height made me very sad, although I get that this is for the sake of simplicity.

For anyone interested though, here are some tips:

A more proper way to go about this would be to either implement a custom range iterator (look up LegacyIterator on cppreference) or use std::ranges::iota_view, which is roughly equivalent to Python's `range()` or Rust's `x..y` thingy.

You can also just avoid the parallel for_each, and instead split work across multiple threads by giving them responsibility over equally divided ranges of scanlines. This is pretty straightforward to implement and should yield good enough performance.

ovi
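The scanline-splitting idea above can be sketched as follows: divide the rows into contiguous, near-equal chunks and give each thread one chunk. This is a generic sketch, not the series' code; `RenderByScanlines` is a hypothetical name and `rowDone` stands in for real per-scanline output.

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// One contiguous range of scanlines per thread; chunk size is the
// ceiling of height / threadCount so every row is covered exactly once.
void RenderByScanlines(std::vector<int>& rowDone)
{
    const unsigned height = static_cast<unsigned>(rowDone.size());
    if (height == 0)
        return;

    unsigned threadCount = std::thread::hardware_concurrency();
    if (threadCount == 0) threadCount = 4; // may legally return 0
    threadCount = std::min(threadCount, height);

    const unsigned chunk = (height + threadCount - 1) / threadCount; // ceil
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < threadCount; ++t)
    {
        const unsigned begin = t * chunk;
        const unsigned end = std::min(begin + chunk, height);
        if (begin >= end) break;
        threads.emplace_back([&rowDone, begin, end] {
            for (unsigned y = begin; y < end; ++y)
                rowDone[y] = 1; // render scanline y here
        });
    }
    for (std::thread& th : threads)
        th.join();
}
```

As the comments in this thread note, static splitting can leave one thread finishing last if its rows are more expensive; that's the trade-off against the per-pixel / work-stealing approaches discussed above.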

7:07 I want multithreading, at my table, until tomorrow!

anime_erotika

Hello, thank you for your video; it looks very useful. However, I have a problem: my ray tracer doesn't gain any performance from applying your changes (it even gets slightly worse), and when I look at my processor usage using htop, only one of my cores is being used. I am on Linux, compiling with g++ through CMake. Are there some flags I could use to actually make it multithreaded?

ups_
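A possible cause of the one-core symptom above (an assumption about the toolchain, not a verified diagnosis): GCC's libstdc++ implements the C++17 parallel algorithms on top of Intel TBB, and without TBB installed and linked the execution policies can end up running serially. The package and CMake names below assume a Debian/Ubuntu-style setup.

```cpp
// Build notes (assumed Debian/Ubuntu toolchain):
//   sudo apt install libtbb-dev
//   g++ -std=c++17 -O2 main.cpp -pthread -ltbb
// With CMake: find_package(TBB REQUIRED) and
//   target_link_libraries(raytracer PRIVATE TBB::tbb)
#include <algorithm>
#include <execution>
#include <vector>

// Quick smoke test: if this keeps several cores busy in htop,
// the parallel backend is actually working.
int DoubleAll(std::vector<int>& v)
{
    std::for_each(std::execution::par, v.begin(), v.end(),
                  [](int& x) { x *= 2; });
    return v.empty() ? 0 : v.front();
}
```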

I actually had an assignment recently where we made a ray tracer. Kind of funny that I also used std::for_each, which I had not heard of before. The only difference was that I just looped over one vector containing each pixel index rather than an inner and outer loop.

jumponblocker