FASTER Ray Tracing with Multithreading // Ray Tracing series



🧭 FOLLOW ME

📚 RESOURCES (in order of complexity)

💾 SOFTWARE you'll need installed to follow this series

CHAPTERS
0:00 - Optimization
1:25 - Multithreading and Rendering
7:05 - Making our code multithreaded

Welcome to the exciting new Ray Tracing Series! Ray tracing is a very common technique for generating photo-realistic digital imagery, which is exactly what we'll be doing in this series. Aside from learning all about ray tracing and the math that goes into it, as well as how to implement it, we'll also be focusing on performance and optimization in C++ to make our renderer as efficient as possible. We'll eventually switch to using the GPU instead of the CPU (using Vulkan) to run our ray tracing algorithms, as this will be much faster than using the CPU. This will also be a great introduction to leveraging the power of the GPU in the software you write. All of the code will be released episode by episode, and if you need help, check out the raytracing-series channel on my Discord server. I'm really looking forward to this series and I hope you are too! ❤️

This video is sponsored by Brilliant.

#RayTracing
Comments

Maybe a Matt Parker phenomenon erupts from the internets and we get a 40,832,277,770% improvement. Or maybe not, because we're not starting with Python.

FabricioSTH

This video seems made for me. I was on a huge time crunch, so I had to implement a ray tracer with reflections, BVH, etc. in about 36 hours total. It took a lot of coffee, but I got it done. It's reasonably performant, but I rendered a similar scene using Cycles in Blender and it is simply so much faster. What takes Blender seconds takes me minutes, even with multithreading, and I don't have "fancy" features such as texture mapping running yet.

blackbriarmead

4:10 That's actually wrong: GPUs don't have thousands of cores; what they have is bigger SIMD widths, usually 64-256. CPUs also have SIMD widths of 8-16, so you can actually turn your CPU into a GPU if you're willing to vectorize or use intrinsics.

Theawesomeking

In my own version of this, I initially tried grouping a chunk of rows per thread and got good improvements. But then I noticed that certain blocks would take longer to run if there was a lot going on in that part of the image, so you'd have one thread working alone when all the others were finished. I ended up using a thread pool and allocating each thread in the pool to work on one pixel; once that pixel was calculated, the thread would go back in the pool and pick up the next pixel to work on. This worked very well and keeps the CPU maxed out until there are fewer pixels left to calculate than cores available to work on them.

I'd love to change the code to work on GPU, and I did try for a while to get Metal to work but just couldn't work it out…

bishboria
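The per-pixel approach described above can be sketched without a full thread pool by handing out pixel indices from a shared atomic counter, so faster threads naturally pick up more work. This is a minimal sketch, not the video's code; `RenderParallel` and the `shade` placeholder are hypothetical names.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// Each worker pulls the next pixel index from a shared atomic counter,
// keeping all cores busy until the work runs out.
std::vector<uint32_t> RenderParallel(uint32_t width, uint32_t height)
{
    std::vector<uint32_t> image(width * height, 0);
    std::atomic<uint32_t> next{0};

    // Placeholder "shading"; a real renderer would trace a ray here.
    auto shade = [](uint32_t x, uint32_t y) { return x + y; };

    unsigned threadCount = std::thread::hardware_concurrency();
    if (threadCount == 0)
        threadCount = 4; // hardware_concurrency() may return 0

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < threadCount; ++t)
    {
        workers.emplace_back([&] {
            uint32_t i;
            while ((i = next.fetch_add(1)) < width * height)
                image[i] = shade(i % width, i / width);
        });
    }
    for (std::thread& w : workers)
        w.join();
    return image;
}
```

The `fetch_add` hands each pixel to exactly one thread with no locking, which is what keeps the load balanced even when some pixels are much more expensive than others.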

Isn't `std::execution::par` enforcing sequential execution within each invocation, which is not required here? I believe simply switching to `std::execution::par_unseq` would be an instant speedup.

But ultimately, thread creation has overhead, and creating exactly as many threads as there are logical cores and distributing the work among them would be faster.

But then again, not all threads would have the same amount of work, since some pixels take longer than others, so to fully saturate all threads for the whole frame it would be better to use a work-stealing thread pool.

However, exactly N threads (N = the number of logical cores) might still be faster even if not perfectly balanced, if each one gets distinct tiles with thread-local data for better cache locality...

Kazyek
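For reference, a minimal sketch of the two policies discussed above (assuming a C++17 toolchain with a working parallel backend): `par` permits multi-threaded execution, while `par_unseq` additionally permits vectorized/interleaved execution, which is why the callable passed to it must avoid locks. The `ProcessRows` name is illustrative.

```cpp
#include <algorithm>
#include <atomic>
#include <execution>
#include <numeric>
#include <vector>

// "Renders" height rows under std::execution::par: the implementation may
// run the callable on multiple threads. Under par_unseq it could also
// vectorize/interleave invocations, so the body must be lock-free.
int ProcessRows(int height)
{
    std::vector<int> rows(height);
    std::iota(rows.begin(), rows.end(), 0);

    std::atomic<int> processed{0};
    std::for_each(std::execution::par, rows.begin(), rows.end(),
                  [&](int /*y*/) { processed.fetch_add(1); }); // per-row work here
    return processed.load();
}
```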

Excellent challenge, cheers Cherno! Loving this series.
There's a lot of optimisation possible here: 2x faster (around 60ms/16.6fps to 30ms/33.3fps) is some way below what we'd expect from fully independent worker units (check: are they? add a per-worker timer and look for a normal/abnormal timing distribution), all this assuming the maximum thread count isn't set to 2, of course ;)
I'd also be checking the thread allocation process (hint: is there another, more 'direct' way?), and making sure the work is split up and allocated 100% optimally across the maximum threads returned from hardware_concurrency() (though historically that wasn't 100% guaranteed to work, since it can return 0; I don't know if it's since been fixed...been a while for me).

ChrisM
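On the hardware_concurrency() point above: the standard still allows it to return 0 when the value is "not computable", so this is not something that has since been fixed; the usual workaround is a guard like the sketch below (the `WorkerCount` name and the fallback of 4 are arbitrary).

```cpp
#include <thread>

// hardware_concurrency() is only a hint and may legally return 0,
// so clamp it before using it to size a thread pool.
unsigned WorkerCount()
{
    unsigned n = std::thread::hardware_concurrency();
    return n == 0 ? 4u : n; // 4 is an arbitrary fallback
}
```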

YO BELATED HAPPY BDAY MAN! Wish I came sooner, would have wished you on the day :( HAVE A GOOD ONE EVERY DAY

srisairayapudi

Superior performance can also be achieved with techniques other than multi-threading. In fact, threading can actually be slower when the synchronization effort outweighs the performance gains (see Amdahl's law).

First, notice that CPUs already have parallelism built into their instruction pipeline. Fetching / decoding / executing / writing back results can be performed in parallel for successive instructions if they don't depend on each other's results. Rearranging instructions in assembly can yield crazy gains (but with C/C++ we usually don't dig that deep).

Second, there are dedicated SIMD instruction sets on modern CPUs that can perform the same operations for multiple inputs (256 / 512 bit wide registers) at once to increase data throughput (e.g. 8 or 16 float ops at once).

Third, avoiding allocation can save lots of compute, too. Preprocessing data only once upfront is very nice, and having smaller stack frames to allocate / destroy is also important. Using some static, rewritable cache memory that's owned by one thread can really help performance by keeping stack frames small (at the downside of non-threadsafe code).

And last, there are different CPU cache layers whose access latencies differ by orders of magnitude. So fitting all the memory in a faster cache and constantly reusing it will skyrocket the performance. CPUs have great latency once the data is loaded into a register. Small and simple is fast.

Maybe this inspires some devs here to write faster programs. Cheers, have fun at optimizing 🤓👨🏻‍💻🏎️

marcotroster
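To make the SIMD point above concrete, this is the kind of loop compilers auto-vectorize into the 8- or 16-wide float operations mentioned: contiguous data, no cross-iteration dependencies. A sketch only; it assumes optimization flags like `-O2`/`-O3` (and e.g. `-march=native`) are enabled.

```cpp
#include <cstddef>
#include <vector>

// A dependency-free loop over contiguous floats: with optimizations on,
// compilers typically emit SIMD multiplies that process 8 or 16 floats
// per instruction instead of one scalar multiply per iteration.
void Scale(std::vector<float>& values, float factor)
{
    for (std::size_t i = 0; i < values.size(); ++i)
        values[i] *= factor;
}
```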

Messing around trying to optimise the code a bit, I noticed that your implementation of Random::InUnitSphere() is wrong. It's biased towards values in the directions of the corners of the unit box surrounding the sphere (because it draws a sample from the unit box and then normalizes that sample to fit on the surface of a sphere).

Alkanen
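One standard unbiased alternative to normalizing a cube sample is rejection sampling: keep drawing from the surrounding cube and discard points that fall outside the sphere. This is a generic sketch, not the series' code; the `Vec3` struct stands in for whatever vector class the renderer uses.

```cpp
#include <random>

struct Vec3 { float x, y, z; };

// Draw uniformly from the [-1,1]^3 cube and keep only points inside the
// unit sphere. Unlike normalizing a cube sample, the accepted points are
// uniform over the ball (and can be normalized for a uniform direction).
Vec3 RandomInUnitSphere(std::mt19937& rng)
{
    std::uniform_real_distribution<float> dist(-1.0f, 1.0f);
    while (true)
    {
        Vec3 p{dist(rng), dist(rng), dist(rng)};
        if (p.x * p.x + p.y * p.y + p.z * p.z < 1.0f)
            return p;
    }
}
```

The loop accepts roughly pi/6 ≈ 52% of draws, so on average it needs about two iterations per sample.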

Hey, thanks for the series. The first video kick-started my learning process about path tracing. In my opinion the series was a little slow, and I was eager to outpace it, so I wrote a Vulkan path tracer in Rust and learned most things by doing them. Now I'm writing my Bachelor's thesis about differentiable path tracing. Btw, Mitsuba 3 is a great tool for learning about path tracing as well, especially if you don't want to deal with C++. Anyway, thanks for the inspiration.

theonetribble

I really appreciate this! I've always wondered how multithreading is implemented but always got stuck on the syntax. Are there any plans to show how to set up rendering on a graphics card?

jfgh

Thank you very much for the effort you put into this video. I've learnt a lot from your tips.
Could you please make some videos about how to optimize the case where the computation for one pixel depends on the surrounding pixels (for example: convolution, Gaussian filtering...)?
Once again, thank you and have a nice year!

manuntn

Can you take a look at the issues and PRs on the Walnut repo?
It has some serious problems right now.

nathans_codes

I am not a programmer or a developer, but I am genuinely curious.
Would it be possible to say to the hardware:

"Hey, can you run ray tracing in parallel on 3 cores but only use 60% of the cores, and assign the remaining 40% to enemy AI calculations?"

jeofthevirtuoussand

I have been following along with this series while writing in Rust instead of C++ to see how the two compare. Until this episode, everything on the Rust side has been matching the C++ performance, if not somewhat better. (For comparison with the laptop: my desktop PC with an i9-9900K gets about 15ms where the laptop gets about 60ms single-threaded.)

One thing Rust suffers from here is mutating simple structures in an async context. A mutex or RwLock is required to do what the multithreading asks for, unless you allocate temporary buffers (one for both the image data and accumulation data). In an unsafe context it would be a lot easier, but unfortunately Rust lacks a lot of things for async, including some unsafe items; SyncUnsafeCell has yet to be stabilized.

So from here on out I guess I'll stick with single-threaded and see how the performance goes. I'd rather do that than clone two large vectors on every iteration. Just my two cents from outside of C++ :)

ezpzgamez

std::iota is the "fancy function" you were avoiding for generating sequences, FYI ;)

lithium
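For anyone who hasn't seen std::iota, it fills a range with sequentially increasing values, which replaces the handwritten fill loop for the pixel-index vectors. A minimal sketch; the `MakeIndices` name is illustrative.

```cpp
#include <numeric>
#include <vector>

// Builds the index vector 0, 1, ..., count-1 without a manual loop.
std::vector<unsigned> MakeIndices(unsigned count)
{
    std::vector<unsigned> indices(count);
    std::iota(indices.begin(), indices.end(), 0u); // 0, 1, 2, ...
    return indices;
}
```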

Allocating a vector of numbers going from 0 to width and height made me very sad, although I get that this is for the sake of simplicity.

For anyone interested though, here are some tips:

A more proper way to go about this would be to either implement a custom range iterator (look up LegacyIterator on cppreference) or use std::ranges::iota_view, which is roughly equivalent to Python's `range()` or Rust's `x..y` thingy.

You can also just avoid the parallel for_each, and instead split work across multiple threads by giving them responsibility over equally divided ranges of scanlines. This is pretty straightforward to implement and should yield good enough performance.

ovi
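The scanline-splitting idea above can be sketched as follows: divide the rows into contiguous, near-equal chunks and give each thread one chunk. This is a generic sketch, not the series' code; `RenderByScanlines` is a hypothetical name and `rowDone` stands in for real per-scanline output.

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// One contiguous range of scanlines per thread; chunk size is the
// ceiling of height / threadCount so every row is covered exactly once.
void RenderByScanlines(std::vector<int>& rowDone)
{
    const unsigned height = static_cast<unsigned>(rowDone.size());
    if (height == 0)
        return;

    unsigned threadCount = std::thread::hardware_concurrency();
    if (threadCount == 0) threadCount = 4; // may legally return 0
    threadCount = std::min(threadCount, height);

    const unsigned chunk = (height + threadCount - 1) / threadCount; // ceil
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < threadCount; ++t)
    {
        const unsigned begin = t * chunk;
        const unsigned end = std::min(begin + chunk, height);
        if (begin >= end) break;
        threads.emplace_back([&rowDone, begin, end] {
            for (unsigned y = begin; y < end; ++y)
                rowDone[y] = 1; // render scanline y here
        });
    }
    for (std::thread& th : threads)
        th.join();
}
```

As the comments in this thread note, static splitting can leave one thread finishing last if its rows are more expensive; that's the trade-off against the per-pixel / work-stealing approaches discussed above.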

7:07 I want multithreading, at my table, until tomorrow!

anime_erotika

Hello, thank you for your video; it looks very useful. However, I have a problem: my ray tracer doesn't gain any performance from applying your changes (it even gets slightly worse), and when I look at my processor usage using htop, only one of my cores is being used. I am on Linux, compiling with g++ through CMake. Are there some flags I could use to actually make it multithreaded?

ups_
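A possible cause of the one-core symptom above (an assumption about the toolchain, not a verified diagnosis): GCC's libstdc++ implements the C++17 parallel algorithms on top of Intel TBB, and without TBB installed and linked the execution policies can end up running serially. The package and CMake names below assume a Debian/Ubuntu-style setup.

```cpp
// Build notes (assumed Debian/Ubuntu toolchain):
//   sudo apt install libtbb-dev
//   g++ -std=c++17 -O2 main.cpp -pthread -ltbb
// With CMake: find_package(TBB REQUIRED) and
//   target_link_libraries(raytracer PRIVATE TBB::tbb)
#include <algorithm>
#include <execution>
#include <vector>

// Quick smoke test: if this keeps several cores busy in htop,
// the parallel backend is actually working.
int DoubleAll(std::vector<int>& v)
{
    std::for_each(std::execution::par, v.begin(), v.end(),
                  [](int& x) { x *= 2; });
    return v.empty() ? 0 : v.front();
}
```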

I actually had an assignment recently where we made a ray tracer. Kind of funny that I also used std::for_each, which I had not heard of before. The only difference was that I just looped over one vector containing each pixel index rather than an inner and outer loop.

jumponblocker