Speed up your Rust code with Rayon

Today we are learning how to easily parallelize your sequential Rust code with Rayon.

Chapters:
0:00 Intro
0:31 How to use Rayon
4:03 How Rayon works
6:04 Customization
6:16 Outro
Comments

I really like how your first example showed a situation where using multiple threads is actually slower. A lot of people/explanations talk about big Oh notation and performance as '(better big Oh || more threads) == more better'. But what often doesn't get mentioned is that it is highly dependent on the context and the amount of data you are working with.

Optimization without benchmarking isn't optimization.

mathijsfrank
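
A minimal sketch of what "benchmark first" can look like with only the standard library. The helper name `median_time` and the choice of the median as the summary statistic are assumptions for illustration; this is not the benchmarking tool used in the video.

```rust
use std::time::{Duration, Instant};

/// Runs `f` `runs` times and returns the last result together with the
/// median wall-clock duration. `median_time` is a hypothetical helper name.
fn median_time<R>(runs: usize, mut f: impl FnMut() -> R) -> (R, Duration) {
    assert!(runs > 0, "need at least one run");
    let mut times = Vec::with_capacity(runs);
    let mut last = None;
    for _ in 0..runs {
        let start = Instant::now();
        // black_box discourages the compiler from optimizing the work away.
        let out = std::hint::black_box(f());
        times.push(start.elapsed());
        last = Some(out);
    }
    times.sort();
    (last.unwrap(), times[runs / 2])
}
```

Calling `median_time(11, || data.iter().filter(|x| **x % 2 == 0).count())` once for the sequential version and once for the parallel one gives two numbers to compare, which is the point the comment makes: measure before assuming more threads help.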

I definitely appreciate you showing a case where it isn't actually faster. Really helps highlight the importance of benchmarking if you intend to actually optimize.

timwhite

Polars could be a good fit for the next video (Lightning-fast DataFrame library). You could even benchmark the same way you did in this video.

williamdroz

great video!

At 4:49 I had to check my glasses, then my connection, then I realized what you did :)

codeshowbr

Thankfully, I already used it for my weekend ray tracer. It's pretty easy and scales perfectly for longer tasks.

LtdJorge

The content was great. It was helpful to see that the parallel version can be slower in some situations. The stock footage was a bit distracting, especially the blurry bit (maybe that was an in-joke I didn’t get).

Perspectologist

It reminds me of the parallelStream method in Java's Stream API.

ИмяФамилия-хве

Mate, I fkn love your videos. Every time I am stuck, you have a video there to save the day

DavidAlsh

Curious what it does under the hood. If you wrote the parallelism yourself, for two cores, you would split the 200,000 items into two arrays and assign 1 core to each array, which ought to have minimal overhead. But if it splits the 200,000 items into 200,000 tasks which have to be stolen, that is a lot of overhead per small item.

If you rewrote your iteration to be over chunks, and did counting over chunks, adding together at the end, would rayon perform better?

You could divide the task into 2, 4, 8, 16, 32, 64, 128 chunks and see how much performance degrades. But I bet even 128 chunks, which would spread out well over most CPUs, would have a 1000x better ratio of overhead to benefit than 200,000 individual tasks.

AlwinMao
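
The hand-rolled chunking experiment described in this comment can be sketched with scoped threads from the standard library. This is an illustration of the commenter's proposal, not a description of how Rayon splits work internally; the function name `count_in_chunks` and the one-thread-per-chunk policy are assumptions.

```rust
use std::thread;

/// Counts the items matching `pred` by splitting `data` into at most
/// `num_chunks` contiguous chunks and counting each chunk on its own
/// scoped thread. `count_in_chunks` is a hypothetical helper name.
fn count_in_chunks<T, F>(data: &[T], num_chunks: usize, pred: F) -> usize
where
    T: Sync,
    F: Fn(&T) -> bool + Sync,
{
    if data.is_empty() {
        return 0;
    }
    let num_chunks = num_chunks.max(1);
    // Ceiling division so every item lands in some chunk.
    let chunk_len = (data.len() + num_chunks - 1) / num_chunks;
    let pred = &pred;
    thread::scope(|s| {
        // Spawn all workers first, then join them, so they run concurrently.
        let handles: Vec<_> = data
            .chunks(chunk_len)
            .map(|chunk| s.spawn(move || chunk.iter().filter(|x| pred(*x)).count()))
            .collect();
        // Add the per-chunk counts together at the end.
        handles.into_iter().map(|h| h.join().unwrap()).sum::<usize>()
    })
}
```

Timing `count_in_chunks(&data, n, |x| x % 2 == 0)` for n = 2, 4, 8, ..., 128 would reproduce the proposed experiment. Within Rayon itself, `par_chunks` on slices and `with_min_len` on indexed parallel iterators look like the closest knobs for coarser tasks, though whether they help for this workload would need measuring.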

Rayon is really amazing! It's actually incredibly performant.
I have even been comparing it with loop parallelization in Fortran and C, which can be done by a compiler such as GCC, with far fewer guarantees. In my tests Rayon performed faster, even though modern Fortran in particular has some interesting features as well, such as a `do concurrent` loop and so-called array programming features.
And apart from that, loop parallelization absolutely only works with the proper compiler flags, otherwise it does not.
I like the expressive functional programming way of Rust a lot more, where that problem does not exist and Rayon handles it so much smarter, and you can tRust it.
It also reminds me of Parallel LINQ in the C# programming language, which is similar, although obviously it cannot compete with Rust performance at all.
In this context, I would also like to mention NDArray, which is a really powerful crate for multidimensional array functionality. With things like these, I totally see a very serious place for Rust in both game development as well as in scientific parallel computing. Actually amazing!

jongeduard

It would have been nice if you had gone into some of the deeper stuff in your discussion.

Under the covers you are adding complexity to your code.

It would have been interesting to see how Rayon handles locking and inter-thread communication since you are implementing the classic librarian / reader problem.
Also, even if you are not using multiple cores, depending on the task, you can also gain performance if you are doing some parallelism (e.g., a thread is blocked in a wait state, so another thread can work while the first thread is waiting).

michaelsegel
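
The last point, that threads blocked in a wait state can overlap even without extra cores, can be sketched as follows. `thread::sleep` stands in for a blocking call such as I/O, and the function name `run_blocking_tasks` is an assumption for illustration; this uses plain standard-library threads, not Rayon.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Runs `tasks` simulated blocking operations of length `wait`, each on its
/// own scoped thread, and returns the task ids plus the total elapsed time.
/// `run_blocking_tasks` is a hypothetical helper name.
fn run_blocking_tasks(tasks: usize, wait: Duration) -> (Vec<usize>, Duration) {
    let start = Instant::now();
    let ids = thread::scope(|s| {
        let handles: Vec<_> = (0..tasks)
            .map(|id| {
                s.spawn(move || {
                    thread::sleep(wait); // stand-in for a blocking call
                    id
                })
            })
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().unwrap())
            .collect::<Vec<_>>()
    });
    (ids, start.elapsed())
}
```

Because the waits overlap, the elapsed time for several tasks should usually be close to one `wait` rather than `tasks` times `wait`, although the exact figure depends on the scheduler.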

I ran a comparison of similar code with and without rayon. The non-parallelized code ran more than 2X faster. But it did not call collect() so it wasn't a perfect comparison. I wasn't able to adapt the rayon code to run without calling collect() first. I was able to change the non-parallel code so it would call collect and then iterate. This was a more meaningful comparison, and the parallelized code was about 10% faster. The task was to add one billion f64 numbers all equal to 1.0.

fsaldan
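
The two sequential variants compared in this comment can be sketched like this. The function names are assumptions, and the commenter's actual code is not shown in the source, so this only illustrates why the `collect()` step matters: it allocates and fills a buffer before any summing starts.

```rust
/// Sums `n` copies of 1.0 straight from a lazy iterator, with no allocation.
/// `sum_lazy` is a hypothetical name.
fn sum_lazy(n: usize) -> f64 {
    std::iter::repeat(1.0_f64).take(n).sum()
}

/// Materializes the values in a Vec first and then sums them, which mirrors
/// the "call collect() and then iterate" variant. `sum_collected` is a
/// hypothetical name.
fn sum_collected(n: usize) -> f64 {
    let values: Vec<f64> = std::iter::repeat(1.0_f64).take(n).collect();
    values.iter().sum()
}
```

With one billion `f64` values the collected version needs about 8 GB for the buffer alone, so memory traffic probably dominates that measurement. Rayon can also iterate integer ranges directly via `into_par_iter()`, which may be a way to parallelize without the `collect()`, though that is not verified against the commenter's setup.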

Another crate to cover would be Polars or DataFusion. Both are DataFrame libraries based on Apache Arrow. Polars's documentation is a bit sketchy for Rust atm, and DataFusion appears to prefer doing everything asynchronously.

Direkin

Great overview. I heard of rayon and got a good impression back then but this kind of concise insight is much easier on my brain.
I would appreciate such a treatment for Elementum once it's out.

TheLomsor

I could only calculate the performance benefits of rayon and xargs with benchmarking. Is there a deterministic way to calculate the performance benefits for the specific task beforehand? In my case, I have large chunked files and have like 30 CPU cores in the computation center. Whenever I use rayon and xargs together, the performance somehow drops. Let's say the task is creating a frequency table where each line is the count of a quantity in these large files.
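
There is probably no way to predict the exact benefit without measuring, but Amdahl's law gives an upper bound on the speedup once you estimate what fraction of the run time can be parallelized. A minimal sketch, where the function name and the example fraction of 0.9 are assumptions rather than figures from the comment:

```rust
/// Upper bound on the speedup predicted by Amdahl's law:
/// 1 / ((1 - p) + p / n), where p is the parallelizable fraction of the
/// run time and n is the number of workers.
fn amdahl_speedup(parallel_fraction: f64, workers: u32) -> f64 {
    assert!((0.0..=1.0).contains(&parallel_fraction));
    assert!(workers > 0);
    1.0 / ((1.0 - parallel_fraction) + parallel_fraction / f64::from(workers))
}
```

If 90% of the job were parallelizable, `amdahl_speedup(0.9, 30)` comes out at roughly 7.7, far below 30. The bound ignores I/O contention and oversubscription, so combining xargs processes with a Rayon pool in each process could still land well under it; only a benchmark will show that.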

Автор

It would be cool to see how to parallelize code across different CPUs (not threads of the same CPU) on the same machine, e.g. on an HPC cluster. In C you would use MPI for that. How would that work in Rust?

Metagross

On the second benchmark example, I think the error bars should have been called out. It compares 300 +/- 30 to 200 +/- 130 milliseconds.

isabelkaspriskie
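
The arithmetic behind this remark can be written out: 300 +/- 30 spans 270 to 330, and 200 +/- 130 spans 70 to 330, so the two ranges overlap. A small sketch, with the `Measurement` type as an assumption for illustration:

```rust
/// A benchmark result expressed as mean +/- spread, both in the same unit.
/// `Measurement` is a hypothetical type for this example.
#[derive(Debug, Clone, Copy)]
struct Measurement {
    mean: f64,
    spread: f64,
}

impl Measurement {
    fn low(&self) -> f64 {
        self.mean - self.spread
    }

    fn high(&self) -> f64 {
        self.mean + self.spread
    }

    /// True when the two [low, high] ranges share at least one point.
    fn overlaps(&self, other: &Measurement) -> bool {
        self.low() <= other.high() && other.low() <= self.high()
    }
}
```

An overlap check like this is a crude screen, not a significance test, but it is enough to show why a 300 versus 200 headline number deserves a caveat.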

Some day, they'll have chat-gpt3/4 integrated to show you how to fix your compiler errors, and/or fix them for you with a little "fix it" button. Then, Rust will truly be easy for all.

jeffg

So, if I understand correctly, in your example code, parallelizing it only made sense when you were processing a vector of at least a certain length. And it would've only made sense with a smaller vector if your filter operation had been more expensive, right? And obviously Rayon can't see how expensive your filter operation is, so Rayon can't make an educated guess about when the vector is long enough to justify parallelizing it.

In that case, wouldn't it make sense if Rayon would offer a method like > 2e6)` ?

EvertvanBrussel
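
The idea of only going parallel above a size threshold can be sketched by hand. The function name `count_with_threshold` and its signature are hypothetical and are not a Rayon API; the parallel branch uses standard-library scoped threads as a stand-in for a parallel iterator.

```rust
use std::thread;

/// Counts matches sequentially when `data` has fewer than `threshold` items,
/// and on `workers` scoped threads otherwise. Hypothetical helper, not part
/// of Rayon.
fn count_with_threshold<T, F>(data: &[T], threshold: usize, workers: usize, pred: F) -> usize
where
    T: Sync,
    F: Fn(&T) -> bool + Sync,
{
    // Small inputs: thread start-up cost would outweigh the work itself.
    if data.is_empty() || data.len() < threshold || workers <= 1 {
        return data.iter().filter(|x| pred(*x)).count();
    }
    // Large inputs: one contiguous chunk per worker (ceiling division).
    let chunk_len = (data.len() + workers - 1) / workers;
    let pred = &pred;
    thread::scope(|s| {
        let handles: Vec<_> = data
            .chunks(chunk_len)
            .map(|chunk| s.spawn(move || chunk.iter().filter(|x| pred(*x)).count()))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum::<usize>()
    })
}
```

As the comment notes, a good threshold depends on how expensive the filter is, so the caller has to find it by benchmarking; a value such as 2e6 items would be specific to one workload and one machine.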

I've been making benches with criterion which works fine but for small tests like this I had absolutely no idea this way of benchmarking existed lmao

oxey_