Nico Kreiling: Raised by Pandas, striving for more: An opinionated introduction to Polars

Pandas is the de facto standard for data manipulation in Python, and I personally love it for its flexible syntax and interoperability. But Pandas has well-known drawbacks, such as memory inefficiency, inconsistent handling of missing data, and a lack of multicore support. Multiple open-source projects aim to solve those issues; the most interesting is Polars.

Polars uses Rust and Apache Arrow to win all kinds of performance benchmarks, and it evolves fast. But is it already stable enough to migrate an existing Pandas codebase? And does it meet the high expectations that long-time Pandas lovers have for query language flexibility?

In this talk, I will explain how Polars can be that fast and present my insights on where Polars shines and in which scenarios I stay with Pandas (at least for now!).
Comments

Good to see that the "Things I miss in Polars" list has gotten much smaller now that scikit-learn and hvPlot support it.

gregorywpower

Nice presentation. I discovered Polars only recently and already like it. It feels way more lightweight than Spark, and I don't often need a whole cluster to compute stuff.

flwi

What is meant by typing efficiency?

chndrl

Nice seeing "the industry" apply vertical-scaling techniques that have been used in games (for world data) for at least the last 20-30 years, since a cache miss is orders of magnitude more expensive than a normal cache read and SIMD/MMX instructions are available on x86.

I thought that compilers had taken care of this processor "bowels" stuff for the last two decades, and they probably did until now.
It seems that today the data sets are so huge that we must micromanage memory access and vertical parallelism explicitly again. For a little while at least.

So in hindsight, explicit optimisation of data access patterns in code was outrun by processor might, and then was rendered further obsolete by multi-core CPUs:
In theory, of course, because good usage of parallelism is still in its infancy.

But now it seems data set size growth has outrun CPU growth 🤣

Low-level 3D game engine programmers, you have a whole new market opening!

javierbenito