Nico Kreiling: Raised by Pandas, striving for more: An opinionated introduction to Polars

Pandas is the de facto standard for data manipulation in Python, and I personally love it for its flexible syntax and interoperability. But Pandas has well-known drawbacks, such as memory inefficiency, inconsistent missing-data handling, and a lack of multicore support. Multiple open-source projects aim to solve those issues; the most interesting is Polars.

Polars uses Rust and Apache Arrow to win all kinds of performance benchmarks, and it evolves fast. But is it already stable enough to migrate an existing Pandas codebase? And does it meet the high expectations that long-time Pandas lovers have of query-language flexibility?

In this talk, I will explain how Polars can be that fast and present my insights on where Polars shines and in which scenarios I stay with Pandas (at least for now!).
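
To make the "inconsistent missing data handling" point above concrete, here is a minimal sketch in Python using pandas only. It is an illustration, not an example taken from the talk, and the variable names (`ints`, `flags`, `times`, `nullable`) are made up for the demonstration. It shows that the marker pandas uses for a missing value depends on the column's dtype.

```python
import pandas as pd

# Integer data with a gap is silently upcast to float64; the gap becomes NaN.
ints = pd.Series([1, None, 3])
print(ints.dtype)            # float64

# Boolean data with a gap falls back to the generic object dtype and keeps None.
flags = pd.Series([True, None, False])
print(flags.dtype)           # object

# Datetime data uses its own missing marker, NaT.
times = pd.Series(pd.to_datetime(["2024-01-01", None]))
print(times.iloc[1])         # NaT

# The opt-in nullable integer dtype keeps integers and marks the gap with pd.NA.
nullable = pd.Series([1, None, 3], dtype="Int64")
print(nullable.dtype)        # Int64

# isna() hides the differences: every marker above counts as missing.
for s in (ints, flags, times, nullable):
    assert s.isna().sum() == 1
```

Four columns give four different markers (NaN, None, NaT, pd.NA), which is one plausible reading of the complaint. Polars, which builds on Apache Arrow, uses a single `null` for missing values regardless of dtype.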

Comments

Good to see that the "Things I miss in Polars" list has gotten much smaller with the integration by scikit-learn and hvplot.

gregorywpower

Nice presentation. Discovered Polars only recently and already like it. Feels way more lightweight than Spark, and I don't often need a whole cluster to compute stuff.

flwi

What does it mean by typing efficiency?

chndrl

Nice seeing "the industry" apply vertical scaling techniques that have been used in games (for world data) for at least the last 20-30 years, since a cache miss is orders of magnitude more expensive than a cache hit and SIMD/MMX instructions are available on x86.

I thought that compilers had taken care of this processor "bowels" stuff for the last two decades, and they probably did until now.
It seems that today's data sets are so huge that we must micromanage memory access and vertical parallelism explicitly again. For a little while at least.

So in hindsight, explicit optimisation of data access patterns was outrun by processor might, and then rendered further obsolete by multi-core CPUs.
In theory, of course, because good usage of parallelism is still in its infancy.

But now it seems data set size growth has outrun CPU growth 🤣

Low-level 3D engine game programmers, you have a whole new market opening!

javierbenito
