Koalas dataframe on SPARK = Pandas API supercharged!

Ever wondered why your Python code does not run on all of your CPU cores/threads?
Solve this problem for your data engineering and data science tasks, including AI.

Run your Python code at 100% load on all your CPU threads with Koalas.

Koalas: a pandas-equivalent API that works on Apache Spark.
Move your single-node, in-memory pandas code to a distributed Spark execution engine, with additional SQL and ML capabilities, without having to master PySpark first.
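A minimal sketch of what such a move can look like, assuming the `koalas` package (imported as `databricks.koalas`) and a working local Spark install. The column names, the sample data, and the helper names `top_cities` and `demo` are made up for illustration and are not taken from the video:

```python
# Hypothetical sample data; any small table with these two columns would do.
SAMPLE = {
    "city": ["A", "B", "A", "B", "C"],
    "sales": [10, 20, 30, 40, 5],
}


def top_cities(df, n=2):
    """Total sales per city, largest first.

    Uses only calls that exist in both pandas and Koalas, so the same
    function accepts either kind of DataFrame.
    """
    return df.groupby("city")["sales"].sum().sort_values(ascending=False).head(n)


def demo():
    """Run the same analysis once on pandas and once on Koalas (needs Spark)."""
    import pandas as pd
    import databricks.koalas as ks  # assumption: on Spark 3.2+ use `import pyspark.pandas as ks`

    pdf = pd.DataFrame(SAMPLE)      # single-node, in-memory
    kdf = ks.from_pandas(pdf)       # same data, now backed by Spark

    print(top_cities(pdf))
    print(top_cities(kdf).to_pandas())  # bring the small result back to pandas
```

Calling `demo()` starts a Spark session on first use, so expect a startup delay that a tiny table like this one will not win back.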

By the way: PySpark is the next logical step after Koalas if you want to improve your coding skills.

Minimize execution time for your data analysis tools.
Plus code for a simple PyArrow optimization in Spark that converts pandas DataFrames even faster.
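The Arrow setting is not spelled out in this description, so the following is only a sketch of one common way to do it, assuming Spark 3.x with `pyspark` and `pyarrow` installed. The helper names `local_spark_conf`, `build_session` and `roundtrip` are hypothetical:

```python
def local_spark_conf(arrow=True, cores="*"):
    """Settings for a single-machine Spark session.

    `local[*]` asks Spark to use every available CPU thread. The Arrow keys
    are the Spark 3.x names; Spark 2.x used `spark.sql.execution.arrow.enabled`.
    """
    return {
        "spark.master": f"local[{cores}]",
        "spark.sql.execution.arrow.pyspark.enabled": str(arrow).lower(),
        # Fall back to the slower non-Arrow path instead of failing
        # when a column type is not supported by Arrow.
        "spark.sql.execution.arrow.pyspark.fallback.enabled": "true",
    }


def build_session(conf):
    """Create (or reuse) a SparkSession with the given settings."""
    from pyspark.sql import SparkSession  # imported lazily: needs a Spark install

    builder = SparkSession.builder.appName("koalas-arrow-sketch")
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


def roundtrip(pdf, spark):
    """pandas -> Spark -> pandas; both conversions use Arrow when it is enabled."""
    sdf = spark.createDataFrame(pdf)
    return sdf.toPandas()
```

A typical call would be `spark = build_session(local_spark_conf())` followed by `roundtrip(pdf, spark)`. Timing it with `arrow=True` and `arrow=False` on your own data shows whether the setting pays off in your case.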

Benchmarked on my Win10 PC with a single AMD CPU and no Nvidia GPU (no CUDA cores).

BUT:
Does it really make sense to run Koalas on a single-node PC running Spark? Officially, no.
The recommended setup is, of course, a decent Spark cluster (e.g. a Databricks cluster running Databricks Runtime). Check out clouds like AWS, Microsoft Azure or Google Cloud!

Personal note:
For file sizes of 10-100 GB, the performance of my specific applications increases significantly with Koalas.

Increase your Python performance (especially speed) with Koalas on Spark.

#code_in_real_time
#real_time_coding
#JupyterLab
#Python
#SPARK
#Koalas
#dataframe
#pandas
#PySpark