The BEST library for building Data Pipelines...

Building data pipelines with #python is an important skill for data engineers and data scientists. But which library is best? In this video we look at three options: pandas, polars, and Spark (PySpark).

Timeline:
00:00 Data Pipelines
01:11 The Data
02:32 Pandas
04:34 Polars
06:15 PySpark
09:15 Spark SQL
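The timeline above walks through the same kind of aggregation in each tool. As a rough sketch of the pipeline step being compared (hypothetical column names and values; the video's actual flight dataset is not reproduced here), the pandas version might look like:

```python
import pandas as pd

# Hypothetical flight-style data standing in for the video's dataset.
flights = pd.DataFrame({
    "airline": ["AA", "AA", "DL", "DL"],
    "delay_minutes": [10, 30, 5, 15],
})

# A typical pipeline step: group by a key column and aggregate.
avg_delay = flights.groupby("airline")["delay_minutes"].mean()
print(avg_delay.to_dict())  # {'AA': 20.0, 'DL': 10.0}
```

The same group-by/aggregate shape carries over to polars and PySpark; the libraries differ mainly in execution model and overhead, not in what the step expresses.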

My other videos:

#python #polars #spark #dataengineering
Comments

If you enjoyed this video please consider subscribing and check out some of my videos on similar topics:

robmulla

These are phenomenal, I especially like these short 10-15 min videos. Thanks a lot for sharing all these relevant and up-to-date topics!

anchyzas

One thing you said implicitly is quite important: the footprint of polars is way smaller than pandas', which suggests polars may be a good choice for edge or serverless computing. In those cases I often refrain from using pandas because of the resources needed and the startup time, and I end up doing funny stuff with dicts, classes, tuples… I'm considering exploring polars for that.

riessm

Great video! Always curious about Spark and this gave a great overview of these 3 tools! 💡

joseortiz_io

Thanks for such awesome content. I love polars and have been trying it since your video came out; it would be nice to see you use it in a data exploration video :D

tonyle

Another great video! Thanks Rob! Looking forward to the next stream

fee-f-foe-fum

Hey Rob, huge fan of your work, keep rolling😀

shivayshakti

Rob, thank you! It's almost as if you read minds! This video went above and beyond here! I'd been toying with trying a local session of Spark, and thanks to you, now have the impetus to give it a go!

DarthJarJar

Great introduction video! Thank you!
Looks like most of the PySpark time went to initializing the session itself; as far as I understand, the session is created once and then reused by later getOrCreate() calls. But anyway, for bigger pipelines Spark will work faster.

arturabizgeldin

It was a great video and very useful. Adding Spark to the mix was just awesome! For a next video, covering duckdb and its benefits vs polars, or maybe duckdb alongside polars, would be great! The founder of duckdb said that for most companies it is enough, so testing and discussing that claim would also be great. Duckdb is said to use vectorized execution; a discussion of how vectorized execution is faster or better would also be great. Thanks!

TheSiddhaartha

Hi Rob and thanks for the excellent work, I enjoy each of your videos!
I would be interested in a video explaining how to chain several machine learning libraries pulled from GitHub, for example: object detection + keypoint estimation + person identification. Also, how to manage compatible library versions across repos that have different (incompatible) requirements.
Thanks!

jorislimonier

I like these types of videos as they clear up all confusion.

prashlovessamosa

Thanks for the educational content Rob

aminehadjmeliani

Thanks for the great video! I'd like to see a comparison with other distributed Python libraries, such as Modin. Thanks!

somerset

I really like your content. Absolutely grade A+

aabbassp

Great video! I have a Junior Data Engineer interview coming up and I'm stressed. I don't have any previous working experience in this field. I feel somewhat confident in SQL and Pandas and have been practicing on Strata Scratch. I absolutely hate the Data Structures and Algorithms type of questions like the ones on leetcode and I can't even answer the easy ones. I'm worried that my interview will have those kinds of coding problems. My initial goal was to become a Data Analyst but decided to apply for Data Engineer since it is a junior position.

chillvibe

Excellent, great content. Thanks for sharing.

peterluo

Hey Rob, this was a great video - clear and concise. Could you explain how you would set up an analysis that would run regularly as the data changed? For example, the flight data you used in this example, let's say that was updated once a week and you needed to update the aggregate stats, and maybe even track the aggregates over time. Thanks!

steve_dunlop

Very good video. Could you please make more advanced polars videos? I have started switching from pandas to polars and I really want to learn how to do more advanced things with it.

Alexander-pktu

Hi Rob, wonderful video as always! Can you make a video on how to deploy a trained machine learning model (maybe the XGBoost forecaster you made) using Docker?

Arkantosi