PyArrow vs. Pandas for managing CSV files - How to Speed Up Data Loading | Better Data Science

Do you find Pandas slow? Well, I do, at least for reading and writing CSV files. There’s an alternative that leaves your data analysis pipeline untouched, and it’s called PyArrow. It can speed up read/write times by around 7 times, and this video will teach you how to work with it.
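As a rough sketch of the workflow the video covers (file names here are placeholders), PyArrow can take over the CSV reads and writes while Pandas keeps doing the analysis:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.csv as pv  # the csv submodule must be imported explicitly

# Placeholder path -- point this at your own CSV file.
csv_path = "data.csv"

# Read with PyArrow, then hand the result to Pandas for the rest of the pipeline.
table = pv.read_csv(csv_path)   # returns a pyarrow.Table
df = table.to_pandas()          # regular pandas.DataFrame

# Writing works the same way in reverse.
pv.write_csv(pa.Table.from_pandas(df), "data_out.csv")
```

The actual speed-up you see depends on the file size and the column types.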

FOLLOW BETTER DATA SCIENCE

FREE “LEARN DATA SCIENCE MASTERPLAN” EBOOK

GEAR I USE
Comments

PyArrow now supports date columns with the date32/date64 scalar types and the Date32Array/Date64Array classes!

Theenzo
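For reference, a small sketch of the date types the comment mentions, as they are spelled in current PyArrow releases (the sample values are made up):

```python
import datetime
import pyarrow as pa

# date32() stores days since the UNIX epoch; date64() stores milliseconds.
dates = pa.array(
    [datetime.date(2021, 1, 1), datetime.date(2021, 6, 15)],
    type=pa.date32(),
)
print(type(dates).__name__)  # Date32Array
print(dates.to_pandas())     # converts to a pandas Series
```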

Great tutorial! Could you please increase the font size when coding in future uploads? It would be greatly appreciated. Thanks!

caesarHQ

Why save to CSV at all when using PyArrow? Why not use Parquet files and save both time and disk space?
I mean, write to Parquet using PyArrow, skip the conversion, and read with Pandas using PyArrow as the engine...
Edit: I should have watched the full video before commenting. Looking forward to the following videos.
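A minimal sketch of what the commenter describes, assuming a throwaway DataFrame and placeholder file name: write Parquet directly with PyArrow and read it back with Pandas using the PyArrow engine.

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# Write straight to Parquet via PyArrow -- no intermediate CSV.
pq.write_table(pa.Table.from_pandas(df), "data.parquet")

# Read it back with Pandas, with PyArrow as the Parquet engine.
df2 = pd.read_parquet("data.parquet", engine="pyarrow")
```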


Hi Dario, I found your description very informative. Thank you for the video. I completed the installation and am trying to read a CSV file: importing pyarrow itself works fine, but when I call the CSV reader I get this error: module 'pyarrow' has no attribute 'csv'. Any advice on what the problem could be?

harshalverma
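One common cause of that error (an outdated PyArrow install is another possibility) is that pyarrow.csv is a submodule that is not loaded by `import pyarrow` alone. A sketch of the explicit import, with a placeholder file name:

```python
import pyarrow as pa
import pyarrow.csv  # without this line, pa.csv raises AttributeError
                    # (alternatively: from pyarrow import csv)

table = pa.csv.read_csv("data.csv")  # works once the submodule is imported
```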

Hi, I noticed that when PyArrow writes the CSV it puts double quotes around strings. Is there any way to avoid this? I have a time-sensitive case and would really appreciate your help. Thanks!

CEOofTheHood
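Recent PyArrow versions expose a quoting_style option on pyarrow.csv.WriteOptions that controls this; older releases always quoted string values. A hedged sketch, with made-up column names and file name:

```python
import pyarrow as pa
import pyarrow.csv as pv

table = pa.table({"name": ["Alice", "Bob"], "age": [30, 40]})

# "needed" quotes only values that require it (e.g. ones containing the
# delimiter); "none" disables quoting entirely and fails if quoting is needed.
opts = pv.WriteOptions(quoting_style="needed")
pv.write_csv(table, "out.csv", write_options=opts)
```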

Is it only me? Pandas turned out to be faster after all! It took less than a minute, while PyArrow needed extra time for the conversion step first. Since the conversion is necessary, its timing should be counted too!

omar_
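Results like this depend heavily on file size, column types, and hardware. A minimal sketch of how to time both approaches while counting the Table-to-DataFrame conversion against PyArrow, as the comment suggests (the file path is a placeholder):

```python
import time
import pandas as pd
import pyarrow.csv as pv

path = "big_file.csv"  # placeholder -- point this at a large CSV

start = time.perf_counter()
df_pandas = pd.read_csv(path)
print(f"pandas.read_csv:       {time.perf_counter() - start:.2f} s")

start = time.perf_counter()
df_arrow = pv.read_csv(path).to_pandas()  # conversion included in the timing
print(f"pyarrow + to_pandas(): {time.perf_counter() - start:.2f} s")
```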