5 tips for reading large CSV files faster

5 Tips for Reading Big CSV Files in Python with Pandas and pyarrow

In this video, we'll see how you can efficiently read large CSV files in Python with pandas and pyarrow. Pandas 2.0 added support for Apache Arrow through the pyarrow library.
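As a minimal setup sketch (not from the video), you need pandas 2.0 or later plus the pyarrow package for the Arrow-backed features discussed here:

```python
# Install first with: pip install "pandas>=2.0" pyarrow
import pandas as pd
import pyarrow as pa

print(pd.__version__)  # should print 2.0.0 or later
print(pa.__version__)
```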

I compare the C engine with the pyarrow engine for the read_csv function in pandas, and I also compare the three dtype backends: numpy, numpy_nullable, and pyarrow.
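A sketch of the combinations being compared, assuming a placeholder file name "data.csv" (not from the video):

```python
import pandas as pd

# Default: C parser engine with NumPy-backed dtypes.
df_c = pd.read_csv("data.csv")

# C engine, but with pandas' nullable extension dtypes.
df_nullable = pd.read_csv("data.csv", dtype_backend="numpy_nullable")

# Multithreaded pyarrow parser with Arrow-backed dtypes.
df_arrow = pd.read_csv("data.csv", engine="pyarrow", dtype_backend="pyarrow")
```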

In my example, a few tweaks make reading my sample 4GB file go from 28 seconds down to 6 seconds. Finally, I show how to use the Parquet file format and how this can further decrease the reading time to less than 1 second.
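A rough sketch of the Parquet step (file names are placeholders; to_parquet requires pyarrow or fastparquet to be installed):

```python
import pandas as pd

df = pd.read_csv("data.csv", engine="pyarrow", dtype_backend="pyarrow")

# One-time conversion: Parquet is a typed, columnar, compressed format,
# so subsequent reads skip CSV parsing entirely.
df.to_parquet("data.parquet")

df_fast = pd.read_parquet("data.parquet", dtype_backend="pyarrow")
```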

👍 Please like if you found this video helpful, and subscribe to stay updated with my latest tutorials. 🔔

🔖 Chapters:
00:00 Intro
00:40 Apache Arrow
02:12 Large CSV file
03:10 Tip 1: Keep your CSV files compressed
04:17 Tip 2: Use the pyarrow library with pandas
08:29 Use the pyarrow engine with read_csv()
10:59 Tip 3: Store the DataFrame in the Parquet format
14:21 Tip 4: Only read the columns you need
16:52 Tip 5: Only read the rows you need
19:25 More about Apache Arrow
21:39 Outro
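For tips 1, 4, and 5 in the chapters above, here is a hedged sketch; the file and column names are illustrative, not from the video:

```python
import pandas as pd

# Tip 1: pandas infers compression from the file extension
# (.gz, .bz2, .zip, .xz, .zst), so compressed CSVs can be read directly.
df = pd.read_csv("data.csv.gz")

# Tip 4: only read the columns you need.
cols = pd.read_parquet("data.parquet", columns=["date", "price"])

# Tip 5: only read the rows you need; with Parquet, filters are pushed
# down to pyarrow so non-matching row groups are skipped.
rows = pd.read_parquet("data.parquet", filters=[("price", ">", 100)])
```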

Video links:

🐍 More Vincent Codes Finance:

#pandas #arrow #pyarrow #python #parquet #bigdata #csv #research #researchtips #jupyternotebook #vscode #professor #finance #datascience #dataanalytics #dataanalysis #duckdb #polars
Comments:

Thanks for researching and posting these video gems you've been putting out. I'm adding this into my "toolbox" for sure.

- my_yt_