5 Reasons Parquet Files Are Better Than CSV for Data Analyses | PyData Global 2021

5 Reasons Parquet Files Are Better Than CSV for Data Analyses
Speaker: Matthew Powers

Summary
Parquet files are well supported by most languages and libraries, are easier to work with, and are typically more performant than CSV files. This talk summarizes the main benefits of Parquet files and uses benchmarks to show how much faster they are. You’ll also learn how to convert CSV files to Parquet.

Description
5 reasons Parquet files are better than CSV:

schema - examine how the schema is embedded in the file metadata leveraging PyArrow
file sizes - compare file sizes when identical data is written to CSV and Parquet
columnar file format - examine performance benefits from leveraging column pruning to skip data
predicate pushdown filtering - understand how to query row group metadata with PyArrow and how to skip entire row groups based on column metadata
immutable - why immutable file formats are better
How to convert CSV files to Parquet with Pandas, Dask, and PySpark. The talk shows how to convert a single file or multiple files in parallel.

When to use CSV files and when to avoid them.

Matthew Powers's Bio
Powers is a tech evangelist at Coiled.

He used Spark / PySpark for 6 years and now helps devs understand when Dask is a better fit.

He's written two books, has a popular blog, and regularly contributes to open source codebases.

In a past life, he passed all three CFA exams and worked in finance.

PyData Global 2021

PyData is an educational program of NumFOCUS, a 501(c)(3) non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R.

PyData conferences aim to be accessible and community-driven, with novice to advanced level presentations. PyData tutorials and talks bring attendees the latest project features along with cutting-edge use cases.

00:00 Welcome!
00:10 Help us add time stamps or captions to this video! See the description for details.

Comments

My goodness this was SO helpful! Thank you!

lindajackson