Row Group Size in Parquet: Not too big, not too small

In this video, we'll learn all about row group sizes in Apache Parquet. With the help of DuckDB, we'll query files written with different row group sizes and see how they fare on different queries.

#apacheparquet #dataengineering #duckdb

00:20 Intro to different files
00:31 Import DuckDB
00:43 Run query function
00:53 Basic count query
01:33 Performance of all files for count query
02:05 Looking into row group metadata for each file
02:35 Average salary query
02:53 Performance of all files for average salary query
03:10 htop output for each query to analyze CPU usage
05:28 Average value query
06:03 Performance of all files for average value query
06:20 Compute the number of row groups matched for each file for the average value query
06:56 Conclusion
Comments

TIL about GROUP BY ALL... <confident alcoholic meme goes here>

Thanks for the video! Found it after writing a Parquet compactor script to let Spark efficiently process AWS VPC flow logs (tons of files, ranging from one to millions of rows per row group).

gradkb

Nice video, man! I like the approach of focusing on the relationship between row group sizes and the machine they're being used on (how many CPUs are available, etc.). Cheers!

DiogoBaeder

learndatawithmark