Row Group Size in Parquet: Not too big, not too small

In this video, we'll learn all about row group sizes in Apache Parquet. With the help of DuckDB, we'll query files written with different row group sizes and see how they fare on different queries.

#apacheparquet #dataengineering #duckdb

00:20 Intro to different files
00:31 Import DuckDB
00:43 Run query function
00:53 Basic count query
01:33 Performance of all files for count query
02:05 Looking into row group metadata for each file
02:35 Average salary query
02:53 Performance of all files for average salary query
03:10 htop output for each query to analyze CPU usage
05:28 Average value query
06:03 Performance of all files for average value query
06:20 Compute the number of row groups matched for each file for the average value query
06:56 Conclusion
Comments

TIL about GROUP BY ALL... <confident alcoholic meme goes here>

Thanks for the video! Found it after writing a Parquet compactor script to let Spark efficiently process AWS VPC flow logs (tons of files, ranging from one to millions of rows per row group).

gradkb

Nice video, man! I like the approach of focusing on the relationship between row group sizes and the machine they're being used on (how many CPUs are available, etc.). Cheers!

DiogoBaeder

learndatawithmark