The Parquet Format and Performance Optimization Opportunities Boudewijn Braams (Databricks)

The Parquet format is one of the most widely used columnar storage formats in the Spark ecosystem. Given that I/O is expensive and that the storage layer is the entry point for any query execution, understanding the intricacies of your storage format is important for optimizing your workloads. As an introduction, we will provide context around the format, covering the basics of structured data formats and the underlying physical data storage model alternatives (row-wise, columnar and hybrid). Given this context, we will dive deeper into specifics of the Parquet format: representation on disk, physical data organization (row-groups, column-chunks and pages) and encoding schemes. Now equipped with sufficient background knowledge, we will discuss several performance optimization opportunities with respect to the format: dictionary encoding, page compression, predicate pushdown (min/max skipping), dictionary filtering and partitioning schemes. We will learn how to combat the evil that is 'many small files', and will discuss the open-source Delta Lake format in relation to this and Parquet in general. This talk serves both as an approachable refresher on columnar storage and as a guide on how to leverage the Parquet format for speeding up analytical workloads in Spark using tangible tips and tricks.
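
As a rough illustration of the min/max skipping mentioned above, the sketch below models row groups in plain Python rather than with a real Parquet reader. The `RowGroup` class and the `make_row_groups` and `scan_equals` helpers are hypothetical names invented for this example and are not part of Parquet or Spark. It assumes a single integer column and an equality predicate, and it suggests why sorting data before writing can let a reader skip more row groups.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RowGroup:
    """Hypothetical stand-in for a Parquet row group holding one integer column."""
    values: List[int]

    @property
    def min_value(self) -> int:
        # Real Parquet stores this statistic in metadata; here it is recomputed.
        return min(self.values)

    @property
    def max_value(self) -> int:
        return max(self.values)


def make_row_groups(values: List[int], rows_per_group: int) -> List[RowGroup]:
    """Split a column into fixed-size row groups (assumed layout, for illustration)."""
    return [
        RowGroup(values[i:i + rows_per_group])
        for i in range(0, len(values), rows_per_group)
    ]


def scan_equals(row_groups: List[RowGroup], target: int) -> Tuple[List[int], int]:
    """Evaluate `column == target`, skipping groups whose [min, max] excludes target.

    Returns the matching values and the number of row groups actually read.
    """
    matches: List[int] = []
    groups_read = 0
    for group in row_groups:
        if target < group.min_value or target > group.max_value:
            continue  # statistics prove no row in this group can match
        groups_read += 1
        matches.extend(v for v in group.values if v == target)
    return matches, groups_read


data = [7, 3, 9, 1, 8, 2, 6, 4, 5, 0, 11, 10]
unsorted_groups = make_row_groups(data, 4)
sorted_groups = make_row_groups(sorted(data), 4)

print(scan_equals(unsorted_groups, 5))
print(scan_equals(sorted_groups, 5))
```

With the unsorted input, every row group's min/max range contains 5, so all three groups are read; after sorting, only one group has to be read.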

About: Databricks provides a unified data analytics platform, powered by Apache Spark™, that accelerates innovation by unifying data science, engineering and business.

Comments

The best representation of Parquet file structure!! Simply Awesome!!

prabhumaganur

Great talk, great teaching, excellent tutor! One of the best presentations I have ever watched and listened to.

robinjamwal

This content explained most of the things, and it is really amazing.

manishsingh

Great overview to address performance issues with storage layer design 👍

SunilBuge

Best Parquet file presentation I have watched.

lhok

Great, this helped me learn more about Parquet. Thanks for the presentation!

YinghuaShen-kwys

Awesome video - not too many extraneous or labored points. Thank you!

kehaarable

Awesome video with great content and explanation. Very very useful.

mallikarjunyadav

Impressive presentation, well-structured explanations.

AM-izgk

Amazing. All concepts really well explained.

flaviofukabori

Finally understood what the Parquet format is, thanks.
I have one small question: does this mean the footer metadata is nothing but schema details, such as details of the underlying table? For example, a way to record the table name, column names, etc.?
I'll also dig into it myself, but asking in the meantime ....

AmitParopkari

@databricks - what is the best practice on whether or not to use nested columns? For example, I have a customer struct with attributes such as Age, Gender, Name, etc. Is it better to keep it as a struct or to split it into separate columns?

tadastadux

It seems the time and I/O needed to sort the data first, before it can be used, is not considered?

higiniofuentes

Interesting how Parquet (columnar, analytics-focused) data can be optimized using dictionary-based compression and partitioning.

salookie

I haven't watched this yet but for the sake of prioritizing when I do, how well does this topic apply to platforms and systems other than Spark?

chrisjfox

Anyone who says Parquet is a columnar format has just bookish knowledge.

rum

Thanks for posting this presentation. Could you clarify something? How does performance improve when you compress pages only to decompress them again to read them? I'm sure I'm not understanding something, but I'm not sure what.

spacedustpi

How is storing JSON/XML (not Parquet) more efficient than CSV? You literally store the "column names" in each "row" in XML/JSON (at least when stored in a text file). Also, there is definitely the notion of a "record" in CSV.

jeremygiaco

Bucketing explanation was not great. Rest was fantabulous.

thevijayraj