The Parquet Format and Performance Optimization Opportunities - Boudewijn Braams (Databricks)

The Parquet format is one of the most widely used columnar storage formats in the Spark ecosystem. Given that I/O is expensive and that the storage layer is the entry point for any query execution, understanding the intricacies of your storage format is important for optimizing your workloads. As an introduction, we will provide context around the format, covering the basics of structured data formats and the underlying physical data storage model alternatives (row-wise, columnar and hybrid). Given this context, we will dive deeper into the specifics of the Parquet format: representation on disk, physical data organization (row groups, column chunks and pages) and encoding schemes. With this background in place, we will discuss several performance optimization opportunities with respect to the format: dictionary encoding, page compression, predicate pushdown (min/max skipping), dictionary filtering and partitioning schemes. We will learn how to combat the evil that is 'many small files', and will discuss the open-source Delta Lake format in relation to this and to Parquet in general. This talk serves both as an approachable refresher on columnar storage and as a guide on how to leverage the Parquet format to speed up analytical workloads in Spark using tangible tips and tricks.
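
As a rough illustration of the min/max skipping mentioned in the description, the sketch below models row groups as plain Python lists that carry their own min/max statistics. The names `RowGroup`, `build_row_groups` and `scan_equal` are hypothetical and are not part of the Parquet or Spark APIs; the group size of 4 is an arbitrary assumption chosen to keep the example small. It only aims to show why sorting on the filtered column can let a reader skip most row groups.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RowGroup:
    """Hypothetical stand-in for a row group of a single integer column."""
    values: List[int]
    min_value: int
    max_value: int


def build_row_groups(values: List[int], group_size: int) -> List[RowGroup]:
    """Split a column into fixed-size groups and record min/max per group."""
    groups = []
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        groups.append(RowGroup(chunk, min(chunk), max(chunk)))
    return groups


def scan_equal(groups: List[RowGroup], target: int) -> Tuple[List[int], int]:
    """Evaluate `value == target`, skipping groups whose range excludes it.

    Returns the matching values and the number of groups actually read.
    """
    matches: List[int] = []
    groups_read = 0
    for group in groups:
        if target < group.min_value or target > group.max_value:
            continue  # statistics prove no row can match: skip the group
        groups_read += 1
        matches.extend(v for v in group.values if v == target)
    return matches, groups_read


unsorted_values = [7, 1, 9, 3, 8, 2, 6, 4, 5, 0, 11, 10]
sorted_values = sorted(unsorted_values)

matches_unsorted, read_unsorted = scan_equal(build_row_groups(unsorted_values, 4), 5)
matches_sorted, read_sorted = scan_equal(build_row_groups(sorted_values, 4), 5)

print("groups read (unsorted):", read_unsorted)
print("groups read (sorted):  ", read_sorted)
```

In this toy data every unsorted group has a wide value range, so none can be skipped, while the sorted layout narrows each range and only one group needs to be read.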

About: Databricks provides a unified data analytics platform, powered by Apache Spark™, that accelerates innovation by unifying data science, engineering and business.

Comments

The best representation of Parquet file structure!! Simply Awesome!!

prabhumaganur

Great talk, great teaching, excellent tutor! One of the best presentations I have ever watched and listened to.

robinjamwal

This content explained most of the topic, and it is really amazing.

manishsingh

Great overview to address performance issues with storage layer design 👍

SunilBuge

Best Parquet file presentation I have watched.

lhok

Great, this helped me learn more about Parquet. Thanks for the presentation!

YinghuaShen-kwys

Awesome video - not too many extraneous or labored points. Thank you!

kehaarable

Awesome video with great content and explanation. Very very useful.

mallikarjunyadav

Impressive presentation, well-structured explanations.

AM-izgk

Amazing. All concepts really well explained.

flaviofukabori

Finally understood what the Parquet format is, thanks.
So I have one small question: does this mean that the footer metadata is nothing but schema details, like the underlying table details? Like a way to record the table name, column names, etc.?
I'll also dig into it on my side, but in the meantime...

AmitParopkari

@databricks - what is the best practice for using or not using nested columns? For example, I have a customer struct with Age, Gender, Name, etc. attributes. Is it better to keep it as a struct or to separate it into its own columns?

tadastadux

It seems the time and I/O needed to sort the data first, before using it, are not considered?

higiniofuentes

Interesting how Parquet (columnar, analytics-focused) data can be optimized using dictionary-based compression and partitioning.

salookie
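
For readers curious about the dictionary-based compression mentioned in the comment above, here is a minimal sketch of the idea in plain Python. The function names `dictionary_encode` and `dictionary_decode` are hypothetical, and this is not how Parquet lays out bytes on disk; it only illustrates replacing repeated values with small integer codes that point into a list of distinct values.

```python
from typing import Dict, List, Tuple


def dictionary_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Return (dictionary, codes): distinct values in first-seen order,
    plus one integer code per input value."""
    dictionary: List[str] = []
    index_of: Dict[str, int] = {}
    codes: List[int] = []
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index_of[value])
    return dictionary, codes


def dictionary_decode(dictionary: List[str], codes: List[int]) -> List[str]:
    """Rebuild the original column from the dictionary and the codes."""
    return [dictionary[code] for code in codes]


column = ["Netherlands", "Germany", "Netherlands", "Netherlands", "Germany"]
dictionary, codes = dictionary_encode(column)

print(dictionary)
print(codes)
```

A reader that has the dictionary can also answer some filters cheaply: if a searched value does not appear in `dictionary`, no row in that chunk can match.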

I haven't watched this yet but for the sake of prioritizing when I do, how well does this topic apply to platforms and systems other than Spark?

chrisjfox

Anyone who says Parquet is a columnar format has just bookish knowledge.

rum

Thanks for posting this presentation. Could you clarify something? How does performance improve when you compress pages only to decompress them again to read them? I'm sure I'm not understanding something, but I'm not sure what.

spacedustpi
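
One way to read the question above: compression trades some CPU time for fewer bytes moved from storage, which tends to pay off when I/O is the bottleneck. The sketch below uses `zlib` from the Python standard library purely as a stand-in codec (an assumption for portability; Parquet writers more commonly use codecs such as Snappy, gzip or ZSTD) and a made-up repetitive "page" of country codes.

```python
import zlib

# Hypothetical page contents: a highly repetitive column of country codes.
page = ",".join(["NL", "US", "NL", "DE"] * 2500).encode("utf-8")

compressed = zlib.compress(page)
restored = zlib.decompress(compressed)

# Bytes that would have to be read from storage in each case.
print("uncompressed bytes:", len(page))
print("compressed bytes:  ", len(compressed))
print("round trip intact: ", restored == page)
```

Whether the saved I/O outweighs the decompression cost depends on the codec, the data and the hardware, so this is an illustration of the trade-off rather than a benchmark.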

How is storing JSON/XML (not Parquet) more efficient than CSV? You literally store the "column names" in each "row" in XML/JSON (at least when stored in a text file). Also, there is definitely the notion of a "record" in CSV.

jeremygiaco

Bucketing explanation was not great. Rest was fantabulous.

thevijayraj
