Apache Parquet: Parquet file internals and inspecting Parquet file structure

In this video we look at the internal structure of the Apache Parquet storage format and use the parquet-tools utility to inspect the contents of a file. Apache Parquet is a columnar storage format available in the Hadoop ecosystem.
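For readers who want to poke at the same structure programmatically, here is a minimal Java sketch using the parquet-hadoop library that prints the schema and per-row-group metadata from the file footer, roughly what parquet-tools shows; the file name is a placeholder:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class InspectParquet {
    public static void main(String[] args) throws Exception {
        // Open the file and read only its footer (schema + row group metadata)
        try (ParquetFileReader reader = ParquetFileReader.open(
                HadoopInputFile.fromPath(new Path("example.parquet"), new Configuration()))) {
            ParquetMetadata footer = reader.getFooter();
            System.out.println(footer.getFileMetaData().getSchema());
            // One BlockMetaData per row group in the file
            for (BlockMetaData rowGroup : footer.getBlocks()) {
                System.out.printf("row group: %d rows, %d bytes%n",
                        rowGroup.getRowCount(), rowGroup.getTotalByteSize());
            }
        }
    }
}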

Comments

Best explanation of the Parquet file and columnar file format I have come across so far. Thank you very much.

srividyaus

Interesting video showing a single row group...

You present well, and clearly have a solid grasp of the Parquet file format.

If you're interested in preparing a sequel to your video, consider showing a diagram of MULTIPLE row groups, each stored on a different disk in a different node in a cluster, so that a row group represents the "sharding" (splitting across rows in the logical representation of a table) of a logical table, with the shards-as-row-groups distributed across DIFFERENT nodes.

Then you could explore what happens during a query like "What is the average square footage in ZIP code 60542?"

This query can and will be PARALLELIZED into one query per disk where a portion of the larger (logical) table has been stored.

What's COOL about Parquet is this:
- In a ROW-based storage format, to get the sqft value from a single record I have to read EACH row, FIND the sqft field, and return it.
- Therefore, in row-based "shards" containing (say) 10,000 rows across (say) 10 disks (so 1,000 rows per disk), I have to make 10,000 READS across different regions of my disks... VERY INEFFICIENT just to get a SINGLE field (sqft) :-(
- In a COLUMN-based storage format I simply make ONE read per shard, starting where the sqft data begins and stopping where the field ends. In a SINGLE read (NOT 1,000) I have ALL the sqft values in that shard representing those rows in my larger (logical) table :-)
- MEANWHILE, each of my other (say 10) disks also containing this (logical) table needs only ONE read as well.

The result?
Instead of 10,000 reads across 10 disks just to get 10,000 measly sqft values to average...
... the Parquet format lets me make only 10 reads and get the same 10,000 values :-O

Illustrate THIS in your next video ;-)
You'll be a hero :-)
-Mark in North Aurora IL ...

markevogt
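What Mark describes is, in effect, Parquet's column projection: a reader can request a single field and skip the other column chunks entirely. A minimal Java sketch with parquet-avro, assuming a hypothetical houses.parquet file that has a double sqft column:

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroReadSupport;
import org.apache.parquet.hadoop.ParquetReader;

public class AverageSqft {
    public static void main(String[] args) throws Exception {
        // Projection schema: request only the (hypothetical) sqft field
        Schema projection = SchemaBuilder.record("House").fields()
                .requiredDouble("sqft").endRecord();
        Configuration conf = new Configuration();
        AvroReadSupport.setRequestedProjection(conf, projection);

        double sum = 0;
        long count = 0;
        try (ParquetReader<GenericRecord> reader = AvroParquetReader
                .<GenericRecord>builder(new Path("houses.parquet"))
                .withConf(conf)
                .build()) {
            GenericRecord record;
            while ((record = reader.read()) != null) {
                sum += (Double) record.get("sqft");
                count++;
            }
        }
        System.out.println("average sqft: " + (sum / count));
    }
}

Because only the sqft column chunks are read and decoded, the I/O is proportional to one column, not the whole table.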

Great overview! Thanks for taking the time to record it!

flwi

Best explanation of the columnar storage format.

abhijeetzagade

Nice video, but I don't see any row group tuning parameter directly. It seems it is tuned via block.size itself. Is my understanding correct?

aniruddhnathani
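That understanding looks right: in parquet-mr the row group size is exactly the writer's "block size" (the parquet.block.size property). A minimal sketch with parquet-avro's writer builder; the schema and file name here are placeholders:

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class WriteWithRowGroupSize {
    public static void main(String[] args) throws Exception {
        Schema schema = SchemaBuilder.record("Rec").fields()
                .requiredLong("id").endRecord();
        try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
                .<GenericRecord>builder(new Path("out.parquet"))
                .withSchema(schema)
                // Row group size in bytes; this is the value behind parquet.block.size
                .withRowGroupSize(128 * 1024 * 1024)
                .build()) {
            GenericRecord rec = new GenericData.Record(schema);
            rec.put("id", 1L);
            writer.write(rec);
        }
    }
}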

Awesome talk. Melvin, can you share your slides, via SlideShare or something?

rogermenezes

Hi all, I am searching for a way to load a Parquet file, but not in one go; I want to load it in parts. How can I achieve this in Java? Any implementation reference would be highly appreciated. I have gone through a few articles, but none were up to the mark.

sunilmali
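One hedged pointer for the question above: parquet-mr's ParquetReader already streams records one at a time rather than materializing the whole file, so reading "in parts" largely falls out of the API. A minimal sketch, assuming a placeholder file name:

import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;

public class StreamParquet {
    public static void main(String[] args) throws Exception {
        try (ParquetReader<GenericRecord> reader = AvroParquetReader
                .<GenericRecord>builder(new Path("data.parquet"))
                .build()) {
            GenericRecord record;
            // read() pulls one record at a time; row groups are loaded as needed
            while ((record = reader.read()) != null) {
                System.out.println(record);
            }
        }
    }
}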

Why does Parquet store the data in a row layout (row groups)? Doesn't it store the data as columns side by side?

brianz

Great talk!
I set up a Spark cluster with 2 workers. I save a DataFrame using partitionBy("column x") in Parquet format to some path on each worker. The thing is, I am able to save it, but if I want to read it back I get these errors: "Could not read footer for file: FileStatus..." and "unable to specify Schema ...". Any suggestions?

djibb.
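A possible cause, offered as a guess without the full stack trace: with partitionBy, each worker may have written its output to its own local disk, so no single machine sees a complete dataset and the footers appear unreadable. Writing to storage that every node can reach (HDFS, S3, etc.) usually avoids this. A sketch in Spark's Java API; the paths and column name are placeholders:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PartitionedParquet {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("partitioned-parquet").getOrCreate();
        Dataset<Row> df = spark.read().json("hdfs:///shared/input.json");
        // Write to a filesystem every node can reach, not a worker-local path
        df.write().partitionBy("column_x").parquet("hdfs:///shared/out");
        // Reading the root path back picks up all partition directories
        Dataset<Row> back = spark.read().parquet("hdfs:///shared/out");
        back.show();
        spark.stop();
    }
}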

What happens if I write a Parquet file that has 2 row groups?

__dio