Parquet vs Avro

In this video we cover the pros and cons of two popular file formats used in the Hadoop ecosystem: Apache Parquet and Apache Avro.

Agenda:
Where these formats are used
Similarities
Key considerations when choosing:
- Read vs write characteristics
- Tooling
- Schema evolution

General guidelines:
- Scenarios to keep data in both Parquet and Avro

Avro is a row-based storage format for Hadoop. However, Avro is more than a serialisation framework: it is also an IPC framework.
Parquet is a column-based storage format for Hadoop.

Both are highly optimised (compared to plain text), both are self-describing, and both support compression.
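The self-describing point is easy to verify, because the schema travels inside the file itself. A minimal Python sketch, assuming the pyarrow and fastavro packages are installed and using hypothetical local files events.parquet and events.avro:

```python
# Minimal sketch: both formats carry their own schema, so no external
# metadata store is needed to interpret the files.
# Assumes pyarrow and fastavro are installed; the file names are hypothetical.
import pyarrow.parquet as pq
from fastavro import reader

# Parquet keeps its schema (and row-group statistics) in the file footer.
print(pq.read_schema("events.parquet"))

# Avro keeps its writer schema in the file header.
with open("events.avro", "rb") as f:
    print(reader(f).writer_schema)
```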

If your use case typically scans or retrieves all of the fields in a row in each query, Avro is usually the best choice.
If your dataset has many columns, and your use case typically involves working with a subset of those columns rather than entire records, Parquet is optimized for that kind of work.
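To make the read-pattern difference concrete, here is a hedged PySpark sketch. It assumes Spark 3.x with the external spark-avro package on the classpath; the paths and the column names (user_id, amount) are hypothetical:

```python
from pyspark.sql import SparkSession

# Assumes the spark-avro package is available, e.g. a session started with
# --packages org.apache.spark:spark-avro_2.12:<your Spark version>.
spark = SparkSession.builder.appName("parquet-vs-avro").getOrCreate()

# Row-oriented Avro: every scan deserialises whole records, which is
# fine (and fast) when each query touches all fields anyway.
txns = spark.read.format("avro").load("/data/transactions.avro")

# Rewrite the same data as column-oriented Parquet.
txns.write.mode("overwrite").parquet("/data/transactions.parquet")

# Selecting a subset of columns from Parquet only reads those column
# chunks from disk, so wide tables with narrow queries do far less I/O.
subset = spark.read.parquet("/data/transactions.parquet").select("user_id", "amount")
subset.groupBy("user_id").sum("amount").show()
```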

Finally, the video covers cases where you may want to keep data in both file formats.
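One common "use both" arrangement, sketched below under the same assumptions as the previous example (hypothetical paths and column names), is to land incoming records in Avro for row-wise ETL and rewrite the validated data as Parquet for analytics:

```python
from pyspark.sql import SparkSession

# Same assumptions as above: Spark 3.x with the spark-avro package.
spark = SparkSession.builder.appName("avro-to-parquet").getOrCreate()

# Land raw events in Avro: row-oriented, good for whole-record ETL.
raw = spark.read.format("avro").load("/landing/clicks/2024-01-01.avro")

# Clean with full-row access, then rewrite as Parquet so downstream
# analytical queries can prune columns and partitions.
cleaned = raw.dropDuplicates(["event_id"]).filter("event_time IS NOT NULL")
cleaned.write.mode("append").partitionBy("event_date").parquet("/warehouse/clicks/")
```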
Comments

Succinct! This is how an explanation of any topic should be. Kudos.

helsonkumar

I would've liked slides associated with the talking points, but this talk was very informative and thoughtfully laid out. Thanks for that!

SamAndPatrick

Important points noted from above:
1. Parquet and Avro inherit the advantages and disadvantages that come with column-based and row-based structures, respectively.
2. Being column-based, Parquet is best suited for MPP systems like Impala, Azure SQL Data Warehouse, and AWS Redshift; it is used by Cloudera a lot.
3. Schema evolution is supported by both, but Avro is more fluid: Parquet only allows schema changes such as adding a new column when writing new rows and cannot add a column to old data, whereas Avro lets you change columns however you like, including adding a new column to old data (see the sketch after this list).
4. Analytical = Parquet: few columns and a lot of analytical functions, column-based.
Avro = ETL: selecting most or all columns for ETL processing, row-based.
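Sketch for point 3, assuming the fastavro package; the record and field names are hypothetical. An Avro reader schema can add a field with a default and still read files written before the change, with no rewrite of old data:

```python
from io import BytesIO
from fastavro import writer, reader

# Old schema (v1) and evolved schema (v2); names are hypothetical.
v1 = {"type": "record", "name": "User", "fields": [
    {"name": "id", "type": "long"},
]}
v2 = {"type": "record", "name": "User", "fields": [
    {"name": "id", "type": "long"},
    {"name": "country", "type": "string", "default": "unknown"},  # new column
]}

buf = BytesIO()
writer(buf, v1, [{"id": 1}])  # data written with the old schema
buf.seek(0)

# Read old data with the evolved schema: the missing field is filled
# in from its default, without rewriting the existing file.
for rec in reader(buf, reader_schema=v2):
    print(rec)  # {'id': 1, 'country': 'unknown'}
```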

I have a question: is Avro good for OLTP applications?

skms

Very clear and concise explanation of Avro and Parquet!

atishandhare

Very helpful and clear explanation.

garimasingh

Was wondering if you could give similar comparison with ORC file format. Thanks.

vishakha

Indeed very helpful! Waiting for the ORC comparison.

pravinpathak

Great talk!
I set up a Spark cluster with 2 workers. I save a DataFrame using partitionBy("column x") in Parquet format to some path on each worker. The problem is that I am able to save it, but when I try to read it back I get these errors: "Could not read footer for file FileStatus..." and "unable to specify Schema ...". Any suggestions?

djibb.

Very good explanation, it's helping a lot.

dhruvarveti

Thanks! Can you add a presentation on Parquet vs CarbonData?

brijoobopanna

Really helpful content. Looking forward to the ORC comparison.

gaganparnami

Can you please compare ORC and Parquet with an example? Thanks!

dharmeswaranparamasivam