Parquet vs Avro

In this video we cover the pros and cons of two popular file formats used in the Hadoop ecosystem: Apache Parquet and Apache Avro.

Agenda:
Where these formats are used
Similarities
Key considerations when choosing:
- Read vs write characteristics
- Tooling
- Schema evolution

General guidelines:
- Scenarios to keep data in both Parquet and Avro

Avro is a row-based storage format for Hadoop. However, Avro is more than a serialisation framework: it is also an IPC framework.
Parquet is a column-based storage format for Hadoop.

Both are highly optimised (compared to plain text), both are self-describing, and both support compression.
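The self-describing point is easy to verify, because the schema travels inside the file itself. A minimal Python sketch, assuming the pyarrow and fastavro packages are installed and using hypothetical local files events.parquet and events.avro:

```python
# Minimal sketch: both formats carry their own schema, so no external
# metadata store is needed to interpret the files.
# Assumes pyarrow and fastavro are installed; the file names are hypothetical.
import pyarrow.parquet as pq
from fastavro import reader

# Parquet keeps its schema (and row-group statistics) in the file footer.
print(pq.read_schema("events.parquet"))

# Avro keeps its writer schema in the file header.
with open("events.avro", "rb") as f:
    print(reader(f).writer_schema)
```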

If your use case typically scans or retrieves all of the fields in a row in each query, Avro is usually the best choice.
If your dataset has many columns, and your use case typically involves working with a subset of those columns rather than entire records, Parquet is optimized for that kind of work.
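To make the read-pattern difference concrete, here is a hedged PySpark sketch. It assumes Spark 3.x with the external spark-avro package on the classpath; the paths and the column names (user_id, amount) are hypothetical:

```python
from pyspark.sql import SparkSession

# Assumes the spark-avro package is available, e.g. a session started with
# --packages org.apache.spark:spark-avro_2.12:<your Spark version>.
spark = SparkSession.builder.appName("parquet-vs-avro").getOrCreate()

# Row-oriented Avro: every scan deserialises whole records, which is
# fine (and fast) when each query touches all fields anyway.
txns = spark.read.format("avro").load("/data/transactions.avro")

# Rewrite the same data as column-oriented Parquet.
txns.write.mode("overwrite").parquet("/data/transactions.parquet")

# Selecting a subset of columns from Parquet only reads those column
# chunks from disk, so wide tables with narrow queries do far less I/O.
subset = spark.read.parquet("/data/transactions.parquet").select("user_id", "amount")
subset.groupBy("user_id").sum("amount").show()
```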

Finally, the video covers cases where you may want to keep data in both file formats.
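One common "use both" arrangement, sketched below under the same assumptions as the previous example (hypothetical paths and column names), is to land incoming records in Avro for row-wise ETL and rewrite the validated data as Parquet for analytics:

```python
from pyspark.sql import SparkSession

# Same assumptions as above: Spark 3.x with the spark-avro package.
spark = SparkSession.builder.appName("avro-to-parquet").getOrCreate()

# Land raw events in Avro: row-oriented, good for whole-record ETL.
raw = spark.read.format("avro").load("/landing/clicks/2024-01-01.avro")

# Clean with full-row access, then rewrite as Parquet so downstream
# analytical queries can prune columns and partitions.
cleaned = raw.dropDuplicates(["event_id"]).filter("event_time IS NOT NULL")
cleaned.write.mode("append").partitionBy("event_date").parquet("/warehouse/clicks/")
```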
Comments

Succinct! This is how an explanation of any topic should be. Kudos.

helsonkumar

I would've liked slides associated with the talking points, but this talk was very informative and thoughtfully laid out. Thanks for that!

SamAndPatrick

Important points noted from above:
1. Parquet and Avro inherit the advantages and disadvantages that come with column-based and row-based structures, respectively.
2. Being column-based, Parquet is best suited for MPP systems like Impala, Azure SQL Data Warehouse, and AWS Redshift; it is used by Cloudera a lot.
3. Schema evolution is supported by both, but Avro is more fluid: Parquet only allows schema changes such as adding a new column when writing new rows and cannot add a column to old data, whereas Avro lets you change columns however you like, including adding a new column to old data (see the sketch after this list).
4. Analytical = Parquet: few columns and a lot of analytical functions, column-based.
Avro = ETL: selecting most or all columns for ETL processing, row-based.
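Sketch for point 3, assuming the fastavro package; the record and field names are hypothetical. An Avro reader schema can add a field with a default and still read files written before the change, with no rewrite of old data:

```python
from io import BytesIO
from fastavro import writer, reader

# Old schema (v1) and evolved schema (v2); names are hypothetical.
v1 = {"type": "record", "name": "User", "fields": [
    {"name": "id", "type": "long"},
]}
v2 = {"type": "record", "name": "User", "fields": [
    {"name": "id", "type": "long"},
    {"name": "country", "type": "string", "default": "unknown"},  # new column
]}

buf = BytesIO()
writer(buf, v1, [{"id": 1}])  # data written with the old schema
buf.seek(0)

# Read old data with the evolved schema: the missing field is filled
# in from its default, without rewriting the existing file.
for rec in reader(buf, reader_schema=v2):
    print(rec)  # {'id': 1, 'country': 'unknown'}
```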

I have a question: is Avro good for OLTP applications?

skms

Very clear and concise explanation of Avro and Parquet!

atishandhare

Very helpful and clear explanation.

garimasingh

Was wondering if you could give similar comparison with ORC file format. Thanks.

vishakha

Indeed very helpful! Waiting for the ORC comparison.

pravinpathak

Great talk!
I set up a Spark cluster with 2 workers. I save a DataFrame using partitionBy("column x") in Parquet format to some path on each worker. The problem is that I am able to save it, but when I try to read it back I get these errors: "Could not read footer for file FileStatus..." and "unable to specify Schema ...". Any suggestions?

djibb.

Very good explanation, it's helping a lot.

dhruvarveti

Thanks! Can you add a presentation on Parquet vs CarbonData?

brijoobopanna

Really helpful content. Looking forward to the ORC comparison.

gaganparnami

Can you please compare ORC and Parquet with an example? Thanks!

dharmeswaranparamasivam