File Formats [Row based vs Columnar Format] #parquet #avro #orc

Row Based Format

Data is stored row by row. Examples:
SequenceFile, MapFile, Avro Datafile.
With this layout, even if only a small part of a row needs to be accessed, the entire row has to be read into memory.
Lazy deserialization can alleviate the problem to some extent, but the overhead of reading the whole row from disk cannot be avoided.
Row-oriented storage is suitable for situations where all the fields of a row are processed together.
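
To make the cost of a row layout concrete, here is a minimal sketch in Python. It does not use any real file format: the fixed-width record (id, age, score) and the helper names are assumptions made up for illustration. It shows that reading a single field still pulls every byte of every row.

```python
import io
import struct

# Hypothetical fixed-width record: id (int32), age (int32), score (float64).
ROW = struct.Struct("<iid")

def write_rows(rows):
    """Serialize records one after another, row by row."""
    buf = io.BytesIO()
    for row in rows:
        buf.write(ROW.pack(*row))
    return buf.getvalue()

def read_ages(data):
    """Collect only the 'age' field and count the bytes that had to be read."""
    ages, bytes_read = [], 0
    stream = io.BytesIO(data)
    while True:
        chunk = stream.read(ROW.size)  # the whole row is read, not just one field
        if not chunk:
            break
        bytes_read += len(chunk)
        _id, age, _score = ROW.unpack(chunk)
        ages.append(age)
    return ages, bytes_read

rows = [(1, 30, 88.5), (2, 41, 92.0), (3, 25, 79.25)]
data = write_rows(rows)
ages, bytes_read = read_ages(data)
```

In this toy example only 4 of the 16 bytes in each row belong to the age field, yet the reader has to go through the whole file to collect the ages.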

Column Based Format
The file is split into columns of data, and the values of each column are stored together. Examples:
Parquet, ORCFile.
The column-oriented format makes it possible to skip unneeded columns when reading data, so it is suitable for situations where only a small number of each row's fields are needed.
But reading and writing this format requires more memory, because a batch of rows has to be buffered in memory (to gather the values of one column across multiple rows).
It is also not well suited to streaming writes: if a write fails, the current file cannot be recovered. Row-oriented data, by contrast, can be resynchronized to the last sync point after a failed write, which is why Flume uses a row-oriented storage format.
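
The column skipping and the extra buffering described above can be sketched in a few lines of Python. This is an illustration only, not how Parquet or ORC are actually laid out: the record fields, the `index` dictionary standing in for a file footer, and the helper names are all assumptions.

```python
import struct

def write_columnar(rows):
    """Buffer a batch of rows in memory, then write each column contiguously."""
    ids, ages, scores = zip(*rows)  # the whole batch must be held before writing
    n = len(rows)
    chunks = [
        struct.pack(f"<{n}i", *ids),
        struct.pack(f"<{n}i", *ages),
        struct.pack(f"<{n}d", *scores),
    ]
    # Hypothetical footer: (offset, length) of each column chunk.
    index, offset = {}, 0
    for name, chunk in zip(("id", "age", "score"), chunks):
        index[name] = (offset, len(chunk))
        offset += len(chunk)
    return b"".join(chunks), index

def read_column(data, index, name, fmt):
    """Read one column only and report how many bytes were touched."""
    offset, length = index[name]
    chunk = data[offset:offset + length]  # other columns are skipped entirely
    count = length // struct.calcsize("<" + fmt)
    return list(struct.unpack(f"<{count}{fmt}", chunk)), len(chunk)

rows = [(1, 30, 88.5), (2, 41, 92.0), (3, 25, 79.25)]
data, index = write_columnar(rows)
ages, bytes_read = read_column(data, index, "age", "i")
```

Here the reader touches only the bytes of the age column. The price is paid on the write side: `write_columnar` cannot emit anything until it has seen the whole batch, which is the memory overhead and the streaming limitation mentioned above.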
