File Formats [Row based vs Columnar Format] #parquet #avro #orc

Row Based Format
Data is stored row by row, with all fields of a record kept together. Examples:
SequenceFile, MapFile, Avro data file.
With this layout, even if only a few fields of a row are needed, the entire row must be read into memory.
Lazy deserialization can reduce the problem to some extent, but the cost of reading the whole row from disk cannot be avoided.
Row-oriented storage is suitable for situations where the entire row of data needs to be processed simultaneously.
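A minimal sketch of the row-oriented trade-off, using only the Python standard library. The JSON-lines encoding, the sample records, and the helper names `write_rows` and `read_field` are assumptions made for illustration; real formats such as Avro use their own binary encodings.

```python
import json

# Hypothetical sample records, used only for illustration.
rows = [
    {"id": 1, "name": "alice", "city": "Oslo"},
    {"id": 2, "name": "bob", "city": "Lima"},
    {"id": 3, "name": "carol", "city": "Pune"},
]

def write_rows(records):
    """Serialize one record per line, so all fields of a row sit together."""
    return "\n".join(json.dumps(r) for r in records).encode("utf-8")

def read_field(blob, field):
    """Return one field per row, plus how many bytes had to be decoded."""
    values = []
    bytes_decoded = 0
    for line in blob.splitlines():
        bytes_decoded += len(line)
        record = json.loads(line)  # the whole row is parsed, not just `field`
        values.append(record[field])
    return values, bytes_decoded

row_blob = write_rows(rows)
names, touched = read_field(row_blob, "name")
print(names, touched, len(row_blob))
```

Even though only `name` is requested, every byte of every row passes through the decoder, which is the overhead described above.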
Column Based Format
The file is split into column chunks, and the values of each column are stored together. Examples:
Parquet, ORCFile.
The column-oriented format makes it possible to skip unneeded columns when reading data, so it suits queries that access only a small subset of the fields in each row.
However, reading and writing this format requires more memory, because a group of rows has to be buffered in memory in order to assemble each column across multiple rows.
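The same idea can be sketched for a column-oriented layout. Again this is a toy encoding under stated assumptions: the offset index, the JSON chunks, and the helper names `write_columns` and `read_column` are hypothetical and do not reflect the actual Parquet or ORC file structure.

```python
import json

# Hypothetical sample records, used only for illustration.
rows = [
    {"id": 1, "name": "alice", "city": "Oslo"},
    {"id": 2, "name": "bob", "city": "Lima"},
    {"id": 3, "name": "carol", "city": "Pune"},
]

def write_columns(records):
    """Store each column contiguously; return the bytes and an offset index.

    The whole batch of rows must be in memory before anything is written,
    which mirrors the extra buffering a columnar writer needs.
    """
    blob = b""
    index = {}
    for field in records[0]:
        chunk = json.dumps([r[field] for r in records]).encode("utf-8")
        index[field] = (len(blob), len(chunk))  # (offset, length)
        blob += chunk
    return blob, index

def read_column(blob, index, field):
    """Decode a single column; the other columns are never touched."""
    offset, length = index[field]
    chunk = blob[offset:offset + length]
    return json.loads(chunk), length

col_blob, col_index = write_columns(rows)
cities, touched = read_column(col_blob, col_index, "city")
print(cities, touched, len(col_blob))
```

Here only the bytes of the `city` chunk are decoded, so the reader skips the `id` and `name` columns entirely.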
The column-oriented format is also poorly suited to streaming writes: if a write fails, the current file cannot be recovered. Row-oriented data, by contrast, can be resynchronized to the last sync point after a failed write, which is why Flume uses a row-oriented storage format.
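The recovery behaviour of a row-oriented stream can be illustrated with a small sketch. The fixed `SYNC` marker, the 2-byte length prefix, and the functions `write_stream` and `recover` are assumptions for this example only; real formats define their own sync markers and framing.

```python
SYNC = b"\x00SYNC\x00"  # hypothetical marker; real formats define their own

def write_stream(records, sync_every=2):
    """Append length-prefixed records, emitting a sync marker every N records."""
    out = b""
    for i, rec in enumerate(records, start=1):
        payload = rec.encode("utf-8")
        out += len(payload).to_bytes(2, "big") + payload
        if i % sync_every == 0:
            out += SYNC
    return out

def recover(blob):
    """Return the records written before the last complete sync marker."""
    pos = blob.rfind(SYNC)
    if pos == -1:
        return []  # no sync point was reached, nothing can be trusted
    records = []
    for block in blob[:pos].split(SYNC):
        i = 0
        while i < len(block):
            n = int.from_bytes(block[i:i + 2], "big")
            records.append(block[i + 2:i + 2 + n].decode("utf-8"))
            i += 2 + n
    return records

stream = write_stream(["e1", "e2", "e3", "e4", "e5"])
damaged = stream[:-2]  # simulate a write that failed partway through "e5"
recovered = recover(damaged)
print(recovered)
```

Everything after the last marker is discarded, so the partially written record is lost but the earlier records remain readable. A columnar file has no comparable point to fall back to in the middle of a write.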