Apache Iceberg Tutorial for Beginners: Understanding Copy-on-write and Merge-on-read

This Apache Iceberg 101 Course #7 focuses on Copy-on-Write (COW) and Merge-on-Read (MOR) - two essential concepts in data lakehouse table formats. In this course, you will learn what Copy-on-Write and Merge-on-Read are, as well as what a delete file is and when to use MOR or COW.

Copy-on-Write is a strategy used in data lakehouse table formats in which data files are never modified in place. When rows are updated or deleted, every data file that contains an affected row is rewritten as a new file with the change applied, and a new version (snapshot) of the table points to the new files. The original version of the table remains unchanged. For example, if a user deletes a single row, the data file holding that row is copied without it, and the new snapshot references the copy while unaffected files are reused.
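A minimal Python sketch of the copy-on-write idea, using an in-memory dictionary as a stand-in for the table's data files. The file names, the `.rewritten` suffix, and the `copy_on_write_delete` helper are hypothetical and are not part of Iceberg's API; they only illustrate that a delete produces a new table version while the old one stays intact.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Snapshot = Dict[str, List[Row]]  # file name -> rows stored in that file


def copy_on_write_delete(files: Snapshot, predicate: Callable[[Row], bool]) -> Snapshot:
    """Return a new snapshot with rows matching `predicate` removed.

    Files without matching rows are reused as-is. Files with matches are
    rewritten under a new (hypothetical) name. The input is never mutated.
    """
    new_snapshot: Snapshot = {}
    for name, rows in files.items():
        if any(predicate(r) for r in rows):
            kept = [r for r in rows if not predicate(r)]
            if kept:  # a file left with no rows is simply dropped
                new_snapshot[name + ".rewritten"] = kept
        else:
            new_snapshot[name] = rows
    return new_snapshot


snapshot_1 = {
    "data-a.parquet": [{"id": 1}, {"id": 2}],
    "data-b.parquet": [{"id": 3}, {"id": 4}],
}
# Delete the row with id == 3: only data-b.parquet has to be rewritten.
snapshot_2 = copy_on_write_delete(snapshot_1, lambda r: r["id"] == 3)
```

Note that the write pays the cost of rewriting a whole file to remove one row, while a reader of `snapshot_2` has nothing to reconcile.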

Merge-on-Read is the other key strategy used in data lakehouse table formats. Instead of rewriting data files, the writer records changes separately (in Iceberg, deleted rows are tracked in delete files and updated rows are written to new data files), and the query engine merges them with the existing data files when the table is read. The result is a single view that reflects all changes. For example, when a query scans a table that has delete files, the engine reads both the data files and the delete files and filters out the deleted rows before returning results.
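The read-side merge can be sketched in a few lines of Python. This is an illustrative in-memory model, assuming deletes are recorded as (file name, row position) pairs; the names `merge_on_read_scan` and `position_deletes` are hypothetical and do not correspond to Iceberg's API.

```python
from typing import Dict, List, Set, Tuple

Row = Dict[str, object]


def merge_on_read_scan(
    data_files: Dict[str, List[Row]],
    position_deletes: Set[Tuple[str, int]],
) -> List[Row]:
    """Return the rows a reader sees: data files minus recorded deletes."""
    visible = []
    for name, rows in data_files.items():
        for pos, row in enumerate(rows):
            # The merge happens here, at read time, for every scan.
            if (name, pos) not in position_deletes:
                visible.append(row)
    return visible


data_files = {
    "data-a.parquet": [{"id": 1}, {"id": 2}],
    "data-b.parquet": [{"id": 3}, {"id": 4}],
}
# The writer only recorded "row 0 of data-b.parquet is deleted";
# no data file was rewritten.
position_deletes = {("data-b.parquet", 0)}
visible = merge_on_read_scan(data_files, position_deletes)
```

The write is cheap because it only adds a small delete record, but every read repeats the filtering work until the files are compacted.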

In addition to Copy-on-Write and Merge-on-Read, this Apache Iceberg 101 course also covers delete files: files that record which rows have been deleted from a table. Delete files are what make Merge-on-Read work. They let a writer mark rows as deleted without rewriting entire data files each time something is removed. Iceberg has two kinds: position deletes, which identify a row by its data file and position, and equality deletes, which identify rows by column values.
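Iceberg defines two kinds of delete files, position deletes and equality deletes. The sketch below contrasts them in plain Python under simplifying assumptions: data files are in-memory lists, sequence numbers are ignored, and all function names are hypothetical rather than taken from any Iceberg library.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]


def write_position_deletes(
    data_files: Dict[str, List[Row]], predicate: Callable[[Row], bool]
) -> List[Tuple[str, int]]:
    """Writer must scan the data to find where each matching row lives."""
    return [
        (name, pos)
        for name, rows in data_files.items()
        for pos, row in enumerate(rows)
        if predicate(row)
    ]


def write_equality_delete(column: str, value: object) -> Row:
    """Writer records only the matching values; no scan is needed."""
    return {column: value}


def read_with_deletes(
    data_files: Dict[str, List[Row]],
    position_deletes: List[Tuple[str, int]],
    equality_deletes: List[Row],
) -> List[Row]:
    """Reader applies both kinds of deletes while scanning."""
    deleted_positions = set(position_deletes)
    visible = []
    for name, rows in data_files.items():
        for pos, row in enumerate(rows):
            if (name, pos) in deleted_positions:
                continue  # cheap lookup by location
            if any(all(row.get(c) == v for c, v in d.items()) for d in equality_deletes):
                continue  # every row is compared against every equality delete
            visible.append(row)
    return visible


data_files = {
    "data-a.parquet": [{"id": 1}, {"id": 2}],
    "data-b.parquet": [{"id": 3}, {"id": 4}],
}
pos_deletes = write_position_deletes(data_files, lambda r: r["id"] == 3)
eq_deletes = [write_equality_delete("id", 1)]
visible = read_with_deletes(data_files, pos_deletes, eq_deletes)
```

The trade-off shows up in where the work lands: position deletes cost the writer a scan but are cheap to apply, while equality deletes are nearly free to write and push the matching work onto every read.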

When deciding between Copy-on-Write and Merge-on-Read, it's important to consider how often data in the table is modified and how read performance weighs against write performance. If updates and deletes are frequent, MOR is usually more suitable, since writes are fast and avoid rewriting data files, at the cost of extra merge work on every read until the table is compacted. If writes are infrequent and the table is read often, COW is usually more suitable, since reads need no reconciliation.


Comments

Very helpful! Thanks for also explaining the two types of delete files.

cw

Great explanation! Thank you for this video!

xabrielcollazomojica

Hi Alex,
Very good Content and Explanation.

shyjukoppayilthiruvoth

Thanks for putting this presentation together, it's a great overview.

It's not clear from the video, how do we specify position versus equality deletes?

peterconnolly

Awesome video!!
At 3:18, when you explain the different delete formats, I have a question about the implementation:
Since the delete mode only accepts MOR or COW, how exactly do I specify whether the delete operation uses equality deletes or positional deletes?

kenhung

[Notes Part-2]
Setting the table for COW or MOR:

The following table properties must be set to either "copy-on-write" or "merge-on-read".
They all default to "copy-on-write".
Note that there is no single table-wide setting. There are separate settings for deletes, updates, and merges, so each operation can use either approach independently.

For update queries: write.update.mode
For delete queries: write.delete.mode
For merge/upsert queries: write.merge.mode

We set all these properties under the TBLPROPERTIES of the table.
Each of these properties can have a different value, depending on what we want to use for each operation type.

We can also alter these properties after the table is created using ALTER TABLE xyz SET TBLPROPERTIES(....)
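
As a rough illustration of the notes above, here is a small Python helper that builds such an ALTER TABLE statement as a string. The helper name and the table name `catalog.db.events` are hypothetical, and the sketch assumes the merge property is named `write.merge.mode`, as in the Iceberg table property documentation. It only produces SQL text; running it against a real catalog (for example through a Spark session) is left out.

```python
VALID_MODES = {"copy-on-write", "merge-on-read"}


def set_write_modes_sql(
    table: str,
    delete_mode: str = "copy-on-write",
    update_mode: str = "copy-on-write",
    merge_mode: str = "copy-on-write",
) -> str:
    """Build an ALTER TABLE statement that sets the three write-mode properties."""
    props = {
        "write.delete.mode": delete_mode,
        "write.update.mode": update_mode,
        "write.merge.mode": merge_mode,
    }
    for key, value in props.items():
        if value not in VALID_MODES:
            raise ValueError(f"{key} must be one of {sorted(VALID_MODES)}, got {value!r}")
    assignments = ", ".join(f"'{k}'='{v}'" for k, v in props.items())
    return f"ALTER TABLE {table} SET TBLPROPERTIES ({assignments})"


# Hypothetical table: deletes use merge-on-read, updates and merges stay copy-on-write.
sql = set_write_modes_sql("catalog.db.events", delete_mode="merge-on-read")
```

Because each operation has its own property, a table can, for instance, take fast merge-on-read deletes from a streaming job while nightly batch merges keep rewriting files.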


When to use which write mode?

- COW: Daily batch jobs where write speed is not a priority and read time is. Reads are faster because there is no reconciliation to perform. Most batch analytics workloads are WORM (Write Once, Read Many), and for such workloads COW tables make more sense.

- MOR (position deletes): Streaming and higher-frequency batch jobs (e.g. hourly), where write speed is very important, with a minor cost to read times. Regular compaction is needed.

- MOR (equality deletes): Very write-intensive jobs where position deletes are not fast enough. This has a much larger cost at read time and requires frequent compaction jobs.

sukulmahadik

What is actually done is Append-on-Write, not Copy-on-Write, because the file is written elsewhere and only the pointer changes to the file with the new row.

zayedet

Thanks, Alex, for the great explanation.
It is not clear to me what delete files contain when an update statement is issued against the table.
Will the delete files hold the post-image of the rows, for example, or what happens?

thanks

SayedElhewihey

Which version of Spark supports delete files?

ashmkrgao

1:40 Why do you say that, in Hive, updating a row implies rewriting all the files in the affected partition? Why not just the Parquet file that contains the updated row? I mean, why would the other Parquet files in the partition have to be rewritten?

galeop