Insert | Update | Delete on a Data Lake (S3) with Apache Hudi and Glue PySpark

Note: You can also load your views on a schedule. Ideally, all your updates happen on the Hudi tables, and you then reload your Athena view on that schedule.
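
For context, a minimal sketch of what such an upsert looks like in a Glue PySpark job; the bucket, table, and field names below are illustrative placeholders, not the ones from the video, and the Hudi libraries are assumed to be on the classpath (e.g. via Glue's --datalake-formats hudi job parameter):

```python
# Minimal sketch of a Hudi upsert from a Glue PySpark job.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Kryo serialization is required by Hudi.
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

hudi_options = {
    "hoodie.table.name": "customers",                          # placeholder table name
    "hoodie.datasource.write.recordkey.field": "customer_id",  # unique key per record
    "hoodie.datasource.write.partitionpath.field": "country",  # S3 partition layout
    "hoodie.datasource.write.precombine.field": "updated_at",  # latest version wins
    "hoodie.datasource.write.operation": "upsert",             # insert new, update existing
}

incoming = spark.createDataFrame(
    [(1, "alice", "US", "2024-01-01"), (2, "bob", "IN", "2024-01-01")],
    ["customer_id", "name", "country", "updated_at"],
)

(
    incoming.write.format("hudi")
    .options(**hudi_options)
    .mode("append")                          # "append" is the normal mode for upserts
    .save("s3://my-bucket/hudi/customers/")  # placeholder path
)
```

Because the operation is an upsert keyed on customer_id, re-running the job with changed rows updates them in place rather than appending duplicates.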

All material and code can be found
Comments

What if the historical data volume of the parquet files is in the TBs and there is an update event for 2-3 records of historical data? Does it make sense to purge the entire set of parquet files and then repopulate them? Is there a more efficient way to handle updates?

AayushShah-jf

Replacing the entire set of target parquet files after deleting will be very expensive. Basically, it's like a truncate-and-reload every time.

agbtttt
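
This is the scenario Hudi's record-level index is designed for: an upsert locates the file groups that contain the affected keys and rewrites only those parquet files (for copy-on-write tables), not the whole dataset. A minimal sketch, reusing the SparkSession, hudi_options, and placeholder table from the example above:

```python
# Correcting a handful of historical records in a (potentially TB-scale)
# Hudi table. Hudi's index maps each record key to its file group, so only
# the parquet files containing these keys are rewritten -- no full reload.
# Assumes the SparkSession and hudi_options from the earlier sketch.
fixes = spark.createDataFrame(
    [(2, "bob-corrected", "IN", "2024-02-01")],  # the few changed records
    ["customer_id", "name", "country", "updated_at"],
)

(
    fixes.write.format("hudi")
    .options(**hudi_options)                 # same upsert configuration
    .mode("append")
    .save("s3://my-bucket/hudi/customers/")  # placeholder path
)
```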

Do we really need to delete the parquet file?

ramanagioe
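
Hudi does not require manually deleting parquet files: deletes are issued as a write operation, and Hudi rewrites or cleans the affected files itself. A sketch, again against the placeholder table above:

```python
# Record-level delete: pass the keys to remove with operation "delete".
# Hudi drops the matching records and manages the underlying parquet files;
# old file versions are removed later by its cleaner.
# Assumes the SparkSession and hudi_options from the earlier sketch.
to_delete = spark.createDataFrame(
    [(1, "alice", "US", "2024-03-01")],  # key(s) to remove
    ["customer_id", "name", "country", "updated_at"],
)

(
    to_delete.write.format("hudi")
    .options(**hudi_options)
    .option("hoodie.datasource.write.operation", "delete")  # override the upsert
    .mode("append")
    .save("s3://my-bucket/hudi/customers/")
)
```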

Good video, nicely explained. One piece of constructive feedback: too much use of the word "essentially".

abhijithalder

@soumil, could you clarify how the data movement happens between the Hudi folder and the target S3 folders (parquet files)? Do we need to do a hard delete?

ganeshsundar
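
On the data-movement question above: with Hudi there is no copy step from a "Hudi folder" into separate target parquet files, and no manual hard delete of files. The folder you write to on S3 is the table itself; the parquet files inside are created, versioned, and cleaned up by Hudi. A minimal sketch of reading the latest snapshot back, using the same placeholder path:

```python
# Reading the table back: the Hudi folder on S3 *is* the table, and a read
# returns the latest committed snapshot. There is no separate target folder
# to sync into or hard-delete from.
snapshot = spark.read.format("hudi").load("s3://my-bucket/hudi/customers/")
snapshot.select("customer_id", "name", "country").show()
```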

Thanks for this video, nicely explained.

kapiljoshi