Data Lake Fundamentals, Apache Iceberg and Parquet in 60 minutes on DataExpert.io

We'll be covering data lakes, the Parquet file format, data compression, and shuffle!

Comments

This channel is gold for any young data engineer. I wish I could pay you but you're probably already swimming in enough data :D

alonzo_go

Zach! We just started our project where we will be transferring our data to a data lake in Parquet! This is a very timely video. Awesome job, as always!

nobodyinparticula

Great lesson Zach! I have always wondered what the hell a Data Lake is. Great explanations and super easy to understand!

justinwilkinson

Great video Zach, awesome content, I learnt a lot. Can you please make a video or share some content about why we should avoid shuffling, common shuffle issues, and ways to fix them?

rohitdeshmukh
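
One of the most common shuffle fixes is to broadcast the small side of a join so the large side never has to be repartitioned on the join key. A minimal PySpark sketch of that idea, assuming a large fact table and a small lookup table; the table and column names here are hypothetical:

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast_join_sketch").getOrCreate()

events = spark.table("bootcamp.events")          # large fact table
countries = spark.table("bootcamp.countries")    # small lookup table

# broadcast() ships the small table to every executor, so the join happens
# locally and the large table is never shuffled on country_code.
joined = events.join(broadcast(countries), on="country_code", how="left")
joined.write.mode("overwrite").saveAsTable("bootcamp.events_enriched")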

Zach, I watched this while going to the office, and I loved it. I learnt a hell of a lot of things. Thanks for it!

vivekjha

Great and insightful lessons Zach, just high-quality content! Your community of loyal DEs is growing :) Keep it up!

theloniusmonkey

Awesome video man! Just discovered your channel and excited to see more like this

andydataguy

Need more of these videos, beginner friendly 💡

muhammadzakiahmad

Thanks Zach, the practical you showed helped me learn a lot. Can you please tell me: if I do daily sorted inserts into my Iceberg table from my OLTP system using an ETL pipeline, will Iceberg treat each insert on its own and compress and store it that way, or will it also look at common columns in the existing data files and compress across them?

ManishJindalmanisism
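
On the question above: each insert commit writes new Parquet data files, and the run-length/dictionary encoding happens only within those new files; Iceberg does not go back and re-sort or re-compress existing data files on insert. To get the sort benefit across many daily inserts, you run a periodic compaction. A minimal sketch, assuming Spark with the Iceberg extensions and a catalog named demo; table and column names are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg_compaction_sketch").getOrCreate()

# Daily sorted insert: only the files written by this commit benefit from the sort.
spark.sql("""
    INSERT INTO demo.db.events
    SELECT * FROM staging.daily_events
    ORDER BY event_date, country, user_id
""")

# Periodic compaction that rewrites small files and re-sorts data across commits,
# using Iceberg's rewrite_data_files procedure with the 'sort' strategy.
spark.sql("""
    CALL demo.system.rewrite_data_files(
        table => 'db.events',
        strategy => 'sort',
        sort_order => 'event_date, country, user_id'
    )
""")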

Wow - I learned so much from this video - Amazing! Thank you for sharing.

qculryq

Amazing class Zach! Keep going, thanks!

murilloandradef

It's a great video Zach, thoroughly enjoyed it.

vivekjha

@zach Thanks for this informative video. I have one question. You mentioned sorting the data on low-cardinality columns first and then moving towards high-cardinality ones for better RLE, which makes sense for getting more compressed data. But on the read side, taking Iceberg as an example, we generally filter on high-cardinality columns, and so we want to sort on those columns so that predicate pushdown reads a much smaller subset of the data. These two settings seem to contradict each other: one gives us smaller data on disk, while the other pushes us to sort on high-cardinality columns to read less.

atifiu
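
One way to frame the trade-off in the question above: put the low-cardinality columns first so run-length encoding collapses them into long runs, and put the high-cardinality filter column last in the sort order; within each run its values are still sorted, so file- and row-group-level min/max stats can still prune reads, just less aggressively than if it were the leading sort key. A minimal sketch of declaring such a write order on an Iceberg table, assuming Spark with the Iceberg SQL extensions and a catalog named demo; table and column names are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg_sort_order_sketch").getOrCreate()

# Low-cardinality columns first (better RLE compression); the high-cardinality
# filter column last (its min/max ranges per file still narrow within each run,
# which helps predicate pushdown on reads).
spark.sql("""
    ALTER TABLE demo.db.events
    WRITE ORDERED BY country, device_type, user_id
""")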

Wow, amazing content Zach.
Thank you so much!

srinubathina

Wow, the way people push VC is creative now. Good video.

JP-zzql

I have a question: during the whole video you've been dealing with historical data and moving it. What about newly received data, how do you deal with it? Do you insert it into some other table and then update your Iceberg table using cron jobs, or do you insert it directly into Iceberg, and how?

LMGaming
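
On the ingestion question above: Iceberg commits are atomic, so a scheduled batch job can append or merge new records straight into the table without a side table. A minimal sketch, assuming Spark with the Iceberg extensions, a catalog named demo, and a landing path in S3; all names are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg_incremental_load_sketch").getOrCreate()

# Read the newly arrived batch from a landing area.
new_batch = spark.read.parquet("s3://landing-bucket/events/2024-01-01/")
new_batch.createOrReplaceTempView("new_events")

# Upsert the batch directly into the Iceberg table: matched keys are updated,
# everything else is inserted, all in one atomic commit.
spark.sql("""
    MERGE INTO demo.db.events t
    USING new_events s
    ON t.event_id = s.event_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")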

Hello Zach, thanks for the content. After May, when is the next bootcamp?

pauladataanalyst

Casually ending the gender debate 😂 good video sir! Very informative

zwartepeat

This is amazing. You are a fabulous teacher. Had a question on replication: is the replication factor not a requirement any more in modern cloud data lakes?

thoughtfulsd

The tables you are using for your sources... Are those Iceberg tables, which are really just files and folders in S3 under the hood, placed there before the training? I'm just confused about where the raw data is coming from and what it looks like.

YEM_
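
On the question above: an Iceberg table really is just metadata files plus Parquet data files sitting in object storage, and you can inspect those underlying files through the table's metadata tables. A minimal sketch, assuming Spark with the Iceberg extensions and a catalog named demo; the table name is hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg_files_sketch").getOrCreate()

# Each row is one Parquet data file: its s3:// path, row count, and size.
spark.sql("""
    SELECT file_path, record_count, file_size_in_bytes
    FROM demo.db.events.files
""").show(truncate=False)

# The snapshots metadata table shows the commits that produced those files.
spark.sql("""
    SELECT snapshot_id, committed_at, operation
    FROM demo.db.events.snapshots
""").show()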