Apache Iceberg Tutorial: Learn the Problem & Solution Behind Iceberg's Origin Story


This Apache Iceberg 101 Course will cover the origin story of the data lakehouse table format. We will explore what Hive is, and why it was beneficial. We will also discuss the problems associated with Hive, as well as the solution that Apache Iceberg provides.

Hive is an open source data warehouse system developed by Facebook in 2008. It was created to provide an efficient way to manage large amounts of data stored in Hadoop clusters. The system was designed to make it easier for developers to write queries and manipulate data without having to learn a new programming language. Hive allowed users to work with structured data using SQL-like commands, making it easier for them to access their data from different sources.

The benefits of Hive were numerous, including its scalability, cost-effectiveness, and ease of use. However, there were some drawbacks associated with it as well. One issue was that Hive wasn’t able to handle unstructured or semi-structured data very well. This caused problems when users needed to access their data from different sources or when they wanted to use more complex queries than what Hive could provide. Another issue was that Hive had a steep learning curve, making it difficult for new users or those who weren’t familiar with SQL-like commands to get started quickly and easily.

In order to address these issues, Apache Iceberg was created as an open source alternative to Hive in 2016. Apache Iceberg is a data lakehouse table format that provides better support for unstructured and semi-structured data than what traditional data warehouses provide. It also makes querying and manipulating this type of data much simpler by providing a unified query language that works across multiple sources and systems. This makes it easier for developers and analysts alike to quickly get up and running without having to learn a new programming language or understand complex database architectures.

Apache Iceberg also provides performance benefits over traditional data warehouses due to its ability to store large amounts of unstructured or semi-structured data more efficiently than other systems. By using columnar storage formats such as Parquet, Apache Iceberg can dramatically reduce the time needed for I/O operations on large datasets while still providing fast query performance over the same datasets.



Comments

[Notes Part-2]
Netflix tried to address the problems with the Hive table format by creating a new table format.
The goals Netflix wanted to achieve with this format were:
- Table correctness/consistency
- Faster query planning and inexpensive execution – eliminate file listing and better file pruning
- Let users not worry about the physical layout of the data. Users should be able to write queries naturally and still get the benefit of partitioning.
- Table evolution – Allow adding and removing table columns and changing the partition scheme.
- Accomplish all of these at scale.

This led to the concept of Iceberg.
"With Iceberg, a table is a canonical list of files."
Instead of saying a table is all the files in a directory, the files in an Iceberg table could, in theory, be anywhere. They could be in different folders and do not have to sit in a nicely organized directory structure.
This is because Iceberg maintains the list of files in its metadata, and the engine leverages this metadata. This lets the engine get to the files faster.
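
A minimal Python sketch of the difference between "a table is a directory" and "a table is a list of files". The JSON layout, the `data_files` key, and the function names are invented for illustration; real Iceberg keeps this information in a metadata file, a manifest list, and manifest files rather than in one JSON list.

```python
import json
import tempfile
from pathlib import Path


def hive_style_files(table_dir: Path) -> list[str]:
    """Directory-as-table: find data files by recursively listing the directory."""
    return sorted(str(p) for p in table_dir.rglob("*.parquet"))


def iceberg_style_files(metadata_file: Path) -> list[str]:
    """List-as-table: read tracked file paths from metadata, with no directory listing."""
    return sorted(json.loads(metadata_file.read_text())["data_files"])


def demo() -> tuple[list[str], list[str]]:
    root = Path(tempfile.mkdtemp())
    # Two files under the table directory, one stored somewhere else entirely.
    inside = [
        root / "orders" / "day=1" / "a.parquet",
        root / "orders" / "day=2" / "b.parquet",
    ]
    elsewhere = root / "archive" / "c.parquet"
    for path in inside + [elsewhere]:
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(b"")  # empty placeholder, not real Parquet data

    # Hypothetical metadata file that names every file belonging to the table.
    metadata = root / "orders.metadata.json"
    metadata.write_text(
        json.dumps({"data_files": [str(p) for p in inside + [elsewhere]]})
    )
    return hive_style_files(root / "orders"), iceberg_style_files(metadata)


by_listing, by_metadata = demo()
print(len(by_listing), "files found by listing the directory")
print(len(by_metadata), "files named in the metadata")
```

The directory listing only sees what sits under `orders/`, while the metadata-driven reader also picks up the file stored under `archive/`, because membership is decided by the list and not by location.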

sukulmahadik

[Notes Part-1]
What is a table format?

- A way to organize a dataset’s files to present them as a single table. Our data lake may contain millions of data files that represent multiple datasets. The analytical tools need to know which files are part of which dataset. This is the problem that table formats solve.
- It’s a way to answer the question “what data is in this table”?

Hive Table Format:

- Old table format.
- Abstracted the complexity of MapReduce. Internally it converted SQL statements into MapReduce jobs.
- Created at Facebook
- Used “directories” on storage (HDFS/object storage) to indicate which files are part of which table. This format takes a simple approach: a particular directory is a table, all the contents of the directory are part of this table, and any subfolders are treated as partitions of the table.

- Pros:
o Works with almost every engine since it became the de-facto standard and stayed so for long.
o Partitioning allows more efficient query access patterns than full table scans.
o File format agnostic.
o Can atomically update a whole partition.
o Single, central answer to “what data is in this table” for the whole ecosystem.

- Cons:
o Small updates are very inefficient. Updating a few records wasn’t easy. A partition was the smallest unit of transaction. To update a few records, the entire partition or the entire table had to be swapped/rewritten.
o No way to atomically update multiple partitions. There is a risk that after one partition is changed and before work starts on the next partition, someone could query the table and read inconsistent data.
o Multiple jobs cannot safely modify the same dataset, i.e. safe concurrent writes are not possible.
o All of the directory listings needed for a large table take a long time. Engines have to list all the contents of the directories before returning results so that they can understand the metadata of the files that make up the table/partition.
o Filtering out files (pruning) required the engines to read those files. Opening and closing all the files to check whether they hold the data of interest was time consuming.
o Users of the datasets have to know the physical layout of the table. Without knowing the physical layout, we might write inefficient queries that lead to costly full table scans.
o Hive table statistics are often stale, and data engineers have to keep running Analyze queries to keep the stats up to date. Collecting stats needs more compute, and the stats are only as fresh as the most recent Analyze job.
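
The "users have to know the physical layout" point can be sketched in a few lines of Python. This is a toy model of directory-based partition pruning, not how any particular engine implements it; the paths, the `sale_date` column, and the function names are made up for the example.

```python
def partition_values(path: str) -> dict[str, str]:
    """Parse Hive-style key=value directory segments out of a file path."""
    return dict(seg.split("=", 1) for seg in path.split("/") if "=" in seg)


def prune(paths: list[str], predicate: dict[str, str]) -> list[str]:
    """Keep files that may match the predicate.

    Only predicate keys that are partition columns can rule a file out.
    A filter on any other column cannot prune anything at this level.
    """
    kept = []
    for path in paths:
        values = partition_values(path)
        if all(values[k] == v for k, v in predicate.items() if k in values):
            kept.append(path)
    return kept


FILES = [
    "sales/year=2023/month=01/part-0.parquet",
    "sales/year=2023/month=02/part-0.parquet",
    "sales/year=2024/month=01/part-0.parquet",
]

# Filtering on the partition column prunes down to one file.
pruned = prune(FILES, {"year": "2024"})

# Filtering on a regular column (hypothetical sale_date) keeps every file,
# so the engine falls back to scanning the whole table.
full_scan = prune(FILES, {"sale_date": "2024-01-15"})

print(len(pruned), len(full_scan))
```

Both filters ask for data from 2024, but only the one phrased in terms of the partition column benefits from the directory layout.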

sukulmahadik

Your voice is like an angel to fall asleep😇

zmihayl

Thank you for this Series! It’s great!

JimRohn-uc

[Notes Part-4]
Iceberg Design Benefits:
- Efficiently make smaller updates. Updates do not happen at the directory level now. Changes are made at the file level.
- Snapshot isolation for transactions. Every change to the table creates a new snapshot. All reads are on the newest snapshot. Reads and writes do not interfere with each other, and all writes are atomic. A read never sees a partial snapshot.
- Faster planning and execution. A lot of metadata is maintained at the individual file and partition level, allowing better file pruning while running the query. Column stats are maintained in the manifest files. These column stats are used to eliminate files.
- Reliable metrics for CBOs (vs. Hive). The column stats and metrics are calculated on write instead of by frequent, expensive jobs.
- Abstracting the physical, exposing a logical view. Users don’t have to be familiar with the underlying physical structure of the table. Features include hidden partitioning, compaction of small files, tables that can change over time, and the ability to experiment with the table layout without breaking the table.
- Rich schema evolution support.
- All engines see changes immediately.
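
A rough Python sketch of two of the ideas above: snapshots that readers hold on to while writers commit, and per-file min/max column stats used to skip files during planning. `ToyTable`, `DataFile`, and `plan_scan` are hypothetical names for this example and do not correspond to the Iceberg API or its on-disk format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    path: str
    min_id: int  # column stats recorded when the file is written
    max_id: int


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    files: tuple[DataFile, ...]


class ToyTable:
    def __init__(self) -> None:
        self._snapshots = [Snapshot(0, ())]

    def current(self) -> Snapshot:
        return self._snapshots[-1]

    def commit(self, added=(), removed_paths=()) -> Snapshot:
        """Build a complete new snapshot, then publish it in one step."""
        base = self.current()
        kept = tuple(f for f in base.files if f.path not in removed_paths)
        new = Snapshot(base.snapshot_id + 1, kept + tuple(added))
        self._snapshots.append(new)  # readers see all of the change or none of it
        return new


def plan_scan(snapshot: Snapshot, wanted_id: int) -> list[str]:
    """Use min/max stats to skip files that cannot contain wanted_id."""
    return [f.path for f in snapshot.files if f.min_id <= wanted_id <= f.max_id]


table = ToyTable()
table.commit(added=[DataFile("f1", 1, 100), DataFile("f2", 101, 200)])

reader_view = table.current()  # a reader pins this snapshot

# A writer rewrites f2 as f3 (a file-level change, not a partition swap).
table.commit(added=[DataFile("f3", 101, 200)], removed_paths={"f2"})

print(plan_scan(reader_view, 150))      # the pinned reader still plans against f2
print(plan_scan(table.current(), 150))  # new readers plan against f3
```

Because snapshots are immutable, the reader that started before the rewrite keeps a consistent view, and in both cases `f1` is skipped without being opened.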

sukulmahadik

[Notes Part-3]
What is Iceberg?
1) Table Format Specification:

It’s a standard for how we organize the data that makes up a table.
Any engine reading/writing data from an Iceberg table should implement this open specification.

2) A set of APIs and libraries for interacting with that specification (Java, Python API)

These libraries are leveraged in other engines and tools that allow them to interact with iceberg tables.

What is Iceberg not?
1) Not a storage engine. (Storage options we could use are HDFS and object stores like S3.)

2) Not an execution engine. (Engine options we could use are Spark, Presto, Flink, etc.)

3) Not a service. We do not have to run a server of some sort. It’s just a specification for storing and reading data and metadata files.

sukulmahadik

Great start to Iceberg series. Thanks a lot !

M

18:31 Doesn't the Hive Metadata Store already allow referencing the data files making up a table, and collecting column statistics?

galeop

Great Video...can you please advise where is the metadata information of Iceberg format stored for a table ?

reclaim_kashi_vishweshwar_

Hi thanks this is really a great information to start with Apache Iceberg. But I have a question, when modern databases are already doing it with so much advance technology to prune and scan the data, why would we need to store the data in files format instead of directly loading them to a table ?

santhoshreddykesavareddy

Could you please share the slides? Thanks in advance.

jaredcheung

Please tell me, how this is different or better from Databricks Lakehouse?

adityanjsg

Nice video but a little mistake :)
Java was used to develop Map Reduce jobs. Not JavaScript.
You probably know that, you just got them mixed up.

jeanj_
