Simplify and Scale Data Engineering Pipelines with Delta Lake

Online Tech Talk with Denny Lee, Developer Advocate @ Databricks

A common data engineering pipeline architecture uses tables that correspond to different quality levels, progressively adding structure to the data: data ingestion (“Bronze” tables), transformation/feature engineering (“Silver” tables), and machine learning training or prediction (“Gold” tables). Combined, we refer to these tables as a “multi-hop” architecture. It allows data engineers to build a pipeline that begins with raw data as a “single source of truth” from which everything flows. In this session, we will show how to build a scalable data engineering data pipeline using Delta Lake.
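
As a rough sketch (not taken from the talk itself), a multi-hop pipeline in PySpark might look like the following. The paths and column names are hypothetical, and the session-configuration lines are only needed outside Databricks, where a Delta-enabled spark session is already provided:

    from pyspark.sql import SparkSession, functions as F

    # Only needed outside Databricks: enable Delta Lake on a plain Spark session
    # (requires the delta-spark package on the classpath).
    spark = (SparkSession.builder
             .appName("multi-hop-demo")
             .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
             .config("spark.sql.catalog.spark_catalog",
                     "org.apache.spark.sql.delta.catalog.DeltaCatalog")
             .getOrCreate())

    # Bronze: ingest the raw data as-is -- the "single source of truth".
    raw = spark.read.json("/data/raw/events/")                    # hypothetical source path
    raw.write.format("delta").mode("append").save("/delta/bronze/events")

    # Silver: clean and transform the bronze data.
    bronze = spark.read.format("delta").load("/delta/bronze/events")
    silver = (bronze.dropDuplicates(["event_id"])                 # hypothetical key column
                    .withColumn("event_date", F.to_date("event_ts")))
    silver.write.format("delta").mode("overwrite").save("/delta/silver/events")

    # Gold: aggregate for machine learning training or reporting.
    gold = (spark.read.format("delta").load("/delta/silver/events")
                 .groupBy("event_date").agg(F.count("*").alias("event_count")))
    gold.write.format("delta").mode("overwrite").save("/delta/gold/daily_event_counts")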

Delta Lake is an open-source storage layer that brings reliability to data lakes. Delta Lake offers ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. It runs on top of your existing data lake and is fully compatible with Apache Spark APIs.
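
For example, the same Delta table can be written in batch and consumed as a stream with the ordinary Spark APIs. A minimal sketch, with hypothetical paths and assuming the Delta-enabled spark session from the snippet above:

    # Batch append: each write is one atomic (ACID) commit in the Delta transaction log.
    df = spark.range(1000).withColumnRenamed("id", "value")
    df.write.format("delta").mode("append").save("/delta/events")

    # Streaming read of the same table, written continuously to another Delta table.
    query = (spark.readStream.format("delta").load("/delta/events")
                  .writeStream.format("delta")
                  .option("checkpointLocation", "/delta/_checkpoints/events_copy")
                  .start("/delta/events_copy"))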

In this session you will learn about:
- The data engineering pipeline architecture
- Data engineering pipeline scenarios
- Data engineering pipeline best practices
- How Delta Lake enhances data engineering pipelines
- The ease of adopting Delta Lake for building your data engineering pipelines

See full Getting Started with Delta Lake tutorial series here:
Comments

Wonderful demonstration and a very handy notebook.
Here are my assumptions:
1. Delta Lake keeps multiple versions of the data (like HBase).
2. Delta Lake takes care of atomicity for the user, showing only the latest version unless an earlier one is requested.
3. Delta Lake checks the schema before appending to prevent corruption of the table. This makes the developer's job easier; something similar can be achieved with manual effort, e.g. by specifying the schema explicitly instead of inferring it.
4. In the case of an update, it always rewrites the entire table or the entire partition (DataFrames are immutable).
Questions:
1. If it keeps multiple versions, is there a default limit on the number of versions?
2. Since it keeps multiple versions, is it only practical for smaller tables? For tables in the terabytes, wouldn't it be a waste of space?
3. In a relational database the data is tightly coupled with the metadata/schema, so we can only get the data from the table, not from the data files. In Hive/Spark this is different: external tables are allowed, and we can recreate a table without access to the metadata. How is this handled in Delta Lake? Since there are multiple snapshots/versions of the same table, can someone access the data without the log/metadata? In Hive/Spark, multiple tables can be created on the same data with different tools (Hive, Presto, Spark). Can other tools share the same data with Delta Lake?

KoushikPaulliveandletlive
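
The versioning and schema-enforcement behaviour assumed in the comment above can be sketched as follows, with hypothetical paths and assuming a Delta-enabled spark session (predefined in Databricks notebooks): earlier versions can be read back ("time travel"), old data files are reclaimed by VACUUM (7-day retention by default), and appends with a mismatched schema are rejected.

    # Time travel: read an earlier version of the table by version number (or timestamp).
    v0 = spark.read.format("delta").option("versionAsOf", 0).load("/delta/silver/events")

    # Old versions are not kept forever: VACUUM removes data files past the retention
    # period (default 7 days), which bounds the storage cost of versioning.
    spark.sql("VACUUM delta.`/delta/silver/events` RETAIN 168 HOURS")

    # Schema enforcement: an append whose schema does not match the table fails
    # unless schema evolution is explicitly requested (e.g. mergeSchema).
    bad = spark.range(10).withColumnRenamed("id", "unexpected_column")
    try:
        bad.write.format("delta").mode("append").save("/delta/silver/events")
    except Exception as err:
        print("Rejected by schema enforcement:", err)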

If the streaming/batch notebook you demonstrated were being run in a workflow and, let's say, 100k rows have streamed in successfully, but then an error occurs and the job fails: as I understand it, the 100k rows and all other changes that occurred in the workflow would be automatically rolled back. Is this correct?

CoopmanGreg
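
For reference, a minimal sketch of the kind of streaming write discussed in the comment above, using Spark's built-in rate source and hypothetical paths. In Delta's streaming sink each micro-batch is committed as one atomic transaction, so rows from micro-batches that have already committed remain in the table if the job later fails, and the checkpoint lets a restarted job resume where it left off.

    # Continuous stream of synthetic rows written into a Delta table.
    # Each micro-batch is one atomic commit; on failure, only the in-flight
    # (uncommitted) micro-batch is discarded, and the checkpoint enables
    # exactly-once resumption after restart.
    query = (spark.readStream.format("rate").option("rowsPerSecond", 100).load()
                  .writeStream.format("delta")
                  .option("checkpointLocation", "/delta/_checkpoints/rate_sink")
                  .start("/delta/rate_sink"))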

Great demo... very useful for learning the Delta architecture.

nithin

Could you share the steps for importing the notebook from the GitHub link into Databricks Community Edition?

nithin