Optimizing Incremental Ingestion in the Context of a Lakehouse

Incremental data ingestion is often trickier than one would assume, particularly when it comes to maintaining data consistency: for example, the specific challenges differ depending on whether the data is ingested in a streaming or a batch fashion. In this session we share the real-life challenges we encountered when setting up an incremental ingestion pipeline in the context of a Lakehouse architecture.

In this session we outline how we used recently introduced Databricks features, such as Auto Loader and Change Data Feed, alongside more mature features, such as Spark Structured Streaming and its Trigger Once functionality. These features allowed us to transform batch processes into a “streaming” setup without needing a cluster that runs continuously. This setup, which we are keen to share with the community, does not require reloading large amounts of data and is therefore computationally, and consequently economically, cheaper.
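
To make the combination above concrete, here is a minimal sketch of how Auto Loader, Trigger Once and Change Data Feed can fit together in PySpark. It is not the speakers' actual pipeline: the function names, the JSON file format, and all paths and table names passed in are assumptions chosen for illustration. It also assumes a Databricks runtime (the `cloudFiles` source is only available there) and a Delta table with `delta.enableChangeDataFeed` set to `true`.

```python
def ingest_new_files(spark, source_path, checkpoint_path, target_table):
    """Process only the files that arrived since the last run, then stop.

    `spark` is an existing SparkSession on Databricks; all other
    arguments are hypothetical locations supplied by the caller.
    """
    raw = (
        spark.readStream.format("cloudFiles")            # Auto Loader source
        .option("cloudFiles.format", "json")             # assumed input format
        .option("cloudFiles.schemaLocation", checkpoint_path)
        .load(source_path)
    )
    return (
        raw.writeStream
        .option("checkpointLocation", checkpoint_path)   # remembers processed files
        .trigger(once=True)                              # one pass, then the query ends
        .toTable(target_table)
    )


def read_changes(spark, table_name, starting_version):
    """Batch-read the row-level changes recorded by Change Data Feed."""
    return (
        spark.read.format("delta")
        .option("readChangeFeed", "true")
        .option("startingVersion", starting_version)
        .table(table_name)
    )
```

Because the checkpoint records which files were already ingested, a scheduled job can call `ingest_new_files` periodically and shut the cluster down in between. Newer Spark versions also accept `trigger(availableNow=True)` for the same run-and-stop pattern.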

In our presentation we dive deeper into each aspect of the setup, with extra focus on some essential Auto Loader functionality, such as schema inference, recovery mechanisms and file discovery modes.
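
As a rough illustration of the Auto Loader functionality listed above, the sketch below collects the relevant reader options in one place. The helper name `auto_loader_options` and its parameters are hypothetical, and the chosen values are just one plausible configuration, not the one used in the talk.

```python
from typing import Dict


def auto_loader_options(
    file_format: str,
    schema_location: str,
    use_notifications: bool = False,
) -> Dict[str, str]:
    """Build a dict of Auto Loader options (hypothetical helper)."""
    return {
        "cloudFiles.format": file_format,
        # Schema inference: the inferred schema is stored and tracked here.
        "cloudFiles.schemaLocation": schema_location,
        "cloudFiles.inferColumnTypes": "true",
        # Recovery: "rescue" keeps the schema fixed and routes unexpected
        # fields into the rescued data column instead of dropping them.
        "cloudFiles.schemaEvolutionMode": "rescue",
        # File discovery: directory listing (false) or file notifications (true).
        "cloudFiles.useNotifications": "true" if use_notifications else "false",
    }


options = auto_loader_options("json", "/tmp/example/_schema")  # assumed path
# On Databricks, the dict could be passed to the reader like this:
# spark.readStream.format("cloudFiles").options(**options).load("/tmp/example/in")
```

Directory listing needs no extra cloud setup, whereas file notification mode relies on cloud-side event services and tends to pay off when the input directory holds very many files.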
