Why Delta Lake is BETTER than Parquet

Demo from "Degrading Performance? You Might be Suffering From the Small Files Syndrome", a presentation at Data & AI Summit 2021 (formerly Spark + AI Summit) by @AdiPolak
Whether your data pipelines handle real-time event-driven streams, near-real-time streams, or batch processing jobs, working with a massive amount of data made up of small files, specifically Parquet, will degrade your system's performance.
A small file is one that is significantly smaller than the storage block size. Yes, even object stores such as Amazon S3 and Azure Blob Storage have a minimum block size. Storing objects that are significantly smaller than it wastes space on disk, because the storage layer is optimized for fast reads and writes at that minimum block size.
To understand why this happens, you first need to understand how cloud storage works with the Apache Spark engine. In this session, you will learn about Parquet, the storage API calls, how they work together, why small files are a problem, and how you can leverage Delta Lake for a more straightforward, cleaner solution.
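The small-files overhead described above can be sketched with back-of-the-envelope arithmetic. The numbers below are illustrative assumptions, not measured API costs: each file read from an object store costs at least one request for the data plus one for the Parquet footer metadata, so the request count scales with the file count rather than the data volume.

```python
def read_requests(total_bytes: int, file_size: int, requests_per_file: int = 2) -> int:
    """Requests needed to scan a dataset split into files of `file_size` bytes.

    Assumes roughly two object-store requests per file (data + Parquet footer);
    real request counts depend on the reader and range-request behavior.
    """
    n_files = -(-total_bytes // file_size)  # ceiling division
    return n_files * requests_per_file

GB = 1024 ** 3
MB = 1024 ** 2

dataset = 100 * GB
small = read_requests(dataset, 1 * MB)        # many tiny Parquet files
compacted = read_requests(dataset, 128 * MB)  # compacted files

print(small, compacted)  # → 204800 1600
```

Compacting 1 MB files into 128 MB files cuts the request count by the compaction factor, which is the effect Delta Lake's file compaction (e.g. OPTIMIZE on Databricks) exploits.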
What is this delta lake thing?
Why Delta Lake is the Best Storage Format for Pandas Analyses
What is Delta Lake?
What is a Delta Lake? [Introduction to Delta Lake - Ep. 1]
Database vs Data Warehouse vs Data Lake | What is the Difference?
Making Apache Spark™ Better with Delta Lake
Delta Lake for Apache Spark - Why do we need Delta Lake for Spark?
Data Warehouse vs Data Lake vs Data Lakehouse | What is the Difference? (2025)
Day 13: Delta Lake & Delta Tables Explained | Databricks Hands-on with PySpark
Data Lake vs. Delta Lake (aka Data Lakehouse): Which is Right for You?
Delta Lake - Things you need to know - 1. What is it?
Building Reliable Lakehouses with Delta Lake
What Table Format Should I Choose For My Data Lake? Hudi | Iceberg | Delta Lake
Common Strategies for Improving Performance on Your Delta Lakehouse
Great Article|Apache Hudi vs Delta Lake vs Apache Iceberg - Lakehouse Feature Comparison by OneHouse
Unleashing the Power of Delta Lake Revolutionizing Data Storage in the Cloud
Tips and Tricks- Delta Lake Table in Apache Spark - Azure Data Engineering Interview Question
Building a Better Delta Lake with Talend and Databricks
Understanding Delta Lake - The Heart of the Data Lakehouse
Seattle Spark + AI Meetup: How Apache Spark™ 3.0 and Delta Lake Enhance Data Lake Reliability
Advancing Spark - Give your Delta Lake a boost with Z-Ordering
Delta Lake - EXPLAINED - Full Tutorial
What is a Delta Lake Data Strategy? AtScale Data Driven Podcast #shorts
Demystifying Delta Lake. Data Brew | Season 1 Episode 3