Seattle Spark + AI Meetup: How Apache Spark™ 3.0 and Delta Lake Enhance Data Lake Reliability

Apache Spark™ has become the de facto open-source standard for big data processing thanks to its ease of use and performance. The open-source Delta Lake project improves Spark's data reliability with new capabilities such as ACID transactions, Schema Enforcement, and Time Travel.

Join us in this meetup to learn more about the performance improvements in Apache Spark 3.0 including Adaptive Query Execution (AQE), Dynamic Partition Pruning (DPP), and handling skewed queries!

Topics to be covered include:

* The new Adaptive Query Execution (AQE) framework within Spark 3.0 can yield query performance gains. Based on a 3TB TPC-DS benchmark, two queries had more than a 1.5x speedup, and another 37 queries had more than 1.1x speedup.
* With Dynamic Partition Pruning (DPP), we can significantly speed up performance by pruning partitions based on the joins between the fact and dimension tables common in star schema design.
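The idea behind Dynamic Partition Pruning can be sketched in plain Python. This is only a toy model, not Spark code: the `sales_partitions` and `dim_date` tables, their values, and the `join_with_pruning` helper are all hypothetical names made up for illustration. It shows the filter on the small dimension table being evaluated first, so that its surviving keys decide which fact-table partitions get scanned at all.

```python
# Toy model of dynamic partition pruning (illustrative only, not Spark code).
# All table names and values below are hypothetical.

# Fact table stored as partitions keyed by date_id; each row is (date_id, amount).
sales_partitions = {
    20200101: [(20200101, 10.0), (20200101, 5.0)],
    20200102: [(20200102, 7.5)],
    20200103: [(20200103, 3.0), (20200103, 4.0)],
}

# Small dimension table: date_id -> attributes.
dim_date = {
    20200101: {"is_holiday": True},
    20200102: {"is_holiday": False},
    20200103: {"is_holiday": False},
}


def join_with_pruning(partitions, dimension, predicate):
    """Join fact rows to dimension rows, scanning only matching partitions."""
    # Step 1: evaluate the dimension filter first (the small side of the join).
    matching_keys = {k for k, attrs in dimension.items() if predicate(attrs)}
    # Step 2: use the resulting keys to skip fact partitions that cannot match.
    scanned = [k for k in partitions if k in matching_keys]
    # Step 3: join only the rows from the surviving partitions.
    rows = [(k, amount, dimension[k]) for k in scanned for _, amount in partitions[k]]
    return rows, scanned


rows, scanned = join_with_pruning(
    sales_partitions, dim_date, lambda attrs: attrs["is_holiday"]
)
print(f"scanned {len(scanned)} of {len(sales_partitions)} partitions")
print(f"total holiday sales: {sum(amount for _, amount, _ in rows)}")
```

In Spark 3.0 itself, this behavior is governed by the `spark.sql.optimizer.dynamicPartitionPruning.enabled` setting (on by default), while AQE is switched on with `spark.sql.adaptive.enabled` (off by default in 3.0).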
Comments

Thank you for sharing these improvements in Spark 3!

datrumpet

Does this mean from Spark 3.0 with AQE turned on, there is no need to manually calculate statistics with the "ANALYZE TABLE ..." idiom?

nilanjansarkar