An Adaptive Execution Engine For Apache Spark SQL - Carson Wang

"Catalyst is an excellent optimizer in Spark SQL, providing an open interface for rule-based optimization at the planning stage. However, this static, rule-based optimization cannot take the runtime data distribution into account. A feature called Adaptive Execution, introduced in Spark 2.0 to address this gap, is still at an early stage. We enhanced the existing Adaptive Execution feature, focusing on adjusting the execution plan at runtime according to each stage's intermediate output: setting partition numbers for joins and aggregations, avoiding unnecessary data shuffling and disk I/O, handling data skew, and even optimizing the join order as a cost-based optimizer (CBO) would. In our benchmark experiments, this feature saved huge manual effort in tuning parameters such as the shuffle partition number, which is error-prone and misleading. In this talk, we will present the new adaptive execution framework, task scheduling, the failover retry mechanism, runtime plan switching, and more. Finally, we will share our experience of benchmarking TPCx-BB at 100-300 TB scale on a bare-metal Spark cluster of hundreds of nodes.

Session hashtag: EUdev4"
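The runtime plan adjustments described in the abstract are driven by configuration. A minimal sketch of the relevant settings, assuming the property names from Spark 2.x's built-in adaptive execution (the enhanced version presented in this talk may expose additional or differently named keys; verify against your build):

```
# spark-defaults.conf — sketch, property names as in Spark 2.x built-in adaptive execution
# Turn on adaptive execution so post-shuffle partition counts are decided at runtime
spark.sql.adaptive.enabled                                 true
# Target bytes per post-shuffle partition; small partitions are coalesced toward this size
spark.sql.adaptive.shuffle.targetPostShuffleInputSize      67108864
# Lower bound on the number of post-shuffle partitions after coalescing
spark.sql.adaptive.minNumPostShufflePartitions             1
```

With these set, a fixed `spark.sql.shuffle.partitions` value no longer has to be hand-tuned per query, which is the manual, error-prone step the talk aims to eliminate.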

About: Databricks provides a unified data analytics platform, powered by Apache Spark™, that accelerates innovation by unifying data science, engineering and business.

Comments

Great contribution, thank you! This works well together with Dynamic Resource Allocation in Spark 2.4.4. For smaller files, when the default parallelism is used, 200 tasks are created in a join operation and DRA maxes out. This results in low efficiency, since each executor processes just a tiny partition. Enabling Adaptive Execution keeps resource usage much more reasonable. Also, the output gets written to fewer, larger files, which is good for HDFS.

DzmitryMakatun

What's the way to activate adaptive shuffle partitioning in Spark 2.2?

ebertoburgos
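Regarding the question above: stock Spark 2.2 already ships a limited form of adaptive execution that can coalesce shuffle partitions at runtime. A sketch of activating it at submit time, assuming the standard 2.x property names (the application JAR name here is a placeholder):

```
spark-submit \
  --conf spark.sql.adaptive.enabled=true \
  --conf spark.sql.adaptive.shuffle.targetPostShuffleInputSize=134217728 \
  your-app.jar   # placeholder for your application
```

Note that the richer features discussed in the talk (skew handling, runtime join-order changes) come from the enhanced framework rather than stock 2.2, so those require the patched build described by the speakers.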