Spark SQL Join Improvement at Facebook

Показать описание

Join is one of most important and critical SQL operation in most data warehouses. This is essential when we want to get insights from multiple input datasets. Over the last year, we’ve added a series of join optimizations internally at Facebook, and we started to contribute back to upstream open source recently.

Specifically,
(1).shuffled hash join improvement (SPARK-32461): add code generation to improve efficiency, add sort-based fallback to improve reliability, add full outer join support, shortcut for empty build side, etc.
(2).join with bloom filter: for shuffled hash join and sort merge join, optionally adding a bloom filter for join keys on large table side to pre-filter rows for saving shuffle and sort cost.
(3).stream-stream join (SPARK-32862 and SPARK-32863): support left semi join and full outer join. In this talk, we’ll take a deep dive into the internals of above join optimizations, and summarize the lessons learned and future planned work for further improvements.

About:
Databricks provides a unified data analytics platform, powered by Apache Spark™, that accelerates innovation by unifying data science, engineering and business.

Connect with us: