Optimizing Apache Spark SQL at LinkedIn

Presenter: Fangshi Li

Presented at the Bay Area Apache Spark Meetup hosted at LinkedIn in August 2019.

Abstract: Improving Spark SQL usability and computing efficiency is one of the missions of LinkedIn's Spark team. In this talk, we will present the Spark SQL ecosystem and roadmap at LinkedIn, and introduce the highlighted projects we are working on, such as:
* Improving Dataset performance with automated column pruning
* Bringing an efficient 2d join algorithm to Spark SQL
* Fixing join skewness with adaptive execution
* Enhancing the cost-optimizer with a history-based learning approach

Bio: Fangshi Li is a software engineer at LinkedIn. He has been working on Spark core infrastructure, user libraries, AI solutions, and Spark SQL engine optimizations. He was one of the original developers of Dr. Elephant, the performance tuning tool for Hadoop/Spark.
Comments

The 2d partitioned join looks really promising, especially for star-schema tables where a fact table can be joined with multiple dimensions. Is there an open-source sample that we can use?

JoHeN