Optimizing Apache Spark SQL at LinkedIn

Presenter: Fangshi Li

Presented at the Bay Area Apache Spark Meetup hosted at LinkedIn in August 2019.

Abstract: Improving the usability and computing efficiency of Spark SQL is one of the missions of LinkedIn’s Spark team. In this talk, we present the Spark SQL ecosystem and roadmap at LinkedIn and introduce the main projects we are working on, such as:
* Improving Dataset performance with automated column pruning
* Bringing an efficient 2D join algorithm to Spark SQL
* Fixing join skewness with adaptive execution
* Enhancing the cost-based optimizer with a history-based learning approach

Bio: Fangshi Li is a software engineer at LinkedIn. He has worked on Spark core infrastructure, user libraries, AI solutions, and Spark SQL engine optimizations. He was one of the original developers of Dr. Elephant, the performance tuning tool for Hadoop/Spark.

LinkedIn Engineering

Related videos

Comments

The 2D partitioned join looks really promising, especially for star-schema tables where a fact table is joined with multiple dimensions. Is there an open-source sample that we can use?

JoHeN

Optimizing Apache Spark SQL Joins: Spark Summit East talk by Vida Ha

Apache Spark Core—Deep Dive—Proper Optimization Daniel Tomes Databricks

Exploring Spark SQL Optimizations Part-1 | Spark SQL | Apache Spark | Optimizations

95% reduction in Apache Spark processing time with correct usage of repartition() function

Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad Carlile

Adaptive Query Execution: Speeding Up Spark SQL at Runtime

How We Optimize Spark SQL Jobs With parallel and sync IO

Apache Spark for Machine Learning on Large Data Sets • Juliet Hougland • YOW! 2017

From Query Plan to Performance: Supercharging your Apache Spark Queries using the Spark UI SQL Tab

Optimizing Apache Spark UDFs

Apache Spark Joins for Optimization | PySpark Tutorial

SQL Performance Improvements at a Glance in Apache Spark 3.0

Secret To Optimizing SQL Queries - Understand The SQL Execution Order

Spark performance optimization Part1 | How to do performance optimization in spark

Deep Dive into Query Execution in Spark SQL 2 3 with Jacek Laskowski

Understanding Query Plans and Spark UIs - Xiao Li Databricks

Optimize read from Relational Databases using Spark

Fine Tuning and Enhancing Performance of Apache Spark Jobs

Understanding the Working of Apache Spark's Catalyst Optimizer in Improving the Query Performan...

optimization in spark

Spark Basics | Partitions

Apache Spark Core – Practical Optimization Daniel Tomes (Databricks)

Cost Based Optimizer in Apache Spark 2 2 - Ron Hu & Sameer Agarwal