Experience of Optimizing Spark SQL When Migrating from an MPP Database - Yucai Yu and Yuming Wang, eBay

eBay is migrating its 30 PB MPP database to Apache Spark. Today, 15,000+ ETL jobs run each day on a Spark cluster of 1,000+ nodes, processing petabytes of data, and these numbers are growing quickly. Optimization is critical during the migration: cluster resources are usually under heavy pressure, and a well-optimized system can fit more jobs into the same limited capacity.

In this session, we will talk about the top performance challenges we encountered and how we addressed them. Every month, more batch jobs were moved to Spark, which put significant pressure on cluster resources, especially memory capacity. When we dug into the top 10 memory-intensive queries, we found that improper Spark configuration, such as executor memory and shuffle partition settings, led to serious memory waste. We will share a unified configuration solution based on adaptive execution, a joint effort by Intel and eBay, which saved us half of the memory as well as a huge amount of manual tuning effort.
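As context for the configuration discussion, here is a minimal sketch of turning on Spark's adaptive query execution so that shuffle partition counts are decided at runtime rather than hand-tuned per job. The property names are the standard Spark 3.x AQE settings, shown only as an illustration; the talk's joint Intel/eBay solution is not reproduced here, and the app name and size value are placeholders.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("aqe-config-sketch")
  // Enable adaptive query execution so plans are re-optimized at runtime.
  .config("spark.sql.adaptive.enabled", "true")
  // Let AQE coalesce small shuffle partitions instead of fixing
  // spark.sql.shuffle.partitions to one value for every query.
  .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
  // Advisory target size per shuffle partition after coalescing.
  .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "64m")
  .getOrCreate()
```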

Next, we have some very large historical tables; to process them efficiently, we need to both bucket and partition them. But this approach often produces a huge number of small files when the bucket count is large. At eBay, we combine Spark SQL's bucketing feature with Parquet's min-max index to implement an indexed bucket table, which performs very well: some important cases see a 2.5x improvement.
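For readers unfamiliar with the combination, here is a minimal sketch of writing a table that is both partitioned and bucketed, assuming a hypothetical `orders` DataFrame with `dt` and `buyer_id` columns and a hypothetical input path. Partitioning gives coarse pruning by date, bucketing prunes by the join key, and Parquet's built-in min-max statistics let Spark skip row groups within each file.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("bucket-sketch").getOrCreate()
val orders = spark.read.parquet("/path/to/orders")  // hypothetical input path

orders.write
  .partitionBy("dt")            // date partitions for coarse pruning
  .bucketBy(256, "buyer_id")    // fixed bucket count on the join key
  .sortBy("buyer_id")           // sorted buckets tighten min-max ranges
  .format("parquet")
  .saveAsTable("orders_bucketed")  // bucketBy requires saveAsTable
```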

Finally, data skew is very common in a large data warehouse, and some weird OOMs are caused by it. We will root-cause them and show an improved join algorithm for generic skewed-join handling based on runtime transformation.
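To make the problem concrete, below is a sketch of salting, one common generic workaround for skewed joins; it is shown only for illustration and is not the runtime-transformation approach the session presents. `facts` and `dims` are hypothetical DataFrames joined on `key`, where a few hot keys on the large side cause the skew.

```scala
import org.apache.spark.sql.functions._

val saltBuckets = 16  // arbitrary fan-out; tune to the degree of skew

// Spread each key's rows across `saltBuckets` random salt values on the big side.
val saltedFacts = facts.withColumn("salt", (rand() * saltBuckets).cast("int"))

// Replicate every dimension row once per salt value so all pairs still match.
val saltedDims = dims.withColumn(
  "salt", explode(array((0 until saltBuckets).map(lit): _*)))

// Joining on (key, salt) splits each hot key into smaller, evenly sized tasks.
val joined = saltedFacts.join(saltedDims, Seq("key", "salt")).drop("salt")
```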

About: Databricks provides a unified data analytics platform, powered by Apache Spark™, that accelerates innovation by unifying data science, engineering and business.
