Experience Of Optimizing Spark SQL When Migrating from MPP Database - Yucai Yu and Yuming Wang eBay
eBay is migrating its 30 PB MPP database to Apache Spark. More than 15,000 ETL jobs now run each day on a Spark cluster of 1,000+ nodes, processing PB-scale data, and these numbers are increasing quickly. Optimization is critical during the migration: cluster resources are usually under heavy pressure, and a well-optimized system can fit more jobs into the limited resources.
In this session, we will talk about the top performance challenges we encountered and how we addressed them. Every month, more batch jobs were moved to Spark, which put heavy pressure on cluster resources, especially memory capacity. When we dug into the top 10 memory-intensive queries, we found that improper Spark configuration, such as executor memory and shuffle partition settings, led to serious waste of memory. We will share a unified configuration solution based on adaptive execution, a joint effort by Intel and eBay, which helps us save half of the memory and a great deal of manual tuning effort.
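To make the shuffle-partition point concrete, here is a minimal sketch of the general idea behind runtime partition coalescing: adjacent small shuffle partitions are merged until a target size is reached, so one conservative partition count can serve many jobs. The function name, the greedy rule, and the sizes are illustrative assumptions, not the implementation described in the talk.

```python
from typing import List


def coalesce_partitions(sizes: List[int], target: int) -> List[List[int]]:
    """Greedily pack adjacent shuffle partitions into groups near `target` bytes.

    Hypothetical helper for illustration only; returns groups of partition indices.
    """
    groups: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for idx, size in enumerate(sizes):
        # Close the current group if adding this partition would exceed the target.
        if current and current_size + size > target:
            groups.append(current)
            current, current_size = [], 0
        current.append(idx)
        current_size += size
    if current:
        groups.append(current)
    return groups


# Six post-shuffle partitions (sizes in MB) packed toward a 64 MB target.
groups = coalesce_partitions([10, 20, 30, 100, 5, 5], 64)
print(groups)
```

A partition that is already larger than the target (the 100 MB one here) simply ends up in a group of its own; this sketch only merges and never splits.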
Next, we have some very big historical tables. To process them efficiently, we need to both bucket and partition them, but this often produces a huge number of small files when the bucket number is large. At eBay, we use both Spark SQL's bucket feature and Parquet's min-max index to implement an indexed bucket table, which performs very well. Some important cases get a 2.5x improvement.
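The two ideas in this paragraph can be sketched with a little arithmetic. The first function shows why combining partitions and buckets multiplies the file count, assuming one file per (partition, bucket) pair. The second shows how per-row-group min-max statistics let a reader skip data that cannot contain a key. Both helpers and all numbers are hypothetical illustrations; the internals of the indexed bucket table are not given here.

```python
from typing import List, Tuple


def file_count(num_partitions: int, num_buckets: int) -> int:
    # Assumes exactly one file per (partition, bucket) pair.
    return num_partitions * num_buckets


def row_groups_to_read(stats: List[Tuple[int, int]], key: int) -> List[int]:
    # Keep only row groups whose [min, max] range could contain the key.
    return [i for i, (lo, hi) in enumerate(stats) if lo <= key <= hi]


# One year of daily partitions with 10,000 buckets each.
print(file_count(365, 10_000))

# With data sorted on the key, row-group ranges barely overlap, so most are skipped.
print(row_groups_to_read([(1, 100), (101, 200), (201, 300)], 150))
```

Min-max skipping is most effective when the data is sorted on the lookup column, because the row-group ranges then overlap very little.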
Finally, data skew is very common in a large data warehouse, and it causes some weird OOMs. We will root-cause them and show an improved join algorithm for generic skewed join handling based on runtime transformation.
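As a rough illustration of why splitting helps with skew, the sketch below joins two small in-memory tables and processes the rows of each key in fixed-size chunks, each chunk joined against all matching rows from the other side. In a real engine each chunk would be a separate task, so one hot key no longer has to fit in a single task's memory. This is a generic technique under assumed names (`chunked_join`, `chunk_size`), not the algorithm presented in the talk.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Tuple[str, str]  # (join key, value)


def chunked_join(left: List[Row], right: List[Row], chunk_size: int) -> List[Tuple[str, str, str]]:
    """Inner join on key, processing left rows of each key in chunks."""
    right_by_key: Dict[str, List[str]] = defaultdict(list)
    for key, value in right:
        right_by_key[key].append(value)

    left_by_key: Dict[str, List[str]] = defaultdict(list)
    for key, value in left:
        left_by_key[key].append(value)

    out: List[Tuple[str, str, str]] = []
    for key, left_values in left_by_key.items():
        right_values = right_by_key.get(key, [])
        # Each chunk stands in for an independent task over part of a hot key.
        for start in range(0, len(left_values), chunk_size):
            chunk = left_values[start:start + chunk_size]
            out.extend((key, lv, rv) for lv in chunk for rv in right_values)
    return out


left = [("a", f"l{i}") for i in range(5)] + [("b", "lb")]
right = [("a", "r1"), ("a", "r2"), ("c", "rc")]
print(len(chunked_join(left, right, chunk_size=2)))
```

The result is the same as a plain inner join; only the unit of work changes.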
About: Databricks provides a unified data analytics platform, powered by Apache Spark™, that accelerates innovation by unifying data science, engineering and business.