How to Performance-Tune Apache Spark Applications in Large Clusters

Omkar Joshi offers an overview of how performance challenges were addressed at Uber while rolling out Marmaray, its newly built (and open-sourced) flagship ingestion system, which ingests data from sources such as Kafka, MySQL, Cassandra, and Hadoop. The system has been running in production for over a year, with more ingestion pipelines onboarded on top of it. Omkar and his team made heavy use of jvm-profiler during their analysis to gain valuable insights. Built on the Spark framework, the system is designed to ingest billions of Kafka messages per topic from thousands of topics every 30 minutes, and the pipeline handles data on the order of hundreds of terabytes; at this scale, every byte and millisecond saved counts. Omkar details how to tackle such problems and shares insights into the optimizations already made in production.

Some key highlights are:

- how to identify the bottlenecks in your Spark applications, and whether or not to cache your Spark DAG to avoid rereading your input data (see the caching sketch after this list)
- how to use accumulators effectively to avoid unnecessary Spark actions (sketch below)
- how to inspect your heap and non-heap memory usage across hundreds of executors (sketch below)
- how to change the layout of your data to save long-term storage cost
- how to use serializers and compression effectively to save network and disk traffic (sketch below)
- how to reduce the amortized cost of your application by multiplexing your jobs
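
On the caching point, a minimal sketch of avoiding a second read of the input, using Spark's Scala API; the Parquet path and the `status` column are hypothetical stand-ins for any expensive source:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("cache-sketch").getOrCreate()
import spark.implicits._

// Hypothetical input; stands in for any expensive source (Kafka dump, Hadoop, ...).
val input = spark.read.parquet("/data/events")

// Persist once so the two actions below do not each re-read and re-parse the source.
input.persist(StorageLevel.MEMORY_AND_DISK)

val total  = input.count()                                // first action materializes the cache
val errors = input.filter($"status" === "ERROR").count()  // second action is served from the cache

input.unpersist()  // release executor memory once the DAG no longer needs it
```

Whether caching pays off depends on how expensive the re-read is versus the memory the cache occupies; MEMORY_AND_DISK spills partitions that do not fit rather than recomputing them.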
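On accumulators, a sketch of collecting a side metric during an existing pass instead of paying for a separate filter/count action; the paths and the three-field record format are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("accumulator-sketch").getOrCreate()
val sc = spark.sparkContext

val badRecords = sc.longAccumulator("badRecords")

// Count malformed lines as a side effect of the single pass that writes the
// data, instead of running a second filter(...).count() action over the input.
val cleaned = sc.textFile("/data/raw").filter { line =>
  val ok = line.split(",").length == 3
  if (!ok) badRecords.add(1L)
  ok
}

cleaned.saveAsTextFile("/data/clean")
println(s"Malformed records: ${badRecords.value}")  // read only after the action has run
```

One caveat: accumulator updates inside transformations are re-applied if a task is retried, so treat such counts as approximate (updates inside actions are applied exactly once).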
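For the memory-inspection point, the profiling mentioned in the talk was done with Uber's open-source jvm-profiler, which is attached to each executor as a Java agent. A hedged sketch of wiring it up via Spark conf; the jar location, version, and reporter choice are assumptions based on the project's README:

```scala
import org.apache.spark.sql.SparkSession

// jvm-profiler runs inside each executor JVM and reports heap, non-heap,
// and CPU metrics; here it prints to stdout every 5 seconds (assumed setup).
val spark = SparkSession.builder()
  .appName("profiler-sketch")
  .config("spark.jars", "hdfs:///lib/jvm-profiler-1.0.0.jar")  // assumed jar location/version
  .config("spark.executor.extraJavaOptions",
    "-javaagent:jvm-profiler-1.0.0.jar=" +
    "reporter=com.uber.profiling.reporters.ConsoleOutputReporter,metricInterval=5000")
  .getOrCreate()
```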
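And on serialization, a sketch of the standard Spark settings for Kryo and I/O compression; the registered class name is hypothetical, and the codec choice is illustrative:

```scala
import org.apache.spark.sql.SparkSession

// Kryo output is typically smaller and faster to produce than Java
// serialization, which directly cuts shuffle network and disk traffic.
val spark = SparkSession.builder()
  .appName("serializer-sketch")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.kryo.registrationRequired", "true")            // fail fast on unregistered classes
  .config("spark.kryo.classesToRegister", "com.example.Event")  // hypothetical record class
  .config("spark.io.compression.codec", "zstd")                 // lz4 is the default; zstd trades CPU for size
  .getOrCreate()
```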

The team applied a range of techniques to reduce the memory footprint, runtime, and on-disk usage of the running applications, achieving significant savings (~10–40%) in each.

About:
Databricks provides a unified data analytics platform, powered by Apache Spark™, that accelerates innovation by unifying data science, engineering and business.

Comments:

This is an enterprise-level explanation which is highly useful. Great work, Omkar!!

catchritesh

Probably the best talk so far, citing real-life issues faced and their solutions.

oldschoolwreak

Loved this talk. Just one comment at 8:36 (referring to the example of 100 rows): Parquet is not purely columnar. It is actually hybrid, where the rows are divided into RowGroups and each RowGroup is stored in a columnar format. This hybrid format helps with row reconstruction. Also, with Apache Delta becoming more mainstream (it also uses Parquet, but with a commit log), there is little reason to use pure Parquet :)

thomsondcruz
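
As the comment above describes, Parquet groups rows into RowGroups and stores each group column by column. A minimal sketch of controlling the row group size from Spark, with toy data and a hypothetical output path:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parquet-rowgroup-sketch").getOrCreate()
import spark.implicits._

// Row group size is controlled by parquet.block.size on the Hadoop configuration;
// larger groups favor column scans, smaller ones favor row reconstruction.
spark.sparkContext.hadoopConfiguration.setInt("parquet.block.size", 128 * 1024 * 1024)

val df = Seq((1, "a"), (2, "b")).toDF("id", "value")            // toy data
df.write.option("compression", "snappy").parquet("/tmp/events") // hypothetical path
```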

Very useful ideas from real-life scenarios.

BuvanAlmighty

@omkar thanks for your talk, and just to let you know, we are facing the YARN memory overhead issue with Spark 2.4 as well when doing Spark SQL joins.

MrTigerman
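
For the YARN overhead issue mentioned above, a common first step is raising spark.executor.memoryOverhead, since native and off-heap allocations (shuffle buffers, Kryo, Parquet) live outside the JVM heap. A sketch with illustrative values only:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("overhead-sketch")
  .config("spark.executor.memory", "8g")          // JVM heap per executor (illustrative)
  .config("spark.executor.memoryOverhead", "2g")  // default is max(384m, 10% of executor memory)
  .getOrCreate()
```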

I am new to Spark. Can anyone please tell me exactly which operations form the 5 stages in the left diagram and the 2 stages in the right diagram?

shubhamshingi