Apache Spark Optimization Techniques, Performance Tuning | Pepperdata

Get Spark Performance Tuning Tips from a Veteran Field Engineer

#sparkoptimization #bigdataperformancereport #pepperdata

00:00:01:15 - 00:00:08:12
Hello, this is Alex Pierce, field engineer with Pepperdata, and this is what you should know about Spark optimization.

00:00:12:15 - 00:00:46:21
Why Apache Spark? Several reasons. First of all: speed. Compared to traditional Hadoop ETL-type batch workloads, Apache Spark is approximately a hundred times faster for in-memory work and ten times faster on disk, due to the efficiency of its pipelined, distributed architecture. It is easy to use and available in many languages, including Java, Scala, R, SQL, and Python, which is now the most popular language for interfacing with Spark. Then there is its generality.
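As a rough illustration of that ease of use, here is a minimal PySpark sketch of a small batch job; it is not code from the talk, and the input file and "timestamp" column are hypothetical.

```python
# Minimal PySpark sketch (illustrative only). Assumes pyspark is installed;
# "events.json" and the "timestamp" column are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("quick-demo").getOrCreate()

# Read a JSON file into a DataFrame and aggregate it in memory.
events = spark.read.json("events.json")
daily_counts = events.groupBy(F.to_date("timestamp").alias("day")).count()
daily_counts.show()

spark.stop()
```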

00:00:46:23 - 00:01:22:20
It has libraries including access through SQL, DataFrames, MLlib for machine learning, GraphX, and Spark Streaming, and all of these libraries can be combined within a single application. There is also flexibility: it runs on many platforms, including the Hadoop YARN scheduler, Apache Mesos, Kubernetes, standalone, or in the cloud, and it provides access to many data sources: HDFS, S3, Alluxio, Cassandra, HBase, Hive, and other relational and non-relational databases.
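A hedged sketch of what "combined within a single application" can look like in practice, mixing the DataFrame API, SQL, and MLlib in one PySpark job. The S3 path, table name, and columns are assumptions for illustration only.

```python
# Sketch: Spark SQL, DataFrames, and MLlib in one application.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("combined-libs").getOrCreate()

# DataFrame API plus SQL over the same data (path is hypothetical).
df = spark.read.parquet("s3a://my-bucket/sales/")
df.createOrReplaceTempView("sales")
recent = spark.sql("SELECT price, quantity, revenue FROM sales WHERE year >= 2019")

# MLlib on the SQL result, all inside the same job.
assembler = VectorAssembler(inputCols=["price", "quantity"], outputCol="features")
model = LinearRegression(featuresCol="features", labelCol="revenue") \
    .fit(assembler.transform(recent))
print(model.coefficients)

spark.stop()
```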

00:01:23:11 - 00:02:41:07
However, there are some challenges with Spark. One of the things Pepperdata has observed is that Spark jobs tend to fail more often than other jobs. As you can see in this chart, taken from our Big Data Performance Report, Spark is approximately four to seven times more likely to fail than other applications we have observed within our customer base. So this is about Spark optimization. How do you do it? One of the most important parts is observability: in order to understand what needs to be optimized, you need to understand where the opportunities for optimization are and what needs to be changed. For example, look at memory utilization. A tool that can tell you exactly how your memory could be optimized might show that you need more memory because you are seeing garbage collection, or that you need less memory because you are asking for more than you actually utilize, thereby causing problems and queuing in multi-tenant environments. Spark is also sensitive to data skew. In a highly distributed, parallelized application such as Spark, data skew can be very painful, causing parts of your application to run much longer than they should while other compute resources sit idle in the meantime.
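The memory sizing and skew points above can be acted on directly from Spark configuration and a quick key-distribution check. A minimal sketch, assuming PySpark 3.x; the memory values, path, and "user_id" key column are illustrative assumptions, not recommendations from the talk.

```python
# Sketch: right-sizing executor memory and spotting data skew.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("memory-and-skew-check")
         # Ask only for what the job actually uses: over-asking queues other
         # tenants, under-asking shows up as heavy garbage collection.
         .config("spark.executor.memory", "4g")
         .config("spark.executor.memoryOverhead", "512m")
         .getOrCreate())

df = spark.read.parquet("s3a://my-bucket/events/")  # hypothetical path

# A simple skew check: count rows per join key and inspect the heaviest keys.
key_counts = df.groupBy("user_id").count()
key_counts.orderBy(F.desc("count")).show(10)

spark.stop()
```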

00:02:41:12 - 00:03:22:26
So being able to observe when there is data skew, and to take recommendations on what to do about it, is very important. So how do you measure success in optimizing your Spark workload? Observability is the key. You need to be able to say, "Hey, my applications are running without failures, my SLAs are being met consistently, and my chosen observability tool no longer indicates problems with memory utilization, data skew, or other areas where my application may work but could work better, and could be a better tenant in the multi-tenant environment that most of us work in."
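One hedged example of settings that support that kind of observability-driven workflow: persisting Spark event logs so a monitoring or history tool can analyze the run, and enabling adaptive query execution (Spark 3.x) so skewed join partitions are split at runtime. The event log directory is a hypothetical path.

```python
# Sketch: observability-friendly configuration plus built-in skew mitigation.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("observable-job")
         # Persist event logs for history/monitoring tools to analyze.
         .config("spark.eventLog.enabled", "true")
         .config("spark.eventLog.dir", "hdfs:///spark-logs")  # hypothetical
         # Adaptive query execution: coalesce small partitions and split
         # skewed join partitions at runtime.
         .config("spark.sql.adaptive.enabled", "true")
         .config("spark.sql.adaptive.skewJoin.enabled", "true")
         .getOrCreate())
```

The same settings can also be passed on spark-submit with --conf instead of being set in code.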

00:03:23:21 - 00:03:48:21

Pepperdata Big Data Performance Report 2020

/////////////////////////////////////////////////////////////////////////////////////////

Connect with us: