Apache Spark Optimization Techniques, Performance Tuning | Pepperdata

Get Spark Performance Tuning Tips from a Veteran Field Engineer

#sparkoptimization #bigdataperformancereport #pepperdata

00:00:01:15 - 00:00:08:12
Hello, this is Alex Pierce, field engineer with Pepperdata, and this is what you should know about Spark optimization.

00:00:12:15 - 00:00:46:21
Why Apache Spark? Several reasons. First of all: speed. Compared to traditional Hadoop ETL-type batch workloads, Apache Spark is approximately a hundred times faster for in-memory work and ten times faster on disk, due to the efficiency of its pipelined, distributed architecture. It is easy to use and available in many languages, including Java, Scala, R, SQL, and Python, which is now the most popular language for interfacing with Spark. Then there is its generality.
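As a rough illustration of that ease of use, here is a minimal PySpark sketch of a small batch job; it is not code from the talk, and the input file and "timestamp" column are hypothetical.

```python
# Minimal PySpark sketch (illustrative only). Assumes pyspark is installed;
# "events.json" and the "timestamp" column are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("quick-demo").getOrCreate()

# Read a JSON file into a DataFrame and aggregate it in memory.
events = spark.read.json("events.json")
daily_counts = events.groupBy(F.to_date("timestamp").alias("day")).count()
daily_counts.show()

spark.stop()
```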

00:00:46:23 - 00:01:22:20
It has libraries including access through SQL, DataFrames, MLlib for machine learning, GraphX, and Spark Streaming, and all of these libraries can be combined within a single application. There is also flexibility: it runs on many platforms, including the Hadoop YARN scheduler, Apache Mesos, Kubernetes, standalone, or in the cloud, and it provides access to many data sources: HDFS, S3, Alluxio, Cassandra, HBase, Hive, and other relational and non-relational databases.
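A hedged sketch of what "combined within a single application" can look like in practice, mixing the DataFrame API, SQL, and MLlib in one PySpark job. The S3 path, table name, and columns are assumptions for illustration only.

```python
# Sketch: Spark SQL, DataFrames, and MLlib in one application.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("combined-libs").getOrCreate()

# DataFrame API plus SQL over the same data (path is hypothetical).
df = spark.read.parquet("s3a://my-bucket/sales/")
df.createOrReplaceTempView("sales")
recent = spark.sql("SELECT price, quantity, revenue FROM sales WHERE year >= 2019")

# MLlib on the SQL result, all inside the same job.
assembler = VectorAssembler(inputCols=["price", "quantity"], outputCol="features")
model = LinearRegression(featuresCol="features", labelCol="revenue") \
    .fit(assembler.transform(recent))
print(model.coefficients)

spark.stop()
```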

00:01:23:11 - 00:02:41:07
However, there are some challenges with Spark. One of the things Pepperdata has observed is that Spark jobs tend to fail more often than other jobs. As you can see in this chart, taken from our Big Data Performance Report, Spark is approximately four to seven times more likely to fail than other applications we have observed within our customer base. So this is about Spark optimization. How do you do it? One of the most important parts is observability: in order to understand what needs to be optimized, you need to understand where the opportunities for optimization are and what needs to be changed. For example, look at memory utilization. A tool that can tell you exactly how your memory could be optimized might show that you need more memory because you are seeing garbage collection, or that you need less memory because you are asking for more than you actually utilize, thereby causing problems and queuing in multi-tenant environments. Spark is also sensitive to data skew. In a highly distributed, parallelized application such as Spark, data skew can be very painful, causing parts of your application to run much longer than they should while other compute resources sit idle in the meantime.
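The memory sizing and skew points above can be acted on directly from Spark configuration and a quick key-distribution check. A minimal sketch, assuming PySpark 3.x; the memory values, path, and "user_id" key column are illustrative assumptions, not recommendations from the talk.

```python
# Sketch: right-sizing executor memory and spotting data skew.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("memory-and-skew-check")
         # Ask only for what the job actually uses: over-asking queues other
         # tenants, under-asking shows up as heavy garbage collection.
         .config("spark.executor.memory", "4g")
         .config("spark.executor.memoryOverhead", "512m")
         .getOrCreate())

df = spark.read.parquet("s3a://my-bucket/events/")  # hypothetical path

# A simple skew check: count rows per join key and inspect the heaviest keys.
key_counts = df.groupBy("user_id").count()
key_counts.orderBy(F.desc("count")).show(10)

spark.stop()
```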

00:02:41:12 - 00:03:22:26
So being able to observe when there is data skew, and to take recommendations on what to do about it, is very important. So how do you measure success in optimizing your Spark workload? Observability is the key. You need to be able to say, "Hey, my applications are running without failures, my SLAs are being met consistently, and my chosen observability tool no longer indicates problems with memory utilization, data skew, or other areas where my application may work but could work better, and could be a better tenant in the multi-tenant environment that most of us work in."
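One hedged example of settings that support that kind of observability-driven workflow: persisting Spark event logs so a monitoring or history tool can analyze the run, and enabling adaptive query execution (Spark 3.x) so skewed join partitions are split at runtime. The event log directory is a hypothetical path.

```python
# Sketch: observability-friendly configuration plus built-in skew mitigation.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("observable-job")
         # Persist event logs for history/monitoring tools to analyze.
         .config("spark.eventLog.enabled", "true")
         .config("spark.eventLog.dir", "hdfs:///spark-logs")  # hypothetical
         # Adaptive query execution: coalesce small partitions and split
         # skewed join partitions at runtime.
         .config("spark.sql.adaptive.enabled", "true")
         .config("spark.sql.adaptive.skewJoin.enabled", "true")
         .getOrCreate())
```

The same settings can also be passed on spark-submit with --conf instead of being set in code.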

00:03:23:21 - 00:03:48:21

Pepperdata Big Data Performance Report 2020

/////////////////////////////////////////////////////////////////////////////////////////

Connect with us: