Apache Spark Internals: RDDs, Pipelining, Narrow & Wide Dependencies

In this video we'll understand Apache Spark's most fundamental abstraction: RDDs. Understanding this is essential for writing performant Spark code and for comprehending what's going on during execution.

00:00 Introduction
01:11 Traits of RDDs
04:34 Code Interface of RDDs
06:44 Understanding transformations
08:20 The DAG - directed acyclic graph
11:38 Types of dependencies
15:26 Optimization: Pipelining
17:47 Implementation of transformations
19:58 Summary
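As a rough illustration of the topics in the chapters above (not code from the video; the session settings and sample values are hypothetical), the sketch below chains two narrow transformations, which Spark pipelines into a single stage, and one wide transformation, which forces a shuffle and a new stage:

import org.apache.spark.sql.SparkSession

object NarrowWideSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical local session, just so a SparkContext is available.
    val spark = SparkSession.builder().master("local[*]").appName("narrow-wide-sketch").getOrCreate()
    val sc = spark.sparkContext

    // An RDD with 3 partitions of hypothetical words.
    val words = sc.parallelize(Seq("spark", "rdd", "dag", "spark"), 3)

    // Narrow dependencies: each output partition depends on exactly one
    // parent partition, so map and filter are pipelined into one stage.
    val pairs = words.map(w => (w, 1)).filter { case (w, _) => w.nonEmpty }

    // Wide dependency: reduceByKey needs rows from many parent partitions,
    // so it introduces a shuffle and a stage boundary.
    val counts = pairs.reduceByKey(_ + _)

    println(counts.collect().mkString(", "))
    spark.stop()
  }
}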
Comments

This is one of the best videos I have ever seen. Keep it up, boss!

advancetalks

So by pipelining, we mean that map() and filter() will be running in parallel:
map() running in parallel on all 3 partitions of RDD1 and filter() running in parallel on the 2 partitions of RDD3, with both running simultaneously.
Could you please let me know if my understanding is correct?
Could you also explain it using the Spark UI?

I had read earlier that one RDD going through a set of narrow transformations is known as pipelining: each partition of the RDD acts as a pipeline and goes through the whole set of narrow transformations. I have the code below:
text_file = sc.textFile("...")
words = text_file.flatMap(lambda x: x.split(" "))
words1 = words.map(lambda x: (x, 1))
print(type(words1))
This gave output along the lines of: <PipelinedRDD>
(I couldn't check the above code right now as I am having some trouble with my setup, but this is what I had done earlier.)

suganyakumar
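
On the pipelining question above, here is a minimal Scala sketch of the same kind of chain (the file name and session settings are hypothetical). Pipelining means the narrow transformations are fused into one stage, and each task streams its own partition through the whole chain without materializing the intermediate RDD; this fused chain is what PySpark surfaces as a PipelinedRDD.

import org.apache.spark.sql.SparkSession

object PipeliningSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("pipelining-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical input file.
    val textFile = sc.textFile("input.txt")

    // Both transformations below have narrow dependencies, so they end up in
    // the same stage: each task runs the whole chain on its own partition,
    // element by element, without materializing the intermediate RDD.
    val words = textFile.flatMap(_.split(" "))
    val pairs = words.map(w => (w, 1))

    // The action triggers a job; in the Spark UI the flatMap and map show up
    // inside a single stage rather than as two separate steps.
    println(pairs.take(5).mkString(", "))
    spark.stop()
  }
}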

Thank you so much for the video. Explained beautifully.

I was trying out some RDD operations with the code below, and I would be grateful if you could help me out.

package org.souvik.application

import org.apache.spark.sql.SparkSession

object Main {
  def main(args: Array[String]): Unit = {

    val spark = SparkSession
      .builder()
      .master("local[*]")
      .appName("RDD-practice")
      .getOrCreate()

    val RDD1 = sc.parallelize(Array(1, 2, 3, 4))
    println(RDD1.collect.mkString(", "))

  }

}

IntelliJ, however, does not recognise the action "collect". I tried importing it as well, and I have added the dependency for spark-core.

There is no error while executing with spark-shell.

Thank you in advance.

souvikray
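
One guess at what may be going on in the snippet above, sketched rather than verified: sc is only predefined inside spark-shell, so in a standalone application the SparkContext has to be taken from the SparkSession; SparkSession itself lives in the spark-sql module, so spark-core alone is not enough on the classpath. A minimal version along those lines:

package org.souvik.application

import org.apache.spark.sql.SparkSession

object Main {
  def main(args: Array[String]): Unit = {

    val spark = SparkSession
      .builder()
      .master("local[*]")
      .appName("RDD-practice")
      .getOrCreate()

    // spark-shell predefines `sc`; in a standalone application we take the
    // SparkContext from the SparkSession instead.
    val sc = spark.sparkContext

    val rdd1 = sc.parallelize(Array(1, 2, 3, 4))
    println(rdd1.collect().mkString(", "))

    spark.stop()
  }
}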