Apache Spark Internals: RDDs, Pipelining, Narrow & Wide Dependencies

In this video we'll understand Apache Spark's most fundamental abstraction: RDDs. Understanding this is essential for writing performant Spark code and for comprehending what's going on during execution.

00:00 Introduction
01:11 Traits of RDDs
04:34 Code Interface of RDDs
06:44 Understanding transformations
08:20 The DAG - directed acyclic graph
11:38 Types of dependencies
15:26 Optimization: Pipelining
17:47 Implementation of transformations
19:58 Summary
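As a rough illustration of the topics in the chapters above (not code from the video; the session settings and sample values are hypothetical), the sketch below chains two narrow transformations, which Spark pipelines into a single stage, and one wide transformation, which forces a shuffle and a new stage:

import org.apache.spark.sql.SparkSession

object NarrowWideSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical local session, just so a SparkContext is available.
    val spark = SparkSession.builder().master("local[*]").appName("narrow-wide-sketch").getOrCreate()
    val sc = spark.sparkContext

    // An RDD with 3 partitions of hypothetical words.
    val words = sc.parallelize(Seq("spark", "rdd", "dag", "spark"), 3)

    // Narrow dependencies: each output partition depends on exactly one
    // parent partition, so map and filter are pipelined into one stage.
    val pairs = words.map(w => (w, 1)).filter { case (w, _) => w.nonEmpty }

    // Wide dependency: reduceByKey needs rows from many parent partitions,
    // so it introduces a shuffle and a stage boundary.
    val counts = pairs.reduceByKey(_ + _)

    println(counts.collect().mkString(", "))
    spark.stop()
  }
}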
Comments

This is one of the best videos I have ever seen. Keep it up, boss!

advancetalks

So by pipelining, we mean that map() and filter() will be running in parallel:
map() running in parallel on all 3 partitions of RDD1 and filter() running in parallel on the 2 partitions of RDD3, with both running simultaneously.
Could you please let me know if my understanding is correct?
Could you also explain it using the Spark UI?

I had read earlier that one RDD going through a set of narrow transformations is known as pipelining: each partition of the RDD acts as a pipeline and goes through the whole set of narrow transformations. I have the code below:
text_file = sc.textFile("...")
words = text_file.flatMap(lambda x: x.split(" "))
words1 = words.map(lambda x: (x, 1))
print(type(words1))
This gave output along the lines of: <PipelinedRDD>
(I couldn't check the above code right now as I am having some trouble with my setup, but this is what I had done earlier.)

suganyakumar
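
On the pipelining question above, here is a minimal Scala sketch of the same kind of chain (the file name and session settings are hypothetical). Pipelining means the narrow transformations are fused into one stage, and each task streams its own partition through the whole chain without materializing the intermediate RDD; this fused chain is what PySpark surfaces as a PipelinedRDD.

import org.apache.spark.sql.SparkSession

object PipeliningSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("pipelining-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical input file.
    val textFile = sc.textFile("input.txt")

    // Both transformations below have narrow dependencies, so they end up in
    // the same stage: each task runs the whole chain on its own partition,
    // element by element, without materializing the intermediate RDD.
    val words = textFile.flatMap(_.split(" "))
    val pairs = words.map(w => (w, 1))

    // The action triggers a job; in the Spark UI the flatMap and map show up
    // inside a single stage rather than as two separate steps.
    println(pairs.take(5).mkString(", "))
    spark.stop()
  }
}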

Thank you so much for the video. Explained beautifully.

I was trying out some RDD operations with the code below, and I would be grateful if you could help me out.

package org.souvik.application

import org.apache.spark.sql.SparkSession

object Main {
  def main(args: Array[String]): Unit = {

    val spark = SparkSession
      .builder()
      .master("local[*]")
      .appName("RDD-practice")
      .getOrCreate()

    val RDD1 = sc.parallelize(Array(1, 2, 3, 4))
    println(RDD1.collect.mkString(", "))

  }

}

IntelliJ, however, does not recognise the action "collect". I tried importing it as well, and I have added the dependency for spark-core.

There is no error while executing with spark-shell.

Thank you in advance.

souvikray
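
One guess at what may be going on in the snippet above, sketched rather than verified: sc is only predefined inside spark-shell, so in a standalone application the SparkContext has to be taken from the SparkSession; SparkSession itself lives in the spark-sql module, so spark-core alone is not enough on the classpath. A minimal version along those lines:

package org.souvik.application

import org.apache.spark.sql.SparkSession

object Main {
  def main(args: Array[String]): Unit = {

    val spark = SparkSession
      .builder()
      .master("local[*]")
      .appName("RDD-practice")
      .getOrCreate()

    // spark-shell predefines `sc`; in a standalone application we take the
    // SparkContext from the SparkSession instead.
    val sc = spark.sparkContext

    val rdd1 = sc.parallelize(Array(1, 2, 3, 4))
    println(rdd1.collect().mkString(", "))

    spark.stop()
  }
}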