Spark Tutorial: Different ways to create an RDD, with examples

There are three ways to create an RDD.

The first way to create an RDD is to parallelize an object collection, meaning converting it to a distributed dataset that can be operated on in parallel. This is simple and doesn't require any data files, so it is often used to quickly try out a feature or do some experimenting in Spark.

To parallelize an object collection, call the parallelize method of the SparkContext class.

First way to create an RDD:

-------------------------------

val sc = new SparkContext("local[*]", "union")

val stringList = Array("Welcome to spark tutorials", "Spark examples")

val stringRDD = sc.parallelize(stringList)

Second way to create an RDD:

-------------------------------

The second way to create an RDD is to read a dataset from a storage system, which can be a local file system, HDFS, Cassandra, Amazon S3, and so on.

The first argument of the textFile method is a URI that points to a path or a file on the local machine or on a remote storage system. When it starts with an hdfs:// prefix, it points to a path or a file that resides on HDFS, and when it starts with an s3n:// prefix, it points to a path or a file that resides on AWS S3.

If a URI points to a directory, then the textFile method will read all the files in that directory.

The textFile method assumes each file is a text file and that each line is delimited by a newline. It returns an RDD that represents all the lines in all the files.
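As a minimal sketch of this second way, reading text files might look like the following (the paths, the HDFS host, and the app name here are illustrative assumptions, not from the original):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumed setup: a local Spark context (app name is arbitrary)
val conf = new SparkConf().setMaster("local[*]").setAppName("textFile-example")
val sc = new SparkContext(conf)

// Each element of linesRDD is one line from the file(s) at the given URI
val linesRDD = sc.textFile("sample.txt")                        // local file (assumed path)
// val hdfsRDD = sc.textFile("hdfs://namenode:9000/data/logs/") // HDFS directory: reads all files in it
// val s3RDD   = sc.textFile("s3n://my-bucket/data.txt")        // AWS S3 object

println(linesRDD.count()) // total number of lines across all matched files
```

Note that if the URI is a directory, every file inside it is read, and the resulting RDD holds the union of their lines.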

Third way to create an RDD:

-------------------------------

The third way to create an RDD is by invoking one of the transformation operations on an existing RDD.