Spark Tutorial: Different ways to create an RDD, with examples
There are three ways to create an RDD.
The first way to create an RDD is to parallelize an object collection, that is, to convert it into a distributed dataset that can be operated on in parallel. This is simple and doesn't require any data files, so it is often used to quickly try out a feature or experiment in Spark. To parallelize an object collection, call the parallelize method of the SparkContext class.
First way to create RDD:
-------------------------------
val sc = new SparkContext("local[*]", "union")
val stringList = Array("Welcome to spark tutorials", "Spark examples")
val stringRdd = sc.parallelize(stringList)   // RDD[String] distributed across local cores
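As a self-contained sketch of this first approach (the app name "union" is taken from the snippet; collect is used here just to verify the result):

```scala
import org.apache.spark.SparkContext

// Local-mode context; "union" is an arbitrary application name
val sc = new SparkContext("local[*]", "union")
val stringList = Array("Welcome to spark tutorials", "Spark examples")

// parallelize turns the local collection into a distributed RDD
val stringRdd = sc.parallelize(stringList)

// collect is an action: it runs the job and returns the data to the driver
println(stringRdd.collect().mkString(" | "))

sc.stop()
```

In spark-shell a SparkContext named sc already exists, so only the parallelize call is needed there.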
Second way to create RDD:
-------------------------------
The second way to create an RDD is to read a dataset from a storage system, which can be a local file system, HDFS, Cassandra, Amazon S3, and so on.
The first argument of the textFile method is a URI that points to a path or a file, either on the local machine or on a remote storage system. When it starts with an hdfs:// prefix, it points to a path or file that resides on HDFS, and when it starts with an s3n:// prefix, it points to a path or file that resides on AWS S3.
If the URI points to a directory, the textFile method reads all the files in that directory.
The textFile method assumes each file is a text file whose lines are delimited by newlines, and it returns an RDD that represents all the lines in all the files.
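A minimal sketch of this second approach; to keep it runnable it first writes a small sample file (the file name "sample.txt" is made up for the example):

```scala
import java.nio.file.{Files, Paths}
import org.apache.spark.SparkContext

// Write a two-line sample file so the example is self-contained
val path = "sample.txt"
Files.write(Paths.get(path), "Welcome to spark tutorials\nSpark examples\n".getBytes)

val sc = new SparkContext("local[*]", "textFile-example")

// textFile returns an RDD[String]: one element per line, across all matched files
val lines = sc.textFile(path)
println(lines.count())

sc.stop()
```

For remote storage the call is the same, only the URI changes, e.g. sc.textFile("hdfs://...") for HDFS.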
Third way to create RDD:
-------------------------------
The third way to create an RDD is by invoking one of the transformation operations on an existing RDD.
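A small sketch of this third approach: each transformation below (flatMap, filter) returns a new RDD derived from an existing one, without mutating the original.

```scala
import org.apache.spark.SparkContext

val sc = new SparkContext("local[*]", "transformation-example")
val sentences = sc.parallelize(Array("Welcome to spark tutorials", "Spark examples"))

// flatMap creates a new RDD of individual words from the RDD of sentences
val tokens = sentences.flatMap(_.split(" "))

// filter creates yet another RDD, keeping only the words containing "spark"
val sparkWords = tokens.filter(_.toLowerCase.contains("spark"))

println(sparkWords.collect().mkString(", "))  // spark, Spark

sc.stop()
```

Transformations are lazy: nothing runs until an action such as collect or count is invoked.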