Spark with Python Course. Lesson 1. Create Parallelized Collection RDD

A parallelized collection in Spark represents a distributed dataset of items that can be operated on in parallel across different nodes in the Spark cluster.

The example in the video shows how to use the SparkContext object to create a parallelized collection from a list of words.
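
For reference, a minimal sketch of this step (the local master, the app name, and the word list below are illustrative assumptions, not taken from the video):

from pyspark import SparkContext

# Start a SparkContext on the local machine (assumed configuration).
sc = SparkContext("local", "Lesson1")

# Distribute a plain Python list of words across the cluster as an RDD.
words = ["spark", "is", "a", "fast", "cluster", "computing", "engine"]
words_rdd = sc.parallelize(words)

print(words_rdd.collect())  # brings the distributed data back to the driver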

Once the RDD (Resilient Distributed Dataset) has been created, you can interact with it by applying the various transformations and actions available in the Spark API.
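
A short illustration of the difference, assuming the words_rdd created above: transformations such as map are lazy and only describe a new RDD, while actions such as count trigger the actual computation.

# map is a lazy transformation; count is an action that runs the job.
lengths_rdd = words_rdd.map(lambda word: len(word))
print(lengths_rdd.count())  # number of elements in the RDD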

The example in the video shows how to create a new RDD from the primary RDD by excluding words shorter than three characters.
For this, Spark's filter transformation and a lambda function were used, as sketched below.
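
A sketch of that filtering step (words_rdd and the three-character threshold follow the description above):

# Keep only words that are at least three characters long.
long_words_rdd = words_rdd.filter(lambda word: len(word) >= 3)
print(long_words_rdd.collect())  # e.g. ['spark', 'fast', 'cluster', 'computing', 'engine']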

RDD datasets can be operated on in parallel. An important parameter when creating a parallelized collection is the number of partitions to split the dataset into. Spark runs one task per partition; a common guideline is two to four partitions per CPU core, although Spark attempts to set this number automatically.
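
For example, the partition count can be passed as the second argument to parallelize (the value 4 below is an arbitrary choice for illustration):

# Explicitly split the dataset into 4 partitions; Spark runs one task per partition.
partitioned_rdd = sc.parallelize(words, 4)
print(partitioned_rdd.getNumPartitions())  # 4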

Used:
- Python 3.5.3
- Enthought Canopy
- Spark 2.3.2

Prepared by Vytautas Bielinskas.