Spark with Python Course. Lesson 1. Create Parallelized Collection RDD

A parallelized collection in Spark represents a distributed dataset of items that can be operated on in parallel across different nodes in the Spark cluster.

The example in the video shows how to use the SparkContext object to create a parallelized collection from a list of words.
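
For reference, a minimal sketch of this step (the local master, the app name, and the word list below are illustrative assumptions, not taken from the video):

from pyspark import SparkContext

# Start a SparkContext on the local machine (assumed configuration).
sc = SparkContext("local", "Lesson1")

# Distribute a plain Python list of words across the cluster as an RDD.
words = ["spark", "is", "a", "fast", "cluster", "computing", "engine"]
words_rdd = sc.parallelize(words)

print(words_rdd.collect())  # brings the distributed data back to the driver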

Once the RDD (Resilient Distributed Dataset) has been created, you can interact with it by applying the various transformations and actions available in the Spark API.
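
A short illustration of the difference, assuming the words_rdd created above: transformations such as map are lazy and only describe a new RDD, while actions such as count trigger the actual computation.

# map is a lazy transformation; count is an action that runs the job.
lengths_rdd = words_rdd.map(lambda word: len(word))
print(lengths_rdd.count())  # number of elements in the RDD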

The example in the video shows how to create a new RDD from the primary RDD by excluding words shorter than three characters.
For this, Spark's filter transformation and a lambda function were used, as sketched below.
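
A sketch of that filtering step (words_rdd and the three-character threshold follow the description above):

# Keep only words that are at least three characters long.
long_words_rdd = words_rdd.filter(lambda word: len(word) >= 3)
print(long_words_rdd.collect())  # e.g. ['spark', 'fast', 'cluster', 'computing', 'engine']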

RDD datasets can be operated on in parallel. An important parameter when creating a parallelized collection is the number of partitions to split the dataset into. Spark runs one task per partition; a common guideline is two to four partitions per CPU core, although Spark attempts to set this number automatically.
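
For example, the partition count can be passed as the second argument to parallelize (the value 4 below is an arbitrary choice for illustration):

# Explicitly split the dataset into 4 partitions; Spark runs one task per partition.
partitioned_rdd = sc.parallelize(words, 4)
print(partitioned_rdd.getNumPartitions())  # 4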

Used:
- Python 3.5.3
- Enthought Canopy
- Spark 2.3.2

Prepared by Vytautas Bielinskas.