55 - Spark RDD - PairRDD - Distinct

@backstreetbrogrammer

--------------------------------------------------------------------------------
Chapter 10 - Spark RDD - PairRDD - Distinct
--------------------------------------------------------------------------------
While most Spark operations work on RDDs containing any type of objects, a few special operations are only available on RDDs of key-value pairs. The most common ones are distributed "shuffle" operations, such as grouping or aggregating the elements by a key.

In Java, key-value pairs are represented using the scala.Tuple2 class from the Scala standard library. We can simply call new Tuple2(a, b) to create a tuple, and access its fields later with tuple._1() and tuple._2().
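
As a small illustration of the tuple API described above, the sketch below builds a pair and reads both fields back. It assumes the Scala standard library is on the classpath (it comes with any Spark dependency); the class name TupleExample and the helper wordWithLength are made up for this example.

```java
import scala.Tuple2;

public class TupleExample {

    // Hypothetical helper: pairs a word with its length.
    static Tuple2<String, Integer> wordWithLength(String word) {
        return new Tuple2<>(word, word.length());
    }

    public static void main(String[] args) {
        Tuple2<String, Integer> pair = wordWithLength("spark");
        // _1() returns the first field (the key), _2() the second (the value).
        System.out.println(pair._1() + " -> " + pair._2());
    }
}
```

Tuple2 instances are immutable, so a pair is never modified in place; a changed key or value means creating a new tuple.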

RDDs of key-value pairs are represented by the JavaPairRDD class. We can construct JavaPairRDD from JavaRDD using special versions of the map operations, like mapToPair and flatMapToPair. The JavaPairRDD will have both standard RDD functions and special key-value ones.
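
A minimal sketch of building a JavaPairRDD with mapToPair might look like the following. It assumes Spark Core is on the classpath and runs with a local master; the class name MapToPairExample, the helper pairWithLength, and the choice of word length as the value are illustrative assumptions, not something taken from the video.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class MapToPairExample {

    // Hypothetical helper: turns each word into a (word, length) pair.
    static List<Tuple2<String, Integer>> pairWithLength(List<String> words) {
        // Local master is an assumption so the sketch runs without a cluster.
        SparkConf conf = new SparkConf()
                .setAppName("MapToPairExample")
                .setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> rdd = sc.parallelize(words);
            // mapToPair is the pair-producing counterpart of map.
            JavaPairRDD<String, Integer> pairs =
                    rdd.mapToPair(word -> new Tuple2<>(word, word.length()));
            return pairs.collect();
        }
    }

    public static void main(String[] args) {
        // "java" appears twice, so the resulting pair RDD holds a duplicate key.
        pairWithLength(Arrays.asList("java", "spark", "java"))
                .forEach(pair -> System.out.println(pair._1() + " -> " + pair._2()));
    }
}
```

Because the input repeats a word, the collected list contains the same key more than once, which a java.util.Map would not allow.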

One big difference between a Java Map and Spark's JavaPairRDD is that a Map holds each key at most once, whereas a JavaPairRDD can contain the same key many times.

For example, the reduceByKey operation on key-value pairs can be used to count how many times each line of text occurs in a file.
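
The code for this count is not included in the description, so here is one possible sketch of it. To stay self-contained it reads lines from an in-memory list instead of a file (with a real file, sc.textFile(path) would supply the JavaRDD of lines); the class name LineCountExample and the helper countLines are assumptions made for this example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class LineCountExample {

    // Hypothetical helper: counts how often each distinct line occurs.
    static Map<String, Integer> countLines(List<String> lines) {
        // Local master is an assumption so the sketch runs without a cluster.
        SparkConf conf = new SparkConf()
                .setAppName("LineCountExample")
                .setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> rdd = sc.parallelize(lines);
            // Pair every line with the count 1, giving (line, 1).
            JavaPairRDD<String, Integer> pairs =
                    rdd.mapToPair(line -> new Tuple2<>(line, 1));
            // Sum the counts of all pairs that share the same key.
            JavaPairRDD<String, Integer> counts =
                    pairs.reduceByKey((a, b) -> a + b);
            // Copy the result to a plain HashMap before the context closes.
            return new HashMap<>(counts.collectAsMap());
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
                countLines(Arrays.asList("hello", "world", "hello"));
        counts.forEach((line, count) -> System.out.println(line + ": " + count));
    }
}
```

collectAsMap brings the whole result to the driver, so it is only suitable when the number of distinct lines is small enough to fit in memory.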

When using custom objects as the key in key-value pair operations, we must be sure that a custom equals() method is accompanied by a matching hashCode() method.
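
The equals() and hashCode() contract can be sketched with a small key class. The example below uses plain Java collections rather than Spark so that it runs without any dependencies; the class CityKey, its fields, and the helper distinctCount are hypothetical. The key also implements Serializable, since objects used as keys in Spark must be serializable to be shuffled between nodes.

```java
import java.io.Serializable;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

public class CustomKeyExample {

    // Hypothetical key type made of two fields.
    public static final class CityKey implements Serializable {
        private final String country;
        private final String city;

        public CityKey(String country, String city) {
            this.country = country;
            this.city = city;
        }

        // Two keys are equal when both fields match.
        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof CityKey)) return false;
            CityKey other = (CityKey) o;
            return country.equals(other.country) && city.equals(other.city);
        }

        // Uses the same fields as equals(), so equal keys share a hash code.
        @Override
        public int hashCode() {
            return Objects.hash(country, city);
        }
    }

    // Counts distinct keys with a HashSet, which relies on both methods.
    public static int distinctCount(CityKey... keys) {
        Set<CityKey> unique = new HashSet<>(Arrays.asList(keys));
        return unique.size();
    }

    public static void main(String[] args) {
        int count = distinctCount(
                new CityKey("IN", "Delhi"),
                new CityKey("IN", "Delhi"),
                new CityKey("IN", "Pune"));
        System.out.println(count); // prints 2
    }
}
```

If hashCode() were left at its default, the two equal "Delhi" keys would very likely land in different hash buckets and be treated as separate keys. Key-based Spark operations can misbehave in the same way, because they also depend on equal keys producing equal hash codes.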

Rishi Srivastava
