RDD vs Dataframe vs Dataset
RDD, DataFrame, and Dataset have a lot in common, and each of them provides most of the same core features. For example, all 3 of them:
• support processing both structured and unstructured data.
• work with file formats such as text, CSV, JSON, and Parquet.
• can access data from sources such as an RDBMS, text files, HDFS, or NoSQL databases.
• are immutable collections.
• evaluate their operations lazily.
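The last two points, immutability and lazy evaluation, can be observed with a minimal sketch like the one below. It assumes Spark 3.x with Scala, run as a script or pasted into spark-shell; the session settings, the sample numbers, and the accumulator name are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only; master and app name are arbitrary.
val spark = SparkSession.builder().master("local[*]").appName("lazy-demo").getOrCreate()

// Accumulator used only to observe when the map function actually runs.
val calls = spark.sparkContext.longAccumulator("calls")

val numbers = spark.sparkContext.parallelize(Seq(1, 2, 3))

// Transformation: returns a new RDD and leaves `numbers` unchanged.
val doubled = numbers.map { n => calls.add(1); n * 2 }
val callsBeforeAction = calls.value.longValue // the map function has not run yet

// Action: this is what triggers the computation.
val result = doubled.collect()
val callsAfterAction = calls.value.longValue

spark.stop()
```

The same pattern holds for DataFrames and Datasets: select, filter, and similar calls only build up a plan, and nothing runs until an action such as collect or count.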
So what really is the difference between the 3 Spark abstractions? In this video, I am going to tell you 8 concrete differences between them. Let’s start with the first one.
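1) Both the RDD and the DataFrame are distributed collections of elements. In a DataFrame, however, the data is organized into named columns, which makes it conceptually equivalent to a table in an RDBMS. The Dataset is an extension of the DataFrame API; since Spark 2.0, a DataFrame in Scala is simply a Dataset of Row objects (Dataset[Row]).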
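To make this first difference concrete, here is a small sketch that holds the same records in all 3 abstractions. It assumes Spark 3.x with Scala (script or spark-shell); the Person case class and the sample values are hypothetical.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical record type used only for this illustration.
case class Person(name: String, age: Int)

val spark = SparkSession.builder().master("local[*]").appName("abstractions").getOrCreate()
import spark.implicits._

val people = Seq(Person("Ann", 31), Person("Bob", 25))

// RDD: a distributed collection of Person objects; Spark knows no schema for it.
val rdd: RDD[Person] = spark.sparkContext.parallelize(people)

// DataFrame: rows organized into the named columns "name" and "age".
val df: DataFrame = people.toDF()

// Dataset: the same named columns, plus the Person type for each element.
val ds: Dataset[Person] = people.toDS()

df.printSchema()
val columns = df.columns.toSeq
val rddSize = rdd.count()
val firstName = ds.first().name

spark.stop()
```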
2) The RDD is the low-level abstraction that Spark has had from the beginning, so it was already part of the Apache Spark 1.0 release. The DataFrame was introduced in release 1.3, and the Dataset abstraction followed in release 1.6.
3) A DataFrame doesn’t provide compile-time type safety. For example, if you try to access a column that doesn’t exist in the DataFrame, the compiler doesn’t complain; the error only shows up when you run the code. That is not the case with the typed Dataset API in Scala and Java. If you refer to a field that doesn’t exist on the element type, the code fails to compile, and an IDE can flag the wrong name as you type it. This saves developer time and cost.
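The type-safety difference can be sketched as follows, assuming Spark 3.x with Scala. The Person case class and the misspelled column name "nmae" are made up for the illustration.

```scala
import org.apache.spark.sql.{AnalysisException, SparkSession}
import scala.util.Try

// Hypothetical record type used only for this illustration.
case class Person(name: String, age: Int)

val spark = SparkSession.builder().master("local[*]").appName("type-safety").getOrCreate()
import spark.implicits._

val ds = Seq(Person("Ann", 31), Person("Bob", 25)).toDS()
val df = ds.toDF()

// DataFrame: the misspelled column name compiles fine and fails only at run time.
val dfAttempt = Try(df.select("nmae").collect())
val failedAtRuntime = dfAttempt.failed.toOption.exists(_.isInstanceOf[AnalysisException])

// Dataset: field access goes through the Person type, so the compiler checks it.
val names = ds.map(_.name).collect()
// ds.map(_.nmae) // would not compile: nmae is not a member of Person

spark.stop()
```

Note that the compile-time check only applies to the typed operations (map, filter with a lambda, and so on). Calling ds.select("nmae") on a Dataset still uses a string and fails at run time, just like the DataFrame.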
4) APIs such as agg, select, sum, and avg were introduced with DataFrames and Datasets. These operations make the code much more readable than the functional-style operations on an RDD. Take a word count program as an example. Let’s say “rdd” is an RDD that contains the text for the word count, and “df” is a DataFrame that contains the same text. All the program does is calculate the number of times each word occurs. The operations on the DataFrame are much more readable than their RDD counterparts.
The goal here is not to show how to write a word count program, but to show how readable the operations are when working on DataFrames compared to RDDs.
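The word count code from the video is not part of this description, so the following is only a sketch of what the two versions could look like, assuming Spark 3.x with Scala. The sample lines and the column names "line" and "word" are made up.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode, split}

val spark = SparkSession.builder().master("local[*]").appName("word-count").getOrCreate()
import spark.implicits._

val lines = Seq("to be or not to be", "to do") // made-up sample text

// RDD version: functional operations on (word, 1) pairs.
val rdd = spark.sparkContext.parallelize(lines)
val rddCounts = rdd
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .collect()
  .toMap

// DataFrame version: named columns and relational operations.
val df = lines.toDF("line")
val dfCounts = df
  .select(explode(split(col("line"), " ")).as("word"))
  .groupBy("word")
  .count()
  .collect()
  .map(row => row.getString(0) -> row.getLong(1))
  .toMap

spark.stop()
```

The DataFrame version names what it is doing (split, group by word, count), while the RDD version leaves you to work out the meaning of the tuples.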
5) The DataFrame and Dataset APIs are built on top of the Spark SQL engine, so we can write SQL queries to access the elements. Spark SQL cannot be used directly on an RDD; the RDD first has to be converted to a DataFrame or Dataset.
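As a rough sketch of this point, assuming Spark 3.x with Scala: the RDD below has to be converted with toDF before it can be queried with SQL. The view name "people", the column names, and the data are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("sql-demo").getOrCreate()
import spark.implicits._

// An RDD of tuples cannot be queried with SQL as it is.
val rdd = spark.sparkContext.parallelize(Seq(("Ann", 31), ("Bob", 25), ("Cid", 40)))

// Convert it to a DataFrame with named columns and register it as a temporary view.
val df = rdd.toDF("name", "age")
df.createOrReplaceTempView("people")

// Now plain SQL works against the view.
val over30 = spark
  .sql("SELECT name FROM people WHERE age > 30 ORDER BY name")
  .as[String]
  .collect()

spark.stop()
```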
6) The Catalyst optimizer was introduced with DataFrames. It improves performance by analyzing the query, optimizing the logical plan, and then generating an efficient physical execution plan. This results in more efficiency and speed. Datasets also leverage this functionality. RDDs, however, don’t get the benefits of the Catalyst optimizer.
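If you want to see what Catalyst produces, explain(true) prints the logical and physical plans for a query. Below is a minimal sketch, assuming Spark 3.x with Scala; the data and column names are made up, and the exact plan text varies between Spark versions.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("catalyst-demo").getOrCreate()
import spark.implicits._

val df = Seq(("Ann", 31), ("Bob", 25)).toDF("name", "age")
val query = df.filter($"age" > 30).select("name")

// Prints the parsed, analyzed, and optimized logical plans plus the physical plan.
query.explain(true)

// The same plans are also available programmatically through queryExecution.
val optimizedPlan = query.queryExecution.optimizedPlan.toString
val physicalPlan = query.queryExecution.executedPlan.toString

val names = query.collect().map(_.getString(0))

spark.stop()
```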
7) RDDs don’t use the Tungsten component, whereas Datasets and DataFrames do. Tungsten (part of the Spark SQL engine) stores the data in a compact binary format and can keep it in off-heap memory. This provides 3 main advantages:
a. It reduces garbage collection overhead by storing the data in off-heap memory.
b. It occupies less memory space.
c. It avoids expensive Java serialization by storing the data in binary format.
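One thing worth knowing here is that off-heap storage is switched off by default in Spark and has to be enabled through configuration. A minimal sketch, assuming Spark 3.x with Scala and a fresh session; the 256m size is an arbitrary value picked for illustration.

```scala
import org.apache.spark.sql.SparkSession

// These settings must be in place before the SparkContext is created,
// so they go on the builder. The size below is arbitrary.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("offheap-demo")
  .config("spark.memory.offHeap.enabled", "true")
  .config("spark.memory.offHeap.size", "256m")
  .getOrCreate()

// Read the settings back from the running context to confirm them.
val offHeapEnabled = spark.sparkContext.getConf.get("spark.memory.offHeap.enabled")
val offHeapSize = spark.sparkContext.getConf.get("spark.memory.offHeap.size")

spark.stop()
```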
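8) A Dataset provides advanced encoders, which map your own types to Spark’s binary format and allow on-demand access to individual attributes without deserializing the whole object. A DataFrame works with generic Row objects, so it doesn’t offer this for your own types.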
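Encoders can be sketched like this, assuming Spark 3.x with Scala. The Person case class and the data are hypothetical; Encoders.product derives an encoder for a case class.

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

// Hypothetical record type used only for this illustration.
case class Person(name: String, age: Int)

val spark = SparkSession.builder().master("local[*]").appName("encoder-demo").getOrCreate()
import spark.implicits._

// The encoder knows how each Person field maps to a column.
val personEncoder: Encoder[Person] = Encoders.product[Person]
val fieldNames = personEncoder.schema.fieldNames.toSeq

val ds = Seq(Person("Ann", 31), Person("Bob", 25)).toDS()

// A typed column selects a single attribute and keeps its type (Dataset[Int]).
val ages = ds.select($"age".as[Int]).collect()

spark.stop()
```

In everyday code you rarely create encoders by hand, because import spark.implicits._ brings them into scope for case classes and common types.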