What is a Spark DataFrame?

A Spark DataFrame is a distributed collection of data, introduced in the Spark 1.3 release. You may be wondering how it really differs from an RDD, which is also a distributed collection of data. The two do have a lot in common: like an RDD, a DataFrame is immutable, lazily evaluated, and distributed, and it can be created from many different sources and file formats.
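The shared properties can be seen in a few lines of Scala. This is a minimal sketch, assuming Spark 2.x or later (where SparkSession is the entry point) running in local mode; the object name, app name, and sample data are hypothetical and chosen only for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object DataFrameBasics {
  // Local session for illustration; the app name and master are assumptions.
  def localSession(): SparkSession =
    SparkSession.builder()
      .appName("dataframe-basics")
      .master("local[*]")
      .getOrCreate()

  // Hypothetical sample data turned into a DataFrame with two named columns.
  def samplePeople(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(("Alice", 34), ("Bob", 45), ("Cara", 29)).toDF("name", "age")
  }

  def main(args: Array[String]): Unit = {
    val spark = localSession()
    val people = samplePeople(spark)

    // Lazy: filter only records a plan and returns a NEW DataFrame.
    val olderThan30 = people.filter(people("age") > 30)

    // Actions such as count() trigger execution. `people` itself is unchanged.
    println(people.count())      // 3
    println(olderThan30.count()) // 2

    spark.stop()
  }
}
```

Nothing is computed until count() is called, and the filter leaves the original DataFrame untouched, which is the same behavior an RDD would show.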

1) The main difference is that a DataFrame is organized into named columns, whereas an RDD is not.
2) A DataFrame also benefits from the Tungsten component, which stores data in a compact binary format and so reduces the overhead of Java serialization and garbage collection.
3) A DataFrame also comes with the Catalyst optimizer, which rewrites the logical query plan and selects a physical execution plan, resulting in a new, optimized DAG.
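The differences listed above can be sketched by running the same query both ways. This is an illustrative example only, assuming Spark 2.x or later in local mode; the object name, helper names, and sample rows are hypothetical. With the RDD, fields are addressed by tuple position and Spark cannot see inside the lambda. With the DataFrame, columns are addressed by name and explain(true) prints the plans that Catalyst produces.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object RddVsDataFrame {
  // Hypothetical sample rows: (name, country, age).
  val rows: Seq[(String, String, Int)] =
    Seq(("Alice", "NL", 34), ("Bob", "US", 45), ("Cara", "NL", 29))

  // RDD version: opaque tuples, fields addressed by position.
  def namesViaRdd(spark: SparkSession, country: String): Array[String] =
    spark.sparkContext.parallelize(rows)
      .filter(_._2 == country)
      .map(_._1)
      .collect()

  // DataFrame version: the same data with a schema and named columns.
  def namesViaDataFrame(spark: SparkSession, country: String): DataFrame = {
    import spark.implicits._
    rows.toDF("name", "country", "age")
      .filter($"country" === country)
      .select("name")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-vs-dataframe")
      .master("local[*]")
      .getOrCreate()

    println(namesViaRdd(spark, "NL").mkString(", "))

    val query = namesViaDataFrame(spark, "NL")
    query.printSchema()
    // Prints the parsed, analyzed, and optimized logical plans and the physical plan.
    query.explain(true)
    query.show()

    spark.stop()
  }
}
```

Because the DataFrame query is expressed against named columns rather than arbitrary functions, Spark can inspect and optimize it before any data is read.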

To summarize, a DataFrame is a distributed collection of data organized into named columns, with custom memory management (the Tungsten component) and query optimization (the Catalyst optimizer).