Most common filesystems used by Apache Spark


Spark can read and store data on various filesystems, including the MapR File System, Google Cloud Storage, Amazon S3, and HDFS. In this video, we look at the two most commonly used ones: HDFS and Amazon S3.

HDFS vs Amazon S3


HDFS
⮚ HDFS (Hadoop Distributed File System) is a commonly used distributed filesystem that works seamlessly with Spark.

⮚ HDFS is distributed, scalable, resilient to node failure, and designed to run on inexpensive commodity hardware.

⮚ Spark can easily read data from and write data to HDFS by passing an hdfs:// path to its standard read and write methods.
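
The HDFS read and write calls can be sketched in PySpark roughly as follows. This is a minimal illustration, not the statements from the video: the NameNode host `namenode`, port `8020`, the file paths, and the helper names `hdfs_uri`, `copy_csv_to_parquet`, and `run_hdfs_example` are all assumptions made for the example.

```python
def hdfs_uri(host: str, port: int, path: str) -> str:
    """Build an hdfs:// URI from a NameNode host, port, and file path."""
    return f"hdfs://{host}:{port}/{path.lstrip('/')}"


def copy_csv_to_parquet(spark, source_uri: str, target_uri: str) -> None:
    """Read a CSV file through the given SparkSession and write it as Parquet."""
    # header=True uses the first row as column names;
    # inferSchema=True lets Spark guess the column types.
    df = spark.read.csv(source_uri, header=True, inferSchema=True)
    # mode("overwrite") replaces the target directory if it already exists.
    df.write.mode("overwrite").parquet(target_uri)


def run_hdfs_example() -> None:
    """Call this on a machine with PySpark installed and access to an HDFS cluster."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hdfs-example").getOrCreate()
    # Hypothetical NameNode address and paths; replace with your cluster's values.
    source = hdfs_uri("namenode", 8020, "/data/input/events.csv")
    target = hdfs_uri("namenode", 8020, "/data/output/events_parquet")
    copy_csv_to_parquet(spark, source, target)
    spark.stop()
```

When Spark runs on a cluster whose default filesystem is already HDFS, a plain path such as `/data/input/events.csv` usually resolves to HDFS as well, so the explicit host and port are only needed when addressing a specific NameNode.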

Amazon S3
⮚ Amazon S3 (Simple Storage Service) is an object storage service offered by Amazon Web Services.

⮚ It is a scalable, high-speed, low-cost storage service.

⮚ Spark can easily read data from and write data to Amazon S3 by passing an S3 path (typically an s3a:// path) to its standard read and write methods.
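
Reading and writing S3 data follows the same pattern, sketched below in PySpark. This is only an illustration under stated assumptions: the bucket name `my-bucket`, the object keys, and the helper names `s3a_uri`, `copy_json_to_csv`, and `run_s3_example` are hypothetical. The sketch also assumes that Hadoop's `hadoop-aws` module (which provides the S3A connector) is on Spark's classpath and that AWS credentials are available in the environment.

```python
def s3a_uri(bucket: str, key: str) -> str:
    """Build an s3a:// URI from a bucket name and an object key."""
    return f"s3a://{bucket}/{key.lstrip('/')}"


def copy_json_to_csv(spark, source_uri: str, target_uri: str) -> None:
    """Read JSON data through the given SparkSession and write it as CSV."""
    df = spark.read.json(source_uri)
    # header=True writes the column names as the first row of each CSV file.
    df.write.mode("overwrite").csv(target_uri, header=True)


def run_s3_example() -> None:
    """Call this where PySpark, the S3A connector, and AWS credentials are set up."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("s3-example").getOrCreate()
    # Hypothetical bucket and keys; replace with your own.
    source = s3a_uri("my-bucket", "input/events.json")
    target = s3a_uri("my-bucket", "output/events_csv")
    copy_json_to_csv(spark, source, target)
    spark.stop()
```

Only the URI scheme differs from the HDFS case: the DataFrame reader and writer calls stay the same, which is why the same Spark job can often be pointed at either filesystem by changing the path.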

In this video, we saw that HDFS and Amazon S3 are the two filesystems most commonly used with Spark. We also got a glimpse of how to read and write data on each of them using Apache Spark.