Data Analysis on Large Datasets Using Apache Spark on Amazon EMR

🔬 Experiment: Exploratory Data Analysis on Large Datasets Using Apache Spark on Amazon EMR
🎯 Objective:
To perform scalable and efficient exploratory data analysis (EDA) on a large dataset using Apache Spark running on Amazon EMR, leveraging distributed computing for high performance.
________________________________________
🧰 Tools & Technologies:
• Amazon EMR: Elastic MapReduce, AWS's managed service for Spark/Hadoop clusters
• Apache Spark: in-memory distributed processing engine
• Amazon S3: storage for dataset input and output
• Jupyter Notebook / Zeppelin: interactive analysis environments
• Python (PySpark): API for writing Spark jobs
________________________________________
📊 Dataset:
• Use a publicly available large dataset (e.g., NYC Taxi Trips, Wikipedia Pageviews, or Common Crawl)
• Store the dataset in Amazon S3 for Spark to access
________________________________________
🧪 Experimental Setup:
Step 1: Set Up the EMR Cluster
• Open the AWS Management Console
• Launch an EMR cluster with:
  ◦ Applications: Apache Spark, Hadoop, JupyterHub (optional)
  ◦ An S3 bucket configured for logs and notebooks
• The same cluster can also be launched programmatically, as in the sketch below
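For a repeatable setup, the cluster can be launched from Python with boto3. A minimal sketch, with placeholder names throughout (region, bucket, release label, and instance types are illustrative assumptions, not from the original):

import boto3

# Launch a small Spark cluster on EMR; all names below are assumptions.
emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="eda-spark-cluster",
    ReleaseLabel="emr-6.15.0",                      # pick a current EMR release
    Applications=[{"Name": "Spark"}, {"Name": "Hadoop"}],
    LogUri="s3://my-eda-bucket/emr-logs/",          # hypothetical log bucket
    Instances={
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,        # keep the cluster up for notebooks
    },
    JobFlowRole="EMR_EC2_DefaultRole",              # default EMR instance profile
    ServiceRole="EMR_DefaultRole",                  # default EMR service role
    VisibleToAllUsers=True,
)
print("Cluster ID:", response["JobFlowId"])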
Step 2: Upload the Dataset to S3
• Place the raw data files (CSV/JSON/Parquet) into a designated S3 bucket, as in the sketch below
• Ensure the cluster's IAM role has permission to read the bucket
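The upload can go through the console, the AWS CLI, or boto3. A minimal boto3 sketch (bucket name and paths are hypothetical):

import boto3

# Upload a local raw file into the dataset bucket; names are illustrative.
s3 = boto3.client("s3")
s3.upload_file("trips.csv", "my-eda-bucket", "raw/trips.csv")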
Step 3: Connect to EMR via a Notebook
• Attach a Jupyter notebook to the cluster, or SSH into the master node
• Start analyzing with PySpark or Spark SQL, beginning with a session as sketched below
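An EMR notebook usually arrives with a spark session pre-created, and the pyspark shell on the master node creates one as well; if you need to build one yourself, a minimal sketch:

from pyspark.sql import SparkSession

# Obtain (or create) a SparkSession for interactive EDA; the app name is arbitrary.
spark = (SparkSession.builder
         .appName("large-dataset-eda")
         .getOrCreate())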
Step 4: Perform EDA with Spark
Key tasks, combined in the sketch after this list, include:
1. Data loading: read the raw files from S3 into a Spark DataFrame
2. Basic exploration: inspect the schema, row count, and summary statistics
3. Missing values & nulls: count nulls per column to gauge data quality
4. Aggregations & groupings: group by key categories and compute counts and averages
5. Visualization: pull a small sample into pandas and plot it with matplotlib/seaborn in the notebook
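A minimal PySpark sketch covering all five tasks. The S3 path and column names (trip_distance, fare_amount, passenger_count) are hypothetical placeholders in the spirit of the NYC Taxi example:

from pyspark.sql import functions as F
import matplotlib.pyplot as plt

# 1. Data loading: read raw CSV from S3 (path is a placeholder)
df = (spark.read
      .option("header", True)
      .option("inferSchema", True)
      .csv("s3://my-eda-bucket/raw/trips.csv"))

# 2. Basic exploration: schema, row count, summary statistics
df.printSchema()
print("rows:", df.count())
df.describe("trip_distance", "fare_amount").show()

# 3. Missing values & nulls: count nulls per column in a single pass
df.select([F.sum(F.col(c).isNull().cast("int")).alias(c)
           for c in df.columns]).show()

# 4. Aggregations & groupings: trip count and average fare per passenger count
(df.groupBy("passenger_count")
   .agg(F.count("*").alias("trips"),
        F.avg("fare_amount").alias("avg_fare"))
   .orderBy("passenger_count")
   .show())

# Approximate quartiles flag outliers without a full sort
q1, q3 = df.approxQuantile("fare_amount", [0.25, 0.75], 0.01)
print("IQR fences:", q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1))

# 5. Visualization: bring only a small sample to the driver for plotting
sample = df.sample(fraction=0.001, seed=42).toPandas()
sample["trip_distance"].hist(bins=50)
plt.xlabel("trip_distance")
plt.show()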
________________________________________
📈 Observations:
• Note the distributions of key variables
• Identify outliers, missing data, and skewed categories
• Surface correlations and trends from grouped aggregations
📌 Results:
• Datasets in the 10–100+ GB range were handled without performance issues
• Spark's distributed architecture executed the analysis faster than a traditional single-machine pandas workflow
________________________________________
🧩 Challenges:
• Optimizing memory and partitioning for better performance (a few common knobs are sketched below)
• Initial cluster setup and cost management
• Handling skewed data and serialization limits
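A few Spark 3.x settings that commonly address the memory, partitioning, and skew issues above (the values are illustrative starting points, not tuned recommendations):

# Runtime knobs for partitioning and skew (Spark 3.x)
spark.conf.set("spark.sql.shuffle.partitions", "400")          # size to total cluster cores
spark.conf.set("spark.sql.adaptive.enabled", "true")           # let AQE re-plan at runtime
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")  # split skewed join partitions

# Repartition on a higher-cardinality key before a heavy aggregation
df = df.repartition(400, "pickup_date")                        # hypothetical column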