Apache Spark for Data Science #1 - How to Install and Get Started with PySpark | Better Data Science

Want to learn Apache Spark for Data Science? This guide will help you get started. Learn how to install PySpark and load your first dataset with Python.

00:00 Introduction
01:25 Virtual environment setup
03:22 How to start a Spark Session
05:48 How to read datasets with Spark
08:19 Outro
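
For quick reference, the chapter steps above boil down to a few lines of PySpark. A minimal sketch, assuming PySpark is already installed in the active virtual environment (pip install pyspark); the app name and file path are placeholders, not taken from the video:
***
from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session; the app name is arbitrary
spark = SparkSession.builder \
    .appName("pyspark-intro") \
    .getOrCreate()

# Read a CSV dataset, letting Spark infer the column types
df = spark.read.csv("./dataset.csv", header=True, inferSchema=True)
df.show(5)
***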

Comments

Hi Dario,

I downloaded the Boston Housing dataset from Kaggle, but it's only showing a single column when I run df.show(). Could you confirm that the dataset still works with this tutorial?

Karol-cdne

To read all columns in this dataset, just use:
df = spark.read.csv("./housing_formated.csv", header=True, inferSchema=True, sep="\t")

The dataset needs to be formatted first, adding tab ("\t") separators and a header row. Just run the following in a cell:
***
import re

file_name = "./housing.csv"
with open(file_name, "r") as f_in:
    lines = f_in.readlines()

with open("./housing_formated.csv", "w+") as f_out:
    header = ["crim", "zn", "indus", "chas", "nox", "rm", "age", "dis", "rad", "tax", "ptratio", "b", "lstat", "medv"]

    # Write the tab-separated header row first
    f_out.write("\t".join(header) + "\n")

    # Collapse the variable whitespace in each row into single tabs
    f_out.writelines("\n".join(
        "\t".join(re.split(r"\s+", line.strip()))
        for line in lines
    ))
***


It seems that Spark's CSV reader, unlike pandas, does not support regular-expression separators like "\s+" by default...
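
A possible workaround that skips the reformatting step entirely, assuming the raw file is whitespace-separated with no header row: let pandas handle the regex separator and hand the result to Spark. A sketch (the column names mirror the ones above):
***
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

cols = ["crim", "zn", "indus", "chas", "nox", "rm", "age",
        "dis", "rad", "tax", "ptratio", "b", "lstat", "medv"]

# pandas accepts regex separators, so it can parse the raw file directly
pdf = pd.read_csv("./housing.csv", sep=r"\s+", header=None, names=cols)

# Hand the parsed data to Spark as a distributed DataFrame
df = spark.createDataFrame(pdf)
df.show(5)
***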

dannysebastiandiazpadilla

Is it OK if I make a GitHub repo with all of your Apache Spark tutorials?

sanyk