Apache Spark for Data Science #1 - How to Install and Get Started with PySpark | Better Data Science

Want to learn Apache Spark for Data Science? This guide will help you get started. Learn how to install PySpark and load your first dataset with Python.

00:00 Introduction
01:25 Virtual environment setup
03:22 How to start a Spark Session
05:48 How to read datasets with Spark
08:19 Outro
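
For quick reference, the chapter steps above boil down to a few lines of PySpark. A minimal sketch, assuming PySpark is already installed in the active virtual environment (pip install pyspark); the app name and file path are placeholders, not taken from the video:
***
from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session; the app name is arbitrary
spark = SparkSession.builder \
    .appName("pyspark-intro") \
    .getOrCreate()

# Read a CSV dataset, letting Spark infer the column types
df = spark.read.csv("./dataset.csv", header=True, inferSchema=True)
df.show(5)
***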

Comments

Hi Dario,

I downloaded the Boston Housing dataset from Kaggle, but it's only showing a single column when I run df.show(). Could you confirm that the dataset still works with this tutorial?

Karol-cdne

To read all columns in this dataset, just use:
df = spark.read.csv("./housing_formated.csv", header=True, inferSchema=True, sep="\t")

The dataset needs to be formatted first, adding tab ("\t") separators and a header row. Just run the following in a cell:
***
import re

file_name = "./housing.csv"
with open(file_name, "r") as f_in:
    lines = f_in.readlines()

with open("./housing_formated.csv", "w+") as f_out:
    header = ["crim", "zn", "indus", "chas", "nox", "rm", "age", "dis", "rad", "tax", "ptratio", "b", "lstat", "medv"]

    # Write the tab-separated header row first
    f_out.write("\t".join(header) + "\n")

    # Collapse the variable whitespace in each row into single tabs
    f_out.writelines("\n".join(
        "\t".join(re.split(r"\s+", line.strip()))
        for line in lines
    ))
***


It seems that Spark's CSV reader, unlike pandas, does not support regular-expression separators like "\s+" by default...
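
A possible workaround that skips the reformatting step entirely, assuming the raw file is whitespace-separated with no header row: let pandas handle the regex separator and hand the result to Spark. A sketch (the column names mirror the ones above):
***
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

cols = ["crim", "zn", "indus", "chas", "nox", "rm", "age",
        "dis", "rad", "tax", "ptratio", "b", "lstat", "medv"]

# pandas accepts regex separators, so it can parse the raw file directly
pdf = pd.read_csv("./housing.csv", sep=r"\s+", header=None, names=cols)

# Hand the parsed data to Spark as a distributed DataFrame
df = spark.createDataFrame(pdf)
df.show(5)
***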

dannysebastiandiazpadilla

Is it OK if I make a GitHub repo with all of your Apache Spark tutorials?

sanyk