PySpark Tutorial: Intro to data cleaning with Apache Spark


---

Welcome to Data Cleaning in Apache Spark with Python. My name is Mike Metzger, I am a Data Engineering Consultant, and I will be your instructor for this course. We will cover what data cleaning is, why it's important, and how to implement it with Spark and Python. Let's get started!

In this course, we'll define "data cleaning" as preparing raw data for use in processing pipelines. We'll discuss what a pipeline is later on, but for now, it's sufficient to say that data cleaning is a necessary part of any production data system. If your data isn't "clean", it's not trustworthy and could cause problems later on.

There are many tasks that could fall under the data cleaning umbrella. A few of these include reformatting or replacing text; performing calculations based on the data; and removing garbage or incomplete data.

Most data cleaning systems have two big problems: optimizing performance and organizing the flow of data.

A typical programming language (such as Perl, C++, or even standard SQL) may be able to clean small quantities of data. But consider what happens when you have millions or even billions of pieces of data: those languages can't process that amount of information in a timely manner. Spark lets you scale your data processing capacity as your requirements evolve.

Beyond the performance issues, dealing with large quantities of data requires a process or pipeline of steps. Spark allows the management of many complex tasks within a single framework.

Here's an example of cleaning a small data set. We're given a table of names, age in years, and a city. Our requirements are for a DataFrame with first and last name in separate columns, the age in months, and which state the city is in. We also want to remove any rows where the data is out of the ordinary.
Using Spark transformations, we can create a DataFrame with these properties and continue processing afterward.
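
A minimal sketch of those steps, assuming a DataFrame `df` with columns `name`, `age_years`, and `city`, plus a small `city_state` lookup DataFrame (all of these names are made up for illustration):

```python
from pyspark.sql import functions as F

cleaned = (
    df
    # Split "First Last" into separate first and last name columns
    .withColumn("first_name", F.split(F.col("name"), " ").getItem(0))
    .withColumn("last_name", F.split(F.col("name"), " ").getItem(1))
    # Convert the age from years to months
    .withColumn("age_months", F.col("age_years") * 12)
    # Look up which state each city is in
    .join(city_state, on="city", how="left")
    # Drop rows that look out of the ordinary (here, implausible ages)
    .filter((F.col("age_years") > 0) & (F.col("age_years") < 120))
    .select("first_name", "last_name", "age_months", "state")
)
```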

A primary function of data cleaning is to verify all data is in the expected format. Spark provides a built-in ability to validate datasets with schemas. You may have used schemas before with databases or XML; Spark is similar. A schema defines and validates the number and types of columns for a given DataFrame.

A schema can contain many different types of fields - integers, floats, dates, strings, and even arrays or mapping structures.
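
For instance, a schema mixing several of those field types could look like this sketch (the field names are invented for illustration):

```python
from pyspark.sql.types import (
    StructType, StructField, IntegerType, FloatType,
    DateType, StringType, ArrayType, MapType
)

mixed_schema = StructType([
    StructField("id", IntegerType(), False),             # integer
    StructField("score", FloatType(), True),             # float
    StructField("signup_date", DateType(), True),        # date
    StructField("username", StringType(), True),         # string
    StructField("tags", ArrayType(StringType()), True),  # array of strings
    StructField("attributes",                            # map of string -> string
                MapType(StringType(), StringType()), True),
])
```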

A defined schema allows Spark to filter out data that doesn't conform during the read, so the resulting DataFrame contains only the data you expect.

In addition, schemas also have performance benefits. Normally a data import will try to infer a schema on read - this requires reading the data twice. Defining a schema limits this to a single read operation.
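
As a rough sketch of the difference, assuming an existing SparkSession named `spark` and a hypothetical `events.csv` file; `mode="DROPMALFORMED"` is one way to drop rows that don't conform, as mentioned above:

```python
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

event_schema = StructType([
    StructField("event_id", StringType(), True),
    StructField("event_time", TimestampType(), True),
])

# Inferring the schema: Spark makes an extra pass over the file to guess types
df_inferred = spark.read.csv("events.csv", header=True, inferSchema=True)

# Supplying the schema: a single read, and rows that don't conform are dropped
df_defined = spark.read.csv(
    "events.csv", header=True, schema=event_schema, mode="DROPMALFORMED"
)
```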

Here is an example schema for importing the data from our previous example.
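
The exact schema isn't shown here, so the following is a sketch assuming the columns are named `name`, `age`, and `city`, with a hypothetical file path:

```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

people_schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("city", StringType(), True),
])

# Read the raw data with the schema applied
people_df = spark.read.csv("people.csv", header=True, schema=people_schema)
```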

We've gone over a lot of information regarding data cleaning and the importance of DataFrame schemas. Let's put that information to use and practice!

#DataCamp #PySparkTutorial #CleaningDatawithPySpark