Apache Spark End-To-End Data Engineering Project | Apple Data Analysis

Dive into the world of big data processing with our PySpark Practice playlist. This series is designed for both beginners and seasoned data professionals looking to sharpen their Apache Spark skills through scenario-based questions and challenges.

Each video provides step-by-step solutions to real-world problems, helping you master PySpark techniques and improve your data-handling capabilities. Whether you're preparing for a job interview or just learning more about Spark, this playlist is your go-to resource for practical, hands-on learning. Join us to become a PySpark expert!

In this video, we used Databricks to build multiple ETL pipelines using the Python API of Apache Spark, i.e., PySpark.

We used sources like CSV, Parquet, and Delta tables, then used the Factory Pattern to create the reader class. The Factory Pattern is one of the most commonly used low-level design patterns in data engineering pipelines that involve multiple sources.
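
As a rough illustration of that reader, here is a minimal factory sketch; the class and function names, the sample path, and the transactions_df name are our own assumptions, not necessarily the exact code from the video.

from pyspark.sql import SparkSession, DataFrame

spark = SparkSession.builder.appName("apple-analysis").getOrCreate()

class DataSource:
    """Abstract reader; every concrete source returns a DataFrame."""
    def __init__(self, path: str):
        self.path = path

    def get_dataframe(self) -> DataFrame:
        raise NotImplementedError("Implemented by concrete data sources")

class CSVDataSource(DataSource):
    def get_dataframe(self) -> DataFrame:
        return spark.read.format("csv").option("header", "true").load(self.path)

class ParquetDataSource(DataSource):
    def get_dataframe(self) -> DataFrame:
        return spark.read.format("parquet").load(self.path)

class DeltaDataSource(DataSource):
    def get_dataframe(self) -> DataFrame:
        # For Delta, "path" holds a table name rather than a file path
        return spark.read.table(self.path)

def get_data_source(data_type: str, path: str) -> DataSource:
    """Factory: the pipeline asks for a source type and never touches the
    concrete classes, so adding a new source means adding one class here."""
    sources = {"csv": CSVDataSource, "parquet": ParquetDataSource, "delta": DeltaDataSource}
    if data_type not in sources:
        raise ValueError(f"Unsupported data source type: {data_type}")
    return sources[data_type](path)

# Usage: swap "csv" for "parquet" or "delta" without changing pipeline code.
transactions_df = get_data_source("csv", "dbfs:/FileStore/tables/transactions.csv").get_dataframe()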

Then we used the PySpark DataFrame API and Spark SQL to write the business transformation logic. In the loader part, we loaded the data in two ways: once into a Data Lake and once into a Data Lakehouse.
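
A minimal sketch of those two load styles (the transformed_df name, output path, table name, and partition column are illustrative assumptions):

# Data Lake style: plain Parquet files on object storage (DBFS here),
# partitioned on disk so downstream reads can prune by category.
transformed_df.write.mode("overwrite").partitionBy("category").parquet(
    "dbfs:/FileStore/output/first_pipeline"
)

# Data Lakehouse style: a managed Delta table, which layers ACID
# transactions, schema enforcement, and time travel over the same storage.
transformed_df.write.format("delta").mode("overwrite").saveAsTable(
    "default.first_pipeline_output"
)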

While solving the problems, we also demonstrate the most frequently asked PySpark #interview problems. We discuss and demonstrate many concepts such as broadcast joins, partitioning and bucketing, SparkSession, window functions like LAG and LEAD, Delta tables, and more.
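
As a taste of the LAG and LEAD window functions, here is a hedged sketch against the hypothetical transactions_df from the reader example (the column names and the business rule are our assumptions):

from pyspark.sql import Window
from pyspark.sql.functions import col, lag, lead

# Order each customer's purchases by time, then look one row back/ahead.
window_spec = Window.partitionBy("customer_id").orderBy("transaction_date")

with_neighbours = (
    transactions_df
    .withColumn("prev_product", lag("product_name").over(window_spec))
    .withColumn("next_product", lead("product_name").over(window_spec))
)

# Example rule: customers whose very next purchase after an iPhone was AirPods.
airpods_after_iphone = with_neighbours.filter(
    (col("product_name") == "iPhone") & (col("next_product") == "AirPods")
)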

After watching, please let us know your thoughts.

Stay tuned to this playlist for all upcoming videos.

𝗝𝗼𝗶𝗻 𝗺𝗲 𝗼𝗻 𝗦𝗼𝗰𝗶𝗮𝗹 𝗠𝗲𝗱𝗶𝗮:

Databricks notebook links: download the zip folder, extract it, and then open the HTML files as notebooks in the Databricks Community Edition.

🔅 Recommended link for Databricks Community Edition login, after signing up:

🔅 Ankur's Notebook source files

🔅 Input table files

For practising different Data Engineering interview questions, go to the community section of our YouTube page.

Narrow vs Wide Transformation

Short Article link:
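
For a quick refresher before the article: a narrow transformation computes each output partition from exactly one input partition, while a wide transformation needs data from many partitions and therefore triggers a shuffle. A small sketch against the hypothetical transactions_df from earlier:

from pyspark.sql.functions import col

# Narrow: filter touches each partition independently, no shuffle.
accessories_df = transactions_df.filter(col("category") == "Accessories")

# Wide: groupBy must gather each customer's rows from every partition,
# so Spark shuffles data across the cluster.
purchases_per_customer = transactions_df.groupBy("customer_id").count()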

Question 1:

Question 2:

Question 3:

Question 4:

Question 5:

Question 6:

Question 7:

Question 8:

Question 9:

Question 10:

Broadcast Join in #apachespark

Small article link:
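
In short, a broadcast join ships a small table to every executor so the large table can be joined locally, without a shuffle. A hedged sketch, assuming a large transactions_df and a small customers_df sharing a customer_id column:

from pyspark.sql.functions import broadcast

# Replicate the small dimension table to every executor; the large fact
# table then joins node-locally instead of shuffling across the cluster.
joined_df = transactions_df.join(broadcast(customers_df), on="customer_id", how="inner")

Spark also broadcasts automatically when it estimates a side is smaller than spark.sql.autoBroadcastJoinThreshold (10 MB by default).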

MCQs list


Check the COMMUNITY section for a full list of questions.

Chapters
00:00 - Project Introduction
12:04 - How to use Databricks for any PySpark/Spark project?
25:09 - Low-Level Design Code
40:39 - Jobs, Stages, and Actions in Spark
45:22 - Designing a code base for the Spark project
51:40 - Applying the first business logic in the transformer class
57:34 - Difference between the LAG & LEAD window functions
01:28:42 - Broadcast Join in Apache Spark/PySpark
01:47:50 - Difference between Partitioning and Bucketing in Apache Spark/PySpark
02:07:00 - Detailed summary of the first pipeline
02:14:00 - Second pipeline goal
02:24:57 - collect_set() and collect_list() in Spark/PySpark (see the sketch after this list)
02:48:53 - Detailed summary of the second pipeline
02:51:03 - Why Delta Lake when we already have a Data Lake?
02:54:51 - Summary
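
For the collect_set() and collect_list() chapter, an illustrative aggregation over the same hypothetical transactions_df:

from pyspark.sql.functions import collect_list, collect_set

# collect_list keeps duplicates, collect_set de-duplicates; neither
# guarantees element order after the shuffle introduced by groupBy.
products_per_customer = transactions_df.groupBy("customer_id").agg(
    collect_list("product_name").alias("all_products"),
    collect_set("product_name").alias("distinct_products"),
)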

#databricks #delta #pyspark #practice #dataengineering #apachespark #problemsolving
#spark #bigdata #interviewquestions #sql #datascience #dataanalytics
Comments

Good project learning experience, Ankur. It took me around 10 hours to debug and write the code even after watching you step by step. Nice way to explain complex logic.

LalitSharma-uphl

Please find the link to all the input files.


Please let me know if you can access it or not.

TheBigDataShow

Really amazing end-to-end DE project, learned a lot in these 3 hours

SaivarunNamburi

Thank you for the time and patience it took to prepare this video. This will definitely help many.

shafimahmed

This is a great demonstration; appreciate the team's effort in putting together an awesome end-to-end project.

shouviksharma

Thank you for doing this project, it is quite an enriching learning experience. I would love to see more videos of this kind in the future. Keep up the great work!

RupeshPatel-ryjb

Appreciate your great effort in sharing your knowledge, brother! 👍

mohinraffik

I was searching for something like this for a long time. Thank you for putting this together. Already learning a lot from you. I would love to connect with you.

pradeepbehera

Excited to learn and implement in real time. Thanks #The_Big_Data_show

anshusharaf

This channel is simply amazing 😍 Keep coming up with great Data Engineering content like this.

PraveenKumarBN

Just completed this project after a lot of debugging. Got to learn about the factory design pattern.
Is this pattern typically used in production environments? Thank you, Ankur, for creating such a quality project!

ashwinraje

Good learning experience.
Can you please make a video on unit testing for this project? It would be really helpful.

prabhatsingh

Just completed this amazing project 😍
Can I add this to my portfolio?

footballalchemist

Great explanation, but I have a small concern about the datasets containing so little data.

manibaddireddy

Appreciate your efforts, keep it up ❤

swapnilbop

Thanks for your effort, but since this is a big data project, shouldn't you use a large file to show the Spark techniques you're using?

SAURABHKUMAR-ukgg

Thanks for these videos.
But I think in real life we would be processing a very large amount of data, so it would be great if you could make a video on processing large amounts of data with all the optimisation techniques we can use.
Thanks in advance.

Amarjeet-fblk

Will I be able to switch into data engineering after watching and practicing this project? Will I be able to tell my interviewer that I did this project at my current company?

dante

Hi Ankur, very excited to go through the video. Also, are you planning to implement this on AWS as well? That would be helpful.

AshiChaudhary-lctk

Getting an exception while loading to DBFS:

'NoneType' object has no attribute 'write'

Any suggestions?

Thanks

gowthamm.s