Apache Spark End-To-End Data Engineering Project | Apple Data Analysis

Dive into the world of big data processing with our PySpark Practice playlist. This series is designed for both beginners and seasoned data professionals looking to sharpen their Apache Spark skills through scenario-based questions and challenges.

Each video provides step-by-step solutions to real-world problems, helping you master PySpark techniques and improve your data-handling capabilities. Whether you're preparing for a job interview or just learning more about Spark, this playlist is your go-to resource for practical, hands-on learning. Join us to become a PySpark expert!

In this video, we used Databricks to build multiple ETL pipelines using the Python API of Apache Spark, i.e., PySpark.

We used sources like CSV, Parquet, and Delta tables, then used the Factory Pattern to create the reader class. The Factory Pattern is one of the most commonly used low-level design patterns in data engineering pipelines that involve multiple sources.
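
As a rough illustration of that reader, here is a minimal factory sketch; the class and function names, the sample path, and the transactions_df name are our own assumptions, not necessarily the exact code from the video.

from pyspark.sql import SparkSession, DataFrame

spark = SparkSession.builder.appName("apple-analysis").getOrCreate()

class DataSource:
    """Abstract reader; every concrete source returns a DataFrame."""
    def __init__(self, path: str):
        self.path = path

    def get_dataframe(self) -> DataFrame:
        raise NotImplementedError("Implemented by concrete data sources")

class CSVDataSource(DataSource):
    def get_dataframe(self) -> DataFrame:
        return spark.read.format("csv").option("header", "true").load(self.path)

class ParquetDataSource(DataSource):
    def get_dataframe(self) -> DataFrame:
        return spark.read.format("parquet").load(self.path)

class DeltaDataSource(DataSource):
    def get_dataframe(self) -> DataFrame:
        # For Delta, "path" holds a table name rather than a file path
        return spark.read.table(self.path)

def get_data_source(data_type: str, path: str) -> DataSource:
    """Factory: the pipeline asks for a source type and never touches the
    concrete classes, so adding a new source means adding one class here."""
    sources = {"csv": CSVDataSource, "parquet": ParquetDataSource, "delta": DeltaDataSource}
    if data_type not in sources:
        raise ValueError(f"Unsupported data source type: {data_type}")
    return sources[data_type](path)

# Usage: swap "csv" for "parquet" or "delta" without changing pipeline code.
transactions_df = get_data_source("csv", "dbfs:/FileStore/tables/transactions.csv").get_dataframe()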

Then we used the PySpark DataFrame API and Spark SQL to write the business transformation logic. In the loader part, we loaded the data in two ways: once into a Data Lake and once into a Data Lakehouse.
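
A minimal sketch of those two load styles (the transformed_df name, output path, table name, and partition column are illustrative assumptions):

# Data Lake style: plain Parquet files on object storage (DBFS here),
# partitioned on disk so downstream reads can prune by category.
transformed_df.write.mode("overwrite").partitionBy("category").parquet(
    "dbfs:/FileStore/output/first_pipeline"
)

# Data Lakehouse style: a managed Delta table, which layers ACID
# transactions, schema enforcement, and time travel over the same storage.
transformed_df.write.format("delta").mode("overwrite").saveAsTable(
    "default.first_pipeline_output"
)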

While solving the problems, we also demonstrate the most frequently asked PySpark #interview problems. We discuss and demonstrate many concepts such as broadcast joins, partitioning and bucketing, SparkSession, window functions like LAG and LEAD, Delta tables, and more.
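
As a taste of the LAG and LEAD window functions, here is a hedged sketch against the hypothetical transactions_df from the reader example (the column names and the business rule are our assumptions):

from pyspark.sql import Window
from pyspark.sql.functions import col, lag, lead

# Order each customer's purchases by time, then look one row back/ahead.
window_spec = Window.partitionBy("customer_id").orderBy("transaction_date")

with_neighbours = (
    transactions_df
    .withColumn("prev_product", lag("product_name").over(window_spec))
    .withColumn("next_product", lead("product_name").over(window_spec))
)

# Example rule: customers whose very next purchase after an iPhone was AirPods.
airpods_after_iphone = with_neighbours.filter(
    (col("product_name") == "iPhone") & (col("next_product") == "AirPods")
)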

After watching, please let us know your thoughts.

Stay tuned to this playlist for all upcoming videos.

𝗝𝗼𝗶𝗻 𝗺𝗲 𝗼𝗻 𝗦𝗼𝗰𝗶𝗮𝗹 𝗠𝗲𝗱𝗶𝗮:

Databricks notebook links: download the zip folder, extract it, and then open the HTML files as notebooks in the Databricks Community Edition.

🔅 Recommended link for Databricks Community Edition login, after signing up:

🔅 Ankur's Notebook source files

🔅 Input table files

For practising different Data Engineering interview questions, go to the community section of our YouTube page.

Narrow vs Wide Transformation

Short Article link:
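
For a quick refresher before the article: a narrow transformation computes each output partition from exactly one input partition, while a wide transformation needs data from many partitions and therefore triggers a shuffle. A small sketch against the hypothetical transactions_df from earlier:

from pyspark.sql.functions import col

# Narrow: filter touches each partition independently, no shuffle.
accessories_df = transactions_df.filter(col("category") == "Accessories")

# Wide: groupBy must gather each customer's rows from every partition,
# so Spark shuffles data across the cluster.
purchases_per_customer = transactions_df.groupBy("customer_id").count()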

Question 1:

Question 2:

Question 3:

Question 4:

Question 5:

Question 6:

Question 7:

Question 8:

Question 9:

Question 10:

Broadcast Join in #apachespark

Small article link:
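
In short, a broadcast join ships a small table to every executor so the large table can be joined locally, without a shuffle. A hedged sketch, assuming a large transactions_df and a small customers_df sharing a customer_id column:

from pyspark.sql.functions import broadcast

# Replicate the small dimension table to every executor; the large fact
# table then joins node-locally instead of shuffling across the cluster.
joined_df = transactions_df.join(broadcast(customers_df), on="customer_id", how="inner")

Spark also broadcasts automatically when it estimates a side is smaller than spark.sql.autoBroadcastJoinThreshold (10 MB by default).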

MCQs list


Check the COMMUNITY section for a full list of questions.

Chapters
00:00 - Project Introduction
12:04 - How to use Databricks for any PySpark/Spark project?
25:09 - Low-Level Design Code
40:39 - Jobs, Stages, and Actions in Spark
45:22 - Designing a code base for the Spark project
51:40 - Applying the first business logic in the transformer class
57:34 - Difference between the LAG & LEAD window functions
01:28:42 - Broadcast Join in Apache Spark/PySpark
01:47:50 - Difference between Partitioning and Bucketing in Apache Spark/PySpark
02:07:00 - Detailed summary of the first pipeline
02:14:00 - Second pipeline goal
02:24:57 - collect_set() and collect_list() in Spark/PySpark (see the sketch after this list)
02:48:53 - Detailed summary of the second pipeline
02:51:03 - Why Delta Lake when we already have a Data Lake?
02:54:51 - Summary
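
For the collect_set() and collect_list() chapter, an illustrative aggregation over the same hypothetical transactions_df:

from pyspark.sql.functions import collect_list, collect_set

# collect_list keeps duplicates, collect_set de-duplicates; neither
# guarantees element order after the shuffle introduced by groupBy.
products_per_customer = transactions_df.groupBy("customer_id").agg(
    collect_list("product_name").alias("all_products"),
    collect_set("product_name").alias("distinct_products"),
)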

#databricks #delta #pyspark #practice #dataengineering #apachespark #problemsolving
#spark #bigdata #interviewquestions #sql #datascience #dataanalytics
Comments

Good project learning experience, Ankur. It took me around 10 hours to debug and write the code even after watching you step by step. Nice way to explain complex logic.

LalitSharma-uphl

Please find the link to all the input files.


Please let me know if you can access it or not.

TheBigDataShow

Really amazing end-to-end DE project, learned a lot in these 3 hours

SaivarunNamburi

Thank you for the time and patience it took to prepare this video. This will definitely help many.

shafimahmed

This is a great demonstration; appreciate the team's effort in putting together an awesome end-to-end project.

shouviksharma

Thank you for doing this project, it is quite an enriching learning experience. I would love to see more videos of this kind in the future. Keep up the great work!

RupeshPatel-ryjb

Appreciate your great effort in sharing your knowledge, brother! 👍

mohinraffik

I was searching for something like this for a long time. Thank you for putting this together. Already learning a lot from you. I would love to connect with you.

pradeepbehera

Excited to learn and implement in real time. Thanks #The_Big_Data_show

anshusharaf

This channel is simply amazing 😍 Keep coming up with great Data Engineering content like this.

PraveenKumarBN

Just completed this project after a lot of debugging. Got to learn about the factory design pattern.
Is this pattern typically used in production environments? Thank you, Ankur, for creating such a quality project!

ashwinraje

Good learning experience.
Can you please make a video on unit testing for this project? It would be really helpful.

prabhatsingh

Just completed this amazing project 😍
Can I add this to my portfolio?

footballalchemist

Great explanation, but I have a small concern about the datasets containing so little data.

manibaddireddy

Appreciate your efforts, keep it up ❤

swapnilbop

Thanks for your effort, but since this is a big data project, shouldn't you use a large file to show the Spark techniques you're using?

SAURABHKUMAR-ukgg

Thanks for these videos.
But I think in real life we would be processing a very large amount of data, so it would be great if you could make a video on processing large amounts of data with all the optimisation techniques we can use.
Thanks in advance.

Amarjeet-fblk

Will I be able to switch into data engineering after watching and practicing this project? Will I be able to tell my interviewer that I did this project at my current company?

dante

Hi Ankur, very excited to go through the video. Also, are you planning to implement this on AWS as well? That would be helpful.

AshiChaudhary-lctk

Getting an exception while loading to DBFS:

'NoneType' object has no attribute 'write'

Any suggestions?

Thanks

gowthamm.s