Zillow Data Analytics (RapidAPI) | End-To-End Python ETL Pipeline | Data Engineering Project | Part 3

This is part 3 of this Zillow data analytics end-to-end data engineering project, in which I use Apache Airflow as the orchestration tool.

In this data engineering project, we will learn how to build and automate a Python ETL process that extracts real estate property data from the Zillow Rapid API and loads it into an Amazon S3 bucket. The upload triggers a series of AWS Lambda functions that transform the data, convert it into CSV format, and load it into another S3 bucket. Apache Airflow uses an S3KeySensor to check that the transformed data has arrived in that S3 bucket before attempting to load it into Amazon Redshift.
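
To make the Lambda step concrete, here is a minimal sketch of what the final transformation function could look like. This is an illustration under assumed names, not the exact code from the video: the bucket name, the "results" key, and the selected fields are placeholders you would adapt to the payload your Rapid API endpoint actually returns.

import csv
import io
import json

import boto3

s3 = boto3.client("s3")

# Placeholder name -- substitute your own cleaned-data bucket.
CLEANED_BUCKET = "cleaned-data-zone-bucket"

def lambda_handler(event, context):
    # The function is triggered by an S3 PUT event on the raw-data bucket.
    source_bucket = event["Records"][0]["s3"]["bucket"]["name"]
    object_key = event["Records"][0]["s3"]["object"]["key"]

    # Read the raw JSON payload that the extract task uploaded.
    response = s3.get_object(Bucket=source_bucket, Key=object_key)
    data = json.loads(response["Body"].read())

    # Keep a subset of fields per property; adjust to the schema
    # your Rapid API endpoint actually returns.
    rows = [
        {
            "bathrooms": prop.get("bathrooms"),
            "bedrooms": prop.get("bedrooms"),
            "city": prop.get("city"),
            "price": prop.get("price"),
            "livingArea": prop.get("livingArea"),
        }
        for prop in data.get("results", [])
    ]
    if not rows:
        return {"statusCode": 200, "body": "No records found"}

    # Serialize to CSV in memory and write it to the cleaned bucket.
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)

    csv_key = object_key.replace(".json", ".csv")
    s3.put_object(Bucket=CLEANED_BUCKET, Key=csv_key, Body=buffer.getvalue())
    return {"statusCode": 200, "body": f"Wrote {csv_key}"}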

After the data is loaded into Amazon Redshift, we will connect Amazon QuickSight to the Redshift cluster to visualize the Zillow (Rapid API) data.

Apache Airflow is an open-source platform for orchestrating and scheduling workflows of tasks and data pipelines. This project will be carried out entirely on the AWS cloud platform.

In this video I will show you how to install Apache Airflow from scratch and schedule your ETL pipeline. I will also show you how to use a sensor in your ETL pipeline, how to set up an AWS Lambda function from scratch, and how to set up Amazon Redshift and Amazon QuickSight.
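
For orientation, here is a rough sketch of the kind of DAG this pipeline builds, with the sensor gating the Redshift load. All bucket names, connection IDs, table names, and API details below are placeholder assumptions, not the exact values used in the video:

from datetime import datetime
import json

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor
from airflow.providers.amazon.aws.transfers.s3_to_redshift import S3ToRedshiftOperator

# Placeholder credentials -- substitute your own RapidAPI key and host.
RAPIDAPI_HEADERS = {
    "X-RapidAPI-Key": "<your-rapidapi-key>",
    "X-RapidAPI-Host": "<your-zillow-rapidapi-host>",
}

def extract_zillow_data(**context):
    """Pull property listings from the Zillow Rapid API and dump them to a
    local JSON file; a follow-up task (not shown) would copy this file into
    the raw-data S3 bucket, which in turn triggers the Lambda chain."""
    url = "https://<your-zillow-rapidapi-host>/search"
    response = requests.get(url, headers=RAPIDAPI_HEADERS,
                            params={"location": "houston, tx"})
    response.raise_for_status()
    with open("/home/ubuntu/response_data.json", "w") as f:
        json.dump(response.json(), f)

with DAG(
    dag_id="zillow_analytics_dag",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(
        task_id="extract_zillow_data",
        python_callable=extract_zillow_data,
    )

    # Wait until the Lambda chain has written the transformed CSV
    # into the cleaned-data bucket.
    is_file_available = S3KeySensor(
        task_id="is_file_in_s3_available",
        bucket_key="s3://cleaned-data-zone-bucket/zillow_data.csv",
        aws_conn_id="aws_s3_conn",
        poke_interval=5,
        timeout=120,
    )

    # COPY the CSV from S3 into a pre-created Redshift table.
    load_to_redshift = S3ToRedshiftOperator(
        task_id="transfer_s3_to_redshift",
        aws_conn_id="aws_s3_conn",
        redshift_conn_id="conn_id_redshift",
        s3_bucket="cleaned-data-zone-bucket",
        s3_key="zillow_data.csv",
        schema="PUBLIC",
        table="zillowdata",
        copy_options=["csv IGNOREHEADER 1"],
    )

    extract >> is_file_available >> load_to_redshift

The "aws_s3_conn" and "conn_id_redshift" connections are assumed to have been created beforehand in the Airflow UI under Admin > Connections.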

As this is a hands-on project, I highly encourage you to first watch the video in its entirety without typing along, so that you better understand the concepts and the workflow. Then either try to replicate the example I showed without watching, consulting the video only when you get stuck, or watch the video a second time in its entirety while typing along.

Remember, the best way to learn is by doing it yourself. Get your hands dirty!

If you have any questions or comments, please leave them in the comment section below.

Please don’t forget to LIKE, SHARE, COMMENT and SUBSCRIBE to our channel for more AWESOME videos.

**Books I recommend**

***************** Commands used in this video *****************
sudo apt update                               # refresh the package index
sudo apt install python3-pip                  # install the Python package manager
sudo apt install python3.10-venv              # install virtual environment support
python3 -m venv endtoendyoutube_venv          # create an isolated environment
source endtoendyoutube_venv/bin/activate      # activate it
pip install --upgrade awscli                  # AWS command-line tools
pip install apache-airflow                    # Airflow itself (sudo is not needed inside the venv)
airflow standalone                            # run Airflow in all-in-one development mode
pip install apache-airflow-providers-amazon   # AWS operators, sensors, and transfers

***************** USEFUL LINKS *****************

DISCLAIMER: This video and description contain affiliate links, which means that if you buy through one of these links, we receive a small commission at no cost to you. This helps support us in continuing to make awesome and valuable content for you.
Comments

I successfully completed this project. Quite enjoyable.

donatus.enebuse

Learnt a lot of things, completed the project successfully, and it was well taught. Thanks for the content.

pranayshah

Extremely helpful project. Such a slow-paced and awesome explanation that anybody can understand. Thanks!!

maverick

I successfully completed this project
Thank you

manojkumaar

A great project. Thanks a lot for your work!

HaiDo-bd

Such a nice and fun project! 🎉 Thanks!

collinsm

Really great content! I just have one question: at 26:40, where you checked the inbound rules of the VPC security group for the Redshift cluster, is it standard practice to allow all inbound traffic from all IPv4 addresses (Type = All traffic, Source = 0.0.0.0/0)? AWS kept showing a warning to limit inbound traffic to known IPs, so I tried setting the inbound rule to My IP or to the public IP address of the EC2 instance I'm using to run Airflow, but for some reason step 4 failed to work when I did so (it worked fine when I followed the settings in your video). Hope to hear your thoughts on this.

tuananhdo

@tuplespectra My S3ToRedshiftOperator task is not changing status in Airflow. I have written the correct code, added and updated all the roles, policies, and permissions, added the Redshift connection in Airflow, and restarted the server as well.

What am I missing? TIA

sarthakrana

Really awesome tutorial! I was wondering: at the beginning, when you initialized Airflow, it said it is only for development purposes. If I wanted to use Airflow in a production setting, should I use MWAA instead? Thanks again!

mrcoolguy

Why didn't you just transform the data directly after storing it in the first bucket? Why store the same raw data in two buckets?

Edbwalz

We had to add a rule to the inbound rules manually for All traffic, IPv4, 0.0.0.0/0.

manojkumaar

Happy! Are there any job openings related to this?

sudarshanp

I was wondering: after completing this project, if I want to add it to my portfolio as work I have done, do I need to keep the EC2 instance and the other AWS services running? If not, will potential employers still be able to view the dashboard?

oh_willz

Hi, are you working with browse job in Bangalore? The guy who is teaching there told us you will be joining their team to teach us. Is this true?

vasanthkumar