Redfin Analytics | Python ETL Pipeline with Airflow | Data Engineering Project | Snowpipe | Snowflake | Part 1

This is Part 1 of the Redfin Real Estate Data Analytics Python ETL data engineering project using Apache Airflow, Snowpipe, Snowflake, and AWS services.
In this project, you will learn how to connect to the Redfin Data Center data source to extract real estate data using Python, after which we will transform the data using pandas and load it into an Amazon S3 bucket. The raw data will also be loaded into its own Amazon S3 bucket.
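As a rough sketch, the extract-and-transform steps could look something like the following in Python. The Redfin URL, column choices, and bucket names below are illustrative assumptions, not necessarily the exact ones used in the video:

```python
# Hypothetical sketch of the extract/transform steps, assuming the pandas
# and s3fs packages are installed and AWS credentials are configured.
# The URL, columns, and bucket names are illustrative placeholders.
import pandas as pd

# The Redfin Data Center publishes gzipped TSV extracts; adjust as needed.
REDFIN_URL = ("https://redfin-public-data.s3-us-west-2.amazonaws.com/"
              "redfin_market_tracker/city_market_tracker.tsv000.gz")

def extract_data():
    # pandas can read a gzipped TSV directly from the URL
    df = pd.read_csv(REDFIN_URL, sep="\t", compression="gzip")
    # land the untouched raw data in its own S3 bucket (needs s3fs)
    df.to_csv("s3://my-redfin-raw-bucket/redfin_raw.csv", index=False)

def transform_data():
    df = pd.read_csv("s3://my-redfin-raw-bucket/redfin_raw.csv")
    # illustrative transformations: keep a few columns, drop incomplete rows
    cols = ["period_begin", "period_end", "city", "state", "property_type",
            "median_sale_price", "homes_sold"]
    df = df[cols].dropna()
    # write the transformed file where Snowpipe watches for new objects
    df.to_csv("s3://my-redfin-transformed-bucket/redfin_transformed.csv",
              index=False)
```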
As soon as the transformed data lands in the S3 bucket, Snowpipe is triggered and automatically runs a COPY command to load the transformed data into a Snowflake data warehouse table. We then connect Power BI to the Snowflake data warehouse to visualize the data and obtain insights.
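On the Snowflake side, the table, external stage, and pipe would be set up roughly as follows. This is a one-time sketch run through the snowflake-connector-python package; all object names, columns, and credentials are placeholders:

```python
# Hypothetical Snowflake/Snowpipe setup sketch; object names, columns,
# and credentials are placeholders, not the exact ones from the video.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    warehouse="COMPUTE_WH", database="REDFIN_DB", schema="PUBLIC",
)

statements = [
    # target table for the transformed data (columns are illustrative)
    """CREATE TABLE IF NOT EXISTS redfin_table (
           period_begin DATE, period_end DATE, city STRING, state STRING,
           property_type STRING, median_sale_price NUMBER, homes_sold NUMBER)""",
    # external stage pointing at the transformed-data bucket
    """CREATE STAGE IF NOT EXISTS redfin_stage
           URL = 's3://my-redfin-transformed-bucket/'
           CREDENTIALS = (AWS_KEY_ID='...' AWS_SECRET_KEY='...')""",
    # AUTO_INGEST = TRUE lets S3 event notifications trigger the COPY
    """CREATE PIPE IF NOT EXISTS redfin_pipe AUTO_INGEST = TRUE AS
           COPY INTO redfin_table
           FROM @redfin_stage
           FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)""",
]
for stmt in statements:
    conn.cursor().execute(stmt)
```

After the pipe is created, the S3 bucket needs an event notification pointing at the pipe's SQS notification channel (shown by SHOW PIPES) so that each new file automatically triggers the COPY.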
Apache Airflow is used to orchestrate and automate this process.
Apache Airflow is an open-source platform for orchestrating and scheduling workflows of tasks and data pipelines. We will install Apache Airflow on an EC2 instance to orchestrate the pipeline.
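For orchestration, a minimal Airflow 2.x DAG tying the steps together might look like this. The dag_id, schedule, module name, and the extract_data/transform_data callables from the earlier sketch are assumptions:

```python
# Hypothetical DAG sketch; extract_data and transform_data refer to the
# illustrative functions above, imported from a placeholder local module.
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator
from etl import extract_data, transform_data  # placeholder module name

default_args = {
    "owner": "airflow",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="redfin_analytics_dag",
    default_args=default_args,
    start_date=datetime(2024, 1, 1),
    schedule_interval="@monthly",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_redfin_data",
                             python_callable=extract_data)
    transform = PythonOperator(task_id="transform_redfin_data",
                               python_callable=transform_data)
    # Snowpipe takes over once the transformed file lands in S3
    extract >> transform
```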
Remember, the best way to learn data engineering is by doing data engineering. Get your hands dirty!
If you have any questions or comments, please leave them in the comment section below.
Please don’t forget to LIKE, SHARE, COMMENT and SUBSCRIBE to our channel for more AWESOME videos.

**Books I recommend**

***************** Commands used in this video *****************

***************** USEFUL LINKS *****************

DISCLAIMER: This video and description have affiliate links. This means when you buy through one of these links, we will receive a small commission, at no cost to you. This helps support us to continue making awesome and valuable content for you.
Comments

This is another exciting lesson from Dr Yemi. Thanks sir. Much respect.

donatus.enebuse

God bless you and your beautiful soul for taking the time and effort to make this and make it available for free 🙏

stevenlomon

Hello, you are the best data engineering instructor here on YouTube. I want to continue with your end-to-end project, but unfortunately I am having a problem initiating Airflow.

scheduler | [2024-05-05 04:41:04 +0000] [3014] [INFO] Booting worker with pid: 3014
scheduler | {settings.py:60} INFO - Configured default timezone UTC
scheduler | [2024-05-05 04:41:04 +0000] [3016] [INFO] Booting worker with pid: 3016
scheduler | {manager.py:393} WARNING - Because we cannot use more than 1 thread (parsing_processes = 2) when using sqlite. So we set parallelism to 1.

This is the error that came from my terminal. Hope you can assist me, sir. Thank you and more power

backgrounding

You are the best instructor out there on YouTube, thank you for explaining everything. I would like you to do another data engineering project with Spark and EMR. Also, maybe do some videos on data analysis and data processing. Finally, please keep going and let us know how we can support you to continue on this amazing path!!!

Nari_Nizar

Very nice, detailed explanation, which is very helpful for understanding the topic clearly... Thank you so much, please keep posting videos, sir

madhusudanpatil

This is a great project; you are doing a great job. I am waiting for this same project with EMR!!!

assieneolivier

Great video, can't wait for part 2

koladearisekola

Why does it take so long after I run "airflow standalone"?

효캉

very very very helpful!!! thank you so much!!

quishzhu

Thank you so much for this video. I use a Mac and I wanted to know if I could select macOS instead of Ubuntu, and why Ubuntu is better.

deborahjohnson

Great content! I will definitely follow along! Quick question: how much will the EC2 t2.xlarge instance cost for this whole project?

shumengshi

How did you make the connection with Airflow? Is it HTTP?

dipankarmodak

You don't have to wait 5 minutes for DAGs to be reloaded. Run this in a shell: airflow dags reserialize
No need to stop Airflow; just open another shell and activate the venv.

joealtona

Hi there, first of all I just want to appreciate you for helping out in the self-learning journey of people like us. I closely follow your projects and implement them whenever I can.
For this tutorial as well, I followed everything as you instructed, but the final step of loading data to S3 failed for some reason. My raw data bucket was empty but the transformed data of 867MB was successfully saved. Can you point me in the right direction here? Also, I'm not using a root account and my user has admin-level access.
~Thank you for your time :)

ImBatmanYT_CODM

Thank you so much for the video. I am getting the error "ModuleNotFoundError: No module named 'boto3'" even though I installed boto3, following everything as you instructed. Please share your thoughts on this.

vaibhavverma

Tip: do not use Ctrl+C to copy the Airflow password, as this will also cause the server to shut down 😂

jerbear