How to build and automate your Python ETL pipeline with Airflow | Data pipeline | Python

In this video, we will cover how to automate your Python ETL (Extract, Transform, Load) pipeline with Apache Airflow. In this session, we will use the TaskFlow API introduced in Airflow 2.0. The TaskFlow API makes it much easier to author clean ETL code without extra boilerplate by using the @task decorator. Airflow organizes your workflows as Directed Acyclic Graphs (DAGs) composed of tasks.
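As a point of reference, here is a minimal TaskFlow-style sketch; the schedule, dates, and task bodies are placeholders rather than the exact DAG built in the video:

from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule_interval="@daily", start_date=datetime(2021, 1, 1), catchup=False)
def product_etl():
    @task()
    def extract():
        # pull rows from the source; placeholder payload here
        return [{"product_id": 1, "name": "Road Bike"}]

    @task()
    def transform(rows):
        # reshape/clean the extracted rows
        return [{**row, "name": row["name"].upper()} for row in rows]

    @task()
    def load(rows):
        # write the transformed rows to the target
        print(f"loading {len(rows)} rows")

    load(transform(extract()))

product_etl()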


In this tutorial, we will see how to design an ETL pipeline with Python. We will use SQL Server's AdventureWorks database as the source and load the data into PostgreSQL with Python. We will focus on the product hierarchy and enhance our initial data pipeline to give you a complete overview of the extract, load, and transform process.
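As a rough sketch of the extract step, assuming an Airflow connection id named "sqlserver" that points at AdventureWorks; the table names are illustrative and should match the schema you actually use:

import pandas as pd
from airflow.providers.microsoft.mssql.hooks.mssql import MsSqlHook

def extract_product_tables():
    # "sqlserver" is a hypothetical Airflow connection id for the source database
    hook = MsSqlHook(mssql_conn_id="sqlserver")
    conn = hook.get_conn()
    # illustrative product-hierarchy tables from AdventureWorksDW
    tables = ["DimProduct", "DimProductSubcategory", "DimProductCategory"]
    frames = {t: pd.read_sql(f"SELECT * FROM {t}", conn) for t in tables}
    conn.close()
    return frames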


#ETL #Python #Airflow

Topics covered in this video:
0:00 - Introduction to Airflow
2:49 - The Setup
3:40 - Script ETL pipeline: Extract
5:52 - Transform
7:39 - Load
8:00 - Define Directed Acyclic Graph (DAG)
9:36 - Airflow UI: DAG enable & run
10:09 - DAG Overview
10:29 - Test ETL Pipeline
Comments

Thank you sir, you helped me understand Airflow. I did the same thing following the same process, but from MySQL (extract-load -> transformation -> load) with the free employees database, and I shared it on my GitHub and LinkedIn, tagging this video.

kelbfuh

I want to thank you for posting this content. It is helpful in many ways.

tevintemu

Very nice videos and blog! Keep up the good work!

franciscmoldovan

simple and godlike understandable 10/10

lasopainstantaneasuprema

This video was a great resource. Thanks for the tutelage and your take on it.

demohub

Hey, this is really helpful. It would be even more insightful if you provided or suggested ways to run this process (along with those described in this recent series of tutorials) in the cloud or in a serverless environment. Thanks!

GiovanniDeCillis

Thank you, I have a question: if the task is scheduled to run daily and new data has been inserted into the source since the last transfer, will only the new data get transferred on the next run, or all of the data again?

saadlechhb

Excellent tutorial, thank you! It would be great if you could split the tasks into several files; I need to learn how to do this.

rjrmatias

What happens if you kill the airflow web server, or localhost? Will the DAG still run on the schedule you specified?

ryanschraeder

Thank you a lot. I'm trying to understand how to create a pipeline. I want to be an expert on this and become a good data engineering professional.

crisybelkis

Great playlist. Your method of building the videos is very practical and lovable.

One question: how do you perform the "paste" line by line in the recording? Is it Ctrl+Y after so many Ctrl+Z operations?

Obtemus

I was able to import airflow.providers.oracle and the Oracle hook. However, when I use OracleHook, it keeps throwing an error saying 'conn_id' not found, even though the connection has been configured fine via the Airflow UI. Do you have any idea what could go wrong?

eunheechoi

Is it better to extract the entire tables into staging and then build the final table with consolidated data? Why?
Wouldn't you recommend querying the data first (extract) and then transforming and loading it into the final table?
I am trying to understand why the first approach is better than the latter.

eunheechoi

Hi, Sir! Thanks for the insightful video! However, I'd like to ask whether we need to place the ETL Python file in a particular folder for it to be recognized as a DAG by Airflow.

hellowillow_YT

Why didn't you transform the data right after extracting it from MSSQL and then load the final data into PostgreSQL?

CriticalThinker

Hi, how can I build an Airflow DAG to move data from MSSQL to GCS? I want to build a data warehouse in BigQuery.

abnormalirfan

Thanks for the video. One question: why didn't you use Airflow's built-in PostgresHook or PostgresOperator instead of SQLAlchemy and Pandas? I think this would simplify the code and make it more consistent with the way the SQL Server connection is established using MsSqlHook.

mrg
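For reference, here is a hedged sketch of what loading with PostgresHook could look like; the connection id and table name are illustrative assumptions, not code from the video:

from airflow.providers.postgres.hooks.postgres import PostgresHook

def load_with_hook(df):
    # df is a pandas DataFrame produced by the transform step
    # "postgres_dw" is a hypothetical connection id configured in the Airflow UI
    hook = PostgresHook(postgres_conn_id="postgres_dw")
    hook.insert_rows(
        table="src_dim_product",                     # illustrative staging table
        rows=df.itertuples(index=False, name=None),  # plain tuples, no index
        target_fields=list(df.columns),
    )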

Could you please provide more details on what you mean by the "fact table of the data warehouse"?

ChaimaHermi-zipq

In load_src_data, can we not use the Postgres connection object (conn) to load the data instead of using create_engine? Otherwise we need to repeat all the same connection details that we already used to create the connection id in Airflow.

ulhasbhagwat
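One hedged option here, assuming a reasonably recent Postgres provider: PostgresHook can build a SQLAlchemy engine from the same Airflow connection id, so the credentials do not have to be repeated for create_engine. Names below are illustrative:

from airflow.providers.postgres.hooks.postgres import PostgresHook

def load_src_data(df):
    # reuse the connection id already defined in Airflow instead of hard-coding a URI
    engine = PostgresHook(postgres_conn_id="postgres_dw").get_sqlalchemy_engine()
    df.to_sql("src_dim_product", engine, if_exists="replace", index=False)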

Question: do you have a video on building a pipeline to move data from Postgres to SQL Server?

shrutibangad