The Realities Of Airflow - The Mistakes New Data Engineers Make Using Apache Airflow

Airflow remains a popular choice when it comes to open-source orchestration tools.

When I surveyed people about a year ago, it was the most popular open-source solution, and to this day my video “Should You Use Airflow” drives a lot of prospect conversations.

Now, I do want to say that there are plenty of organizations using Azure Data Factory and Informatica, and there are plenty of competitors knocking on Airflow's door.

But for now, Airflow is like the PHP of the data world; people can talk poorly about it, but it continues to be heavily relied upon.

Now, as I said, Airflow is often why I get brought into many projects, meaning I have seen many different ways that teams decide to deploy Airflow.

Some scaled, others didn’t.

Thus, I wanted to take a moment and discuss some ways I have seen Airflow deployed in the past and the challenges people faced as they deployed their code.

0:00 - Intro
1:44 - Mistake #1 Putting The DAG Folder In The Same Repo As The Webserver
4:58 - Mistake #2 Not Using All The Features Airflow Offers
8:43 - Mistake #3 Not Thinking About Scale

Looking for an alternative to Airflow? Check out this article!

If you enjoyed this video, check out some of my other top videos.

Top Courses To Become A Data Engineer In 2022

What Is The Modern Data Stack - Intro To Data Infrastructure Part 1

If you would like to learn more about data engineering, then check out Google's GCP certificate.

If you'd like to read up on my updates about the data field, then you can sign up for our newsletter here.

Or check out my blog

And if you want to support the channel, then you can become a paid member of my newsletter

Tags: Data engineering projects, Data engineer project ideas, data project sources, data analytics project sources, data project portfolio

_____________________________________________________________
About me:
I have spent my career focused on all forms of data. I have focused on developing algorithms to detect fraud, reduce patient readmission and redesign insurance provider policy to help reduce the overall cost of healthcare. I have also helped develop analytics for marketing and IT operations in order to optimize limited resources such as employees and budget. I privately consult on data science and engineering problems both solo as well as with a company called Acheron Analytics. I have experience both working hands-on with technical problems as well as helping leadership teams develop strategies to maximize their data.

*I do participate in affiliate programs. If a link has an "*" by it, then I may receive a small portion of the proceeds at no extra cost to you.
Comments

Thanks, awesome video. I scaled by deploying Airflow on Kubernetes and using the Kubernetes executor so that jobs continue to run during deploys. We do have the problem of the DAGs being in the same repo as our image, so running the Kubernetes executor was a happy medium. I plan to move the DAGs to another repo and use a git-sync sidecar container to pull in DAG updates at a scheduled interval.

MarcusJFloyd

This video sets an exceptional benchmark! -- "Value the journey, for it shapes your path towards unprecedented accomplishments."

makedaily

Hey Ben,
Watched loads of your videos, and in one of your older ones (or a comment afterwards) you mentioned Udacity being a good resource for aspiring professionals, but wishing they had a data engineer nanodegree. Now that they do have (a couple of platform-specific) DE nanodegrees, is it something you have looked at? Potentially my company is willing to fund a course for me, as I am hoping to move into the database from help desk support. Wondering if that would be a good course to get into.

Thanks
Olie

rnzqt

...how about looking at more "modern" alternatives to Airflow? Dagster, Prefect etc. What do you think about their deployment?

neuronqro

I have so much trauma from trying to deploy Airflow 3 separate times at 3 different orgs prior to the "Managed Airflow" era (AWS, Astronomer) that I can't even watch this video.

Ultimately, I prefer to work in organizations that are generally smaller and more intimate, with greater ownership of their own orchestration locally, save for when they have data sets that are agreed to be mission critical at the organizational level. Those data sets then move to the "hub", where a data-mesh-like governance system may also take them on in a "hub and spoke" like vibe.

paul_devos

literally been trying to deploy Airflow for the past 3 days

jerbear