Don't Use Apache Airflow

Apache Airflow is touted as the answer to all your data movement and transformation problems, but is it? In this video, I explain what Airflow is, why it is not the answer for most data movement and transformation needs, and suggest some better options.

Join my Patreon Community and Watch this Video without Ads!

Slides

Follow me on Twitter
@BryanCafferky

Follow Me on LinkedIn
Comments

I wasn't using it, but after this video I just changed my mind. I'm gonna schedule some jobs using Airflow next sprint.

wexwexexort

I was recently brought onto a team to convert our ETLs from Apache NiFi over to Airflow, and while your assessment is fine, I think there are a few areas where I would have structured this differently.

1. Airflow is not an ETL tool; you're right in calling it a job scheduler, though it's technically referred to as a task scheduler. In your ETL processes there are really five things you're trying to do:
a. Trigger when an event happens (an email is received, x amount of time has passed, someone put a file in your fileshare or S3 bucket, some notification prompts you to start).
b. Extract your data from one location.
c. Transform your data. This is where the bulk of your coding comes into play.
d. Put your data into its appropriate database or storage.
e. Make sure a-d go off without an issue.

The reason why Airflow is a great ETL tool is that it does A and E by itself really well, and it facilitates B and D. Hooks and sensors are built into Airflow and are fully customizable. If your project is reliant on programs like Glue, then you can do all of this in the AWS suite (or Azure or GCP), but Airflow very cleanly packages up your connection points and your custom ETL and runs that sequence of tasks beautifully. Should you default to Airflow? If your data engineers are already experts, it's fine; if not, then no. Is it the magic tool for ETL? No; watch for AWS and fellow tech giants to come out with something like that in the next 5-10 years. Is it the best task scheduler? Due to support it's miles ahead of its competitors, so yes.
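The a-e breakdown above maps directly onto Airflow primitives. A minimal sketch, assuming Airflow 2.x with the amazon provider installed; the bucket, key, and callables are hypothetical placeholders:

```python
# a: sensor triggers the run; b-d: extract/transform/load tasks;
# e: Airflow itself handles retries, logging, and alerting on failure.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

def extract(): ...      # pull data from the source (placeholder)
def transform(): ...    # bulk of the business logic (placeholder)
def load(): ...         # write to the target store (placeholder)

with DAG("etl_sketch", start_date=datetime(2024, 1, 1),
         schedule="@daily", catchup=False) as dag:
    wait = S3KeySensor(task_id="wait_for_file",          # a: event trigger
                       bucket_name="my-bucket",
                       bucket_key="incoming/data.csv")
    t_extract = PythonOperator(task_id="extract", python_callable=extract)      # b
    t_transform = PythonOperator(task_id="transform", python_callable=transform)  # c
    t_load = PythonOperator(task_id="load", python_callable=load)               # d
    wait >> t_extract >> t_transform >> t_load           # e: ordering + monitoring
```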

Seatek_Ark

I've been using Airflow for a little over a year and your video really confirmed that a lot of the things that have been bugging me about it are not really a me problem.

I really love how powerful it is, but having used it mostly for ETL, I've often found myself overwhelmed by all the coupling and the little "gotchas" in how specifically things have to be set up. It adds a lot of overhead from the get-go and, importantly, means that no matter how well designed the business code is, whenever something breaks or needs to be changed I always need to re-learn all of the Airflow-specific code. I can see why it's a favorite for specialized data teams whose main job is maintaining data pipelines, but not for use cases like mine in which data flow management is just a small part of the job. So there's not really anything wrong with Airflow; it just might be overkill for users like myself.

I'm going to look into some of the ETL tools you mentioned, and one thing I'm very interested in using Airflow for soon is managing 3D rendering pipelines. I think it's going to be fantastic for coordinating render jobs and their individual frames, which are often in the thousands.

DodaGarcia

Dear Bryan, thank you for your informative video! For me personally it is actually great news that Airflow IS NOT a full-fledged ETL tool; this is exactly what I need. I honestly don't see the mentioned limitations (no ETL functionality) as a disadvantage. ETL as a concept is also becoming outdated in the wake of new approaches such as data mesh and service mesh solutions. What is definitely a no-no is the amount of code overhead and the strong coupling. I will definitely look into the suggested tools.

ariocecchettini

Although I agree with most of what was said in this video, I do have some comments that would likely change someone's mind as it pertains to using Airflow in a real-world business scenario. I agree Airflow is not an ETL/ELT tool. I would agree that it is a scheduler. I disagree that code is not reusable; that's one of the reasons providers and operators exist. If you want to use the same set of tasks multiple times inside the current project or across multiple projects, create a custom operator and use it wherever you wish.
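The custom-operator reuse pattern mentioned above can be made concrete. A hedged sketch; the class name, file paths, and the pandas/pyarrow dependency are all illustrative assumptions, not from the video:

```python
# A reusable task packaged as a custom operator: define once,
# import into any DAG in any project.
from airflow.models.baseoperator import BaseOperator

class CsvToParquetOperator(BaseOperator):
    """Convert a CSV file to Parquet (illustrative example)."""

    def __init__(self, src_path: str, dest_path: str, **kwargs):
        super().__init__(**kwargs)
        self.src_path = src_path
        self.dest_path = dest_path

    def execute(self, context):
        # Assumes pandas and pyarrow are installed in the worker env.
        import pandas as pd
        pd.read_csv(self.src_path).to_parquet(self.dest_path)
```

Once published as an internal package, any DAG can instantiate `CsvToParquetOperator(task_id="convert", src_path=..., dest_path=...)` without re-writing the logic.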

If you are running a medium to large business and the company/IT philosophy is to adopt products that have vendor support, then NiFi and Kettle are not going to be for you. There is no one to call for support when your production instance of either of those goes down. With Airflow a business has the ability to go with Astronomer for a fully vendor supported and highly automated solution which doesn't require the heavy lift of setup.

Anyone saying they use AWS Glue and love it has either not used it or is lying to you. Simply put, it's got a long way to go to catch up with most orchestrator-type tools like Azure Data Factory. If you are in a situation where your company has chosen AWS as its cloud provider and Snowflake as its cloud data warehouse, your options are limited for workflow orchestration, which is a major piece of a complete data pipeline strategy. Products like Matillion are great for drag-and-drop functionality but are expensive and have a huge deficiency in deployment pipelines and CI/CD implementation. If you are living in the cloud data space and don't know Python at least at a basic level, there is a good chance you are entry level and will need to learn it at some point, or you are not very effective at putting together data pipelines. One of the most powerful libraries available to someone in the data space is the pandas Python module, and it becomes a very powerful tool in Airflow or any other orchestration engine dealing with data movement.

Just my 2 cents. Again, I don't disagree with what was said; I just think there are way more valid use cases and reasons to use Airflow than insinuated.

bnmeier

I'm a developer on a data analytics team, and I'm now setting up Apache Airflow for my team. They will create DAGs using JupyterLab, and it will be very comfortable.

-MaCkRage-

Airflow is a scheduler, and it doesn't care what code you run. The easiest approach is to pack all your Go/Rust/Python code into Docker containers and scale with that.
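That container-first pattern might look like this with Airflow's Docker provider; the image name and command are hypothetical:

```python
# Airflow only schedules; the container carries the actual Go/Rust/Python code.
from airflow.providers.docker.operators.docker import DockerOperator

run_job = DockerOperator(
    task_id="run_compiled_job",
    image="mycompany/etl-job:1.0",       # hypothetical image
    command="/app/etl --date {{ ds }}",  # binary inside the container
)
```

Because the logic lives in the image, the same job can run locally, in CI, or under any other scheduler without touching Airflow code.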

gudata

As someone coming from SSIS, which I literally hate for being far too much graphical interface, I have to say you did a good job describing the problems with Airflow.

yevgenym

On most points I agree... Airflow is not an ELT tool. It's an orchestrator, in my opinion the best in the world. At the company where I work I built up BI for online activities. I tried a lot of tools; I don't want to mention them all, but they all had a lot of drawbacks and were expensive. I ended up using Airflow and I'm pretty happy with it. Sure, it's all code! That's what you have to keep in mind. Other tools like dbt, Airbyte, and so on integrate perfectly into Airflow, so scheduling and monitoring the entire pipeline is absolutely great. On the other hand, I had to struggle with a lot of data sources where out-of-the-box tools had problems understanding the data. In the end I had to program a middleware in Python to make the data compatible with these tools; now it works inside the Airflow environment. Because Airflow ships a lot of good operators, the code got even smaller. Furthermore, the Docker (Compose) images are great and the Helm charts are good... So yes: it's not a native ELT tool. You have to use code only, but with code only comes a lot of flexibility. I don't want to go back to Kettle, Talend, or SAP Data Services. What looks interesting is NiFi...

janHodle

I've been using Airflow a little over a year now and totally agree with most of your points. I appreciate it for logging, monitoring of pipelines, and the visualizations, plus the good K8s integrations and active community. I would recommend it if most of the code you want to orchestrate is Python or dockerized. It does come with some downsides, like the lack of pipeline version management and the complex setup. There are managed versions though, e.g. Cloud Composer.

tomhas

Love the video. Definitely made me think and gave me some good tools to look into.

A few notes here (I'm an Airflow noob, but I've at least used it...)
1. It doesn't really work on Windows like it says in the screenshot at the beginning - unless you're using Docker or WSL. It only works on Linux.
2. It does not only support Python. As you mention, there's a BashOperator, which means it can run anything using a bash script (python, JavaScript, php script, Java app, C# console app, etc).
3. I think it's a bit disingenuous to say your DAG code could be more than your actual code running - the DAG definitions are insanely simple... your examples are probably about as complex as 70% of jobs (outside of the actual logic).
4. All the alternate solutions you present also have overhead to learn and their own proprietary outputs (that can't be reused anywhere else - except maybe Data Factory, which might be able to port into SSIS on-prem or whatever). A Python script (or whatever script - Powershell, C# app, etc) can run just about anywhere.
5. Instead of putting your Python logic inside the DAG file, you can just use a BashOperator to run the Python script (ie: "python3 path/to/thescript.py"), which means you can decouple and reuse the script part anywhere, and the DAG definition is the only thing specific to Airflow (which is trivial most of the time). This might not work if you have complex dependencies between your scripts; mine were always fairly linear jobs like: move data to cloud, train ML model, run batch model outputs, do something with the outputs, update some API.
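Point 5 above can be sketched in a few lines; the script path is a hypothetical placeholder:

```python
# Thin Airflow wrapper: the script at /opt/jobs/thescript.py is plain
# Python that runs anywhere; only this BashOperator is Airflow-specific.
from airflow.operators.bash import BashOperator

run_script = BashOperator(
    task_id="run_the_script",
    bash_command="python3 /opt/jobs/thescript.py",
)
```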

I'll just say... if you're currently running C# console and Python script jobs on Windows Server Scheduler (which is where I'm coming from, lol!), Airflow is an awesome tool that's super easy to get started with. We didn't end up using it because it was Linux-only and our infra team is scared of Linux (and Docker... and WSL2...).

Jeffsdata_

This is amazing. Rarely is anyone so fair in evaluating a popular tool like Airflow.

lahvoopatel

A main issue with defining a function inside another function is that it's impossible to unit test, and testing is vital for data processing. It looks like all tasks should be written and tested as standalone functions and adapted to Airflow by an additional abstraction layer.
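One way to apply that advice, as a sketch: keep the business logic in a plain, importable function and make the Airflow adapter a separate thin layer. The function name and data shape are illustrative:

```python
# transform.py -- standalone, unit-testable business logic,
# no Airflow imports anywhere near it.
def dedupe_and_total(rows):
    """Drop rows with duplicate 'id', then sum the 'amount' field."""
    seen, total = set(), 0
    for row in rows:
        if row["id"] not in seen:
            seen.add(row["id"])
            total += row["amount"]
    return total

# dag file -- the only Airflow-specific layer (sketch, not executed here):
# PythonOperator(task_id="transform",
#                python_callable=lambda: dedupe_and_total(load_rows()))
```

The function can now be exercised directly in a test suite without standing up a scheduler.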

igoryurchenko

Beautifully explained! I love how you dive into the code without getting lost in the weeds. Very helpful, thank you :)

sanjaybhatikar

Whether I end up using Airflow or not, this is a great video that clearly explains how to use the tool and your perspective. Thank you!

Theoboeguy

I am studying Apache NiFi now; it looks like a good tool for ETL purposes. Thanks for your comments.

shutaozhang

Writing 800 lines of code to schedule a job in Airflow... I totally agree with you. It's a pain in the wrong place.

abhinee

Thank you for the POV. Take a look at dbt too, from Fishtown Analytics. I think version control needs to be a core requirement for any tool that is responsible for moving data, and this might be a problem if the solution isn't code-based.

ben.morris

As far as I know, Airflow is used for "scheduling" ETLs, not "creating" them. So, can you perform both "creating" and "scheduling" operations via AWS Glue?

halildurmaz

The KubernetesPodOperator can be used to run any Docker image with Airflow.
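A minimal sketch of that; it assumes the cncf.kubernetes provider is installed (the import path varies slightly between provider versions), and the image and namespace are hypothetical:

```python
# Run an arbitrary container image as an Airflow task inside a K8s pod.
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

run_in_pod = KubernetesPodOperator(
    task_id="run_container",
    name="any-image-job",
    namespace="default",                 # hypothetical namespace
    image="mycompany/worker:latest",     # hypothetical image
    cmds=["python", "-c"],
    arguments=["print('hello from a pod')"],
)
```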

goutham