Data Lineage with Apache Airflow using OpenLineage | Datakin


ABOUT THE TALK

As workflows increase in complexity, companies have come to depend on Airflow to manage inter-DAG dependencies. Airflow has quickly become an important component of the Modern Data Stack, powering analytical reports, business metrics, and dashboards.

But what effects (if any) would upstream DAGs have on downstream DAGs if dataset consumption were delayed? What alerting rules should be in place to notify downstream DAGs of possible upstream processing issues or failures? How can we use data lineage to achieve the data observability we need to answer these questions?

This talk introduces OpenLineage, an open standard for collecting lineage metadata from running jobs, and shows how it works with Airflow. It walks through a practical example using Marquez, the reference implementation of OpenLineage, and explains how OpenLineage can help data teams maintain inter-DAG dependencies within their Airflow instance, capture metadata on historical DAG runs, and minimize data quality issues.
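
To make the idea of lineage metadata more concrete, here is a minimal sketch (not taken from the talk) of what a run event might look like and how a set of such events could answer "which jobs consume this dataset?". The field names follow a subset of the core fields of an OpenLineage run event as I understand the spec; the job names, dataset names, namespace, and producer URL are hypothetical, and the helper functions `make_run_event` and `downstream_jobs` are illustrative, not part of any OpenLineage or Marquez API.

```python
import json
import uuid
from datetime import datetime, timezone


def make_run_event(event_type, job_name, inputs, outputs, namespace="example"):
    """Build a dict shaped like an OpenLineage run event (subset of fields)."""
    return {
        "eventType": event_type,  # e.g. "START" or "COMPLETE"
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": namespace, "name": job_name},
        "inputs": [{"namespace": namespace, "name": n} for n in inputs],
        "outputs": [{"namespace": namespace, "name": n} for n in outputs],
        # Hypothetical producer URL, used here only as a placeholder.
        "producer": "https://example.com/hypothetical-producer",
    }


def downstream_jobs(events, dataset_name):
    """Return the names of jobs that read a dataset, per the collected events."""
    return sorted({
        e["job"]["name"]
        for e in events
        for d in e["inputs"]
        if d["name"] == dataset_name
    })


# Two hypothetical DAG runs: one writes a table, the other reads it.
events = [
    make_run_event("COMPLETE", "etl_orders", [], ["public.orders"]),
    make_run_event("COMPLETE", "daily_report", ["public.orders"], ["public.report"]),
]

print(json.dumps(events[0], indent=2))
print(downstream_jobs(events, "public.orders"))  # ['daily_report']
```

In a real deployment the Airflow integration would emit events like these to a backend such as Marquez, which stores them and exposes the resulting dependency graph; the in-memory list above only stands in for that store.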

ABOUT THE SPEAKERS

Willy Lulciuc is the Founding Engineer of Datakin. He makes datasets discoverable and meaningful with metadata. He co-created Marquez and is now involved in the OpenLineage initiative. Previously, he worked on the Project Marquez team at WeWork. When he’s not reviewing code and creating indirections, he can be found experimenting with analog synthesizers.

COMMENTS

Data freshness could be considered as part of data quality.

tongweiwang