OpenLineage: An Open Standard for Data Lineage - Ross Turk

A presentation from ApacheCon 2022
Data pipelines may start off simple, organized, and easy to understand, but they never stay that way. When your pipeline consists of hundreds of jobs spread across different teams, it becomes easy to lose track of them all.
If a job run fails, how can you learn about downstream datasets that have become out-of-date? Can you be confident that they are consuming fresh, high-quality data from their upstream tasks? How might you predict the impact of a planned change on distant corners of the pipeline? These questions become easier once you have a complete understanding of data lineage, the complex set of relationships between all of your jobs and datasets.
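As a rough illustration of the first question, finding stale downstream datasets amounts to a graph traversal once lineage is known. The sketch below is not OpenLineage code; the graph contents, the `job:`/`dataset:` naming scheme, and the `downstream_datasets` helper are all hypothetical and exist only to show the idea.

```python
from collections import deque

# Hypothetical lineage graph: each node maps to the nodes that consume it.
# Jobs and datasets alternate along every path: job -> dataset -> job -> ...
LINEAGE = {
    "job:load_orders": ["dataset:raw.orders"],
    "dataset:raw.orders": ["job:clean_orders"],
    "job:clean_orders": ["dataset:staging.orders"],
    "dataset:staging.orders": ["job:daily_revenue", "job:customer_stats"],
    "job:daily_revenue": ["dataset:reports.revenue"],
    "job:customer_stats": ["dataset:reports.customers"],
}


def downstream_datasets(graph, failed_job):
    """Return every dataset reachable from failed_job, sorted by name."""
    seen = set()
    queue = deque([failed_job])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    # Only datasets can be "out of date"; jobs are just the path between them.
    return sorted(n for n in seen if n.startswith("dataset:"))


stale = downstream_datasets(LINEAGE, "job:clean_orders")
print(stale)
```

A breadth-first walk is used here only because it is simple; any traversal that visits every reachable node would give the same set.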
OpenLineage is an open framework for collecting lineage metadata. It integrates with pipeline tools like Apache Airflow and Apache Spark to observe and record data transformations. Using this metadata, you can assemble a lineage graph: a picture of your pipeline worth a thousand words.
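To make the idea of assembling a graph from recorded metadata concrete, here is a minimal sketch. It assumes a heavily simplified record shape (a job name plus input and output dataset names); real OpenLineage events carry more detail than this, and the `events` list and `build_lineage_edges` helper are hypothetical.

```python
# Simplified, hypothetical records: one per observed job run.
events = [
    {"job": "clean_orders", "inputs": ["raw.orders"], "outputs": ["staging.orders"]},
    {"job": "daily_revenue", "inputs": ["staging.orders"], "outputs": ["reports.revenue"]},
]


def build_lineage_edges(events):
    """Turn per-run records into directed edges between datasets and jobs."""
    edges = set()
    for event in events:
        job = "job:" + event["job"]
        for name in event["inputs"]:
            edges.add(("dataset:" + name, job))   # dataset feeds the job
        for name in event["outputs"]:
            edges.add((job, "dataset:" + name))   # job produces the dataset
    return sorted(edges)


edges = build_lineage_edges(events)
for src, dst in edges:
    print(f"{src} -> {dst}")
```

Because the edges are collected into a set, repeated runs of the same job add no duplicate edges, so the graph reflects the shape of the pipeline rather than its run history.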
In this session, you will learn the purpose of data lineage, and its many uses in the modern data stack. You will hear about the basics of OpenLineage, its data model, and how it collects metadata. Finally, you will be introduced to Marquez, an OpenLineage metadata server.