Talk: Daniel Imberman - Bridging Data Science and Data Infrastructure with Apache Airflow

Presented by:
Daniel Imberman

When supporting a data science team, data engineers are tasked with building a platform that keeps a wide range of stakeholders happy. Data scientists want rapid iteration, infrastructure engineers want monitoring and security controls, and product owners want their solutions deployed in time for quarterly reports. Collaboration between these stakeholders can be difficult, as every data science pipeline has a unique set of constraints and system requirements (compute resources, network connectivity, etc.). For these reasons, data engineers strive to give their data scientists as much flexibility as possible while maintaining an observable and resilient infrastructure.

In recent years, Apache Airflow (a Python-based task orchestrator developed at Airbnb) has gained popularity as a collaborative platform between data-centric Pythonistas and infrastructure engineers looking to spare their users from verbose and rigid YAML files. Apache Airflow exposes a flexible, Pythonic interface that serves as a collaboration point between data engineers and data scientists: data engineers can build custom operators that abstract away details of the underlying system, and data scientists can use those operators (and many more) to build a diverse range of data pipelines.
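
As a rough illustration of that pattern, here is a minimal sketch using Airflow 1.x-era imports (contemporaneous with this talk). The operator name, its parameters, and the DAG are hypothetical, for illustration only, not taken from the talk itself:

# Data engineer's side: hide cluster connectivity and resource
# configuration behind a small Pythonic surface.
from datetime import datetime

from airflow import DAG
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class SubmitSparkJobOperator(BaseOperator):
    """Hypothetical custom operator wrapping Spark job submission."""

    @apply_defaults
    def __init__(self, app_path, executor_memory="4g", *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.app_path = app_path
        self.executor_memory = executor_memory

    def execute(self, context):
        # A real implementation would submit the job to a Spark cluster;
        # the data scientist composing the DAG never touches that plumbing.
        self.log.info("Submitting %s with %s executor memory",
                      self.app_path, self.executor_memory)


# Data scientist's side: compose the operator into a pipeline.
with DAG(dag_id="feature_pipeline",
         start_date=datetime(2019, 1, 1),
         schedule_interval="@daily") as dag:
    featurize = SubmitSparkJobOperator(
        task_id="featurize_events",
        app_path="jobs/featurize.py",
        executor_memory="8g",
    )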

In this 30-minute talk, we will take an idea from a single-machine Jupyter Notebook, to a cross-service Spark + TensorFlow pipeline, to a canary-tested, hyperparameter-tuned, production-ready model served on Google Cloud Functions. We will show how Apache Airflow can connect all layers of a data team to deliver rapid results.
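
The abstract names the stages; a hedged outline of what such a DAG's skeleton might look like follows, using placeholder operators only. The task names paraphrase the abstract, and none of this is the speaker's actual code (a real pipeline would use Spark, TensorFlow, and Google Cloud integrations):

# Outline of the pipeline shape described above, with placeholder
# operators (Airflow 1.x-era import).
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

with DAG(dag_id="notebook_to_production",
         start_date=datetime(2019, 1, 1),
         schedule_interval=None) as dag:
    featurize = DummyOperator(task_id="spark_feature_extraction")
    train = DummyOperator(task_id="tensorflow_training")
    tune = DummyOperator(task_id="hyperparameter_tuning")
    canary = DummyOperator(task_id="canary_test")
    deploy = DummyOperator(task_id="deploy_to_cloud_functions")

    # Each stage runs only after the previous one succeeds.
    featurize >> train >> tune >> canary >> deploy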