Designing Functional Data Pipelines for Reproducibility and Maintainability | PyData Global 2021


Designing Functional Data Pipelines for Reproducibility and Maintainability
Speaker: Chin Hwee Ong

Summary
Designing reliable and extensible data pipelines at scale is a challenge, as testing and debugging across compute units are often complex and time-consuming due to dependencies at runtime. In this talk, I will explore how functional programming design patterns in Python/Spark enable us to build production-ready data pipelines that are reproducible and maintainable at scale.

Description
When building data pipelines at scale, it is crucial to design them to be reliable, scalable and extensible as business needs evolve. Designing data pipelines for reproducibility and maintainability is a challenge, because testing and debugging across compute units (threads, cores, compute nodes) is often complex and time-consuming due to dependencies and shared state at runtime. In this talk, I will share common challenges in designing reproducible and maintainable data pipelines at scale, and explore the use of functional programming in Python and Apache Spark to build scalable, production-ready data pipelines. Through analogies and realistic examples inspired by data pipeline designs in production environments, you will learn about:

What is Functional Programming, and how it differs from other programming paradigms
Key Principles of Functional Programming
How "control flow" is implemented in Functional Programming
Functional design patterns for data pipeline design in Python and Apache Spark, and how they improve reproducibility and maintainability
Whether it is possible to write a purely functional program

This talk assumes a basic understanding of building data pipelines with functions and classes/objects. While the main target audience is data scientists, data engineers and developers building data-intensive applications, anyone with hands-on experience in imperative programming (including Python) should be able to understand the key concepts and use cases in functional programming.
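
As a rough illustration of the kind of functional design the topics above refer to, here is a minimal Python sketch that is not taken from the talk itself. It composes small pure functions into a pipeline that leaves its input untouched. The names `compose`, `drop_missing_amount`, `add_amount_with_tax` and the sample records are hypothetical and chosen only for this example.

```python
from functools import reduce
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]
Step = Callable[[List[Record]], List[Record]]


def compose(*steps: Step) -> Step:
    """Chain steps left to right into a single pipeline function."""
    def run(records: List[Record]) -> List[Record]:
        return reduce(lambda acc, step: step(acc), steps, records)
    return run


def drop_missing_amount(records: List[Record]) -> List[Record]:
    """Pure step: return a new list without records lacking an amount."""
    return [r for r in records if r.get("amount") is not None]


def add_amount_with_tax(rate: float) -> Step:
    """Higher-order function: build a step for a given (hypothetical) tax rate."""
    def step(records: List[Record]) -> List[Record]:
        # Build new dicts instead of mutating the input records.
        return [
            {**r, "amount_with_tax": round(r["amount"] * (1 + rate), 2)}
            for r in records
        ]
    return step


pipeline = compose(drop_missing_amount, add_amount_with_tax(0.25))

raw = [{"id": 1, "amount": 100.0}, {"id": 2, "amount": None}]
result = pipeline(raw)
```

Because each step returns new data and depends only on its arguments, running `pipeline(raw)` again gives the same result and `raw` is unchanged, which is what makes each step easy to test in isolation.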

Chin Hwee Ong's Bio
Chin Hwee Ong is a data engineer and aspiring polymath who happens to have a background in aerospace engineering and computational modelling. As a 90% self-taught programmer, Chin Hwee is currently learning Scala for functional programming and has a not-so-secret wish of making data pipelines run faster.

PyData Global 2021

PyData is an educational program of NumFOCUS, a 501(c)(3) non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R.

PyData conferences aim to be accessible and community-driven, with novice to advanced level presentations. PyData tutorials and talks bring attendees the latest project features along with cutting-edge use cases.



Comments

what about large data-sets and Pandas dataframes, etc.? does it play nice with these concepts?
(assuming we're NOT in a context like spark or similar)

Klayhamn
