CI/CD for Data - Building Data Development Environment with lakeFS


Data pipelines rarely stay still. The infrastructure they run on and the logic they use to transform data, to give two examples, are updated almost constantly.

Applying changes to a pipeline efficiently requires running it in parallel with production to test the effect of each change. Most data engineers would agree that how best to do this is far from a solved problem.

Most attempts fall at one of two extremes. Either the pipeline is run against overly simplified, hardcoded sample data, which lets through errors that only appear with production data, or it is run in a maintenance-intensive dev environment that requires duplicating the production data, which also greatly increases the risk of a breach or data privacy violation.

The open source project lakeFS offers a much-needed middle ground for testing data pipelines by making it possible to instantly clone a data environment through a zero-copy cloning operation. This enables a safe, automated development environment for data pipelines that avoids the pitfalls of copying or mocking datasets, or of using production pipelines for testing.

In this session, you will learn how to use lakeFS to quickly set up a development environment and use it to develop and test data products without risking production data.
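
As a rough illustration of the zero-copy cloning idea described above, here is a minimal conceptual sketch in Python. It is not the lakeFS API or its actual implementation: `ToyRepo` and all of its methods are hypothetical names invented for this example. The only point it tries to show is that a branch can be created by copying references to stored objects rather than the objects themselves, so a development branch can diverge without touching what production sees.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRepo:
    """Hypothetical toy model of a versioned object store (not lakeFS itself).

    Data objects are stored once; each branch only maps paths to object IDs.
    """
    objects: dict = field(default_factory=dict)   # object_id -> bytes
    branches: dict = field(default_factory=dict)  # branch -> {path: object_id}
    _next_id: int = 0

    def write(self, branch: str, path: str, data: bytes) -> None:
        # Store the data as a new object and point the branch's path at it.
        object_id = self._next_id
        self._next_id += 1
        self.objects[object_id] = data
        self.branches.setdefault(branch, {})[path] = object_id

    def read(self, branch: str, path: str) -> bytes:
        return self.objects[self.branches[branch][path]]

    def create_branch(self, name: str, source: str) -> None:
        # "Zero-copy": only the path -> object_id references are copied,
        # never the underlying data objects.
        self.branches[name] = dict(self.branches[source])


repo = ToyRepo()
repo.write("main", "events/2024.csv", b"id,amount\n1,10\n")

# Cloning the environment adds no new data objects.
repo.create_branch("dev", source="main")
objects_after_branch = len(repo.objects)

# A change on the dev branch is invisible to main.
repo.write("dev", "events/2024.csv", b"id,amount\n1,10\n2,20\n")
main_data = repo.read("main", "events/2024.csv")
dev_data = repo.read("dev", "events/2024.csv")
```

In this toy model the cost of creating a branch depends on the number of paths, not on the size of the data, which is the property that makes testing against production-sized data practical without duplicating it.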
Comments

Why doesn't this video have more views? Her explanation is so good, and the ideas follow each other so naturally.

voxdiary