CI/CD for Data Lakes, Part 2

Today, data lakes offer many advantages to cloud users. Anyone who needs storage can build one on an object store such as Amazon S3, Azure Blob Storage, or a similar service. Data lakes are scalable, cost-effective, relatively easy to use, and offer high throughput and a rich application ecosystem.

Yet data-intensive systems that combine open-source software with cloud-native services also bring challenges. As data practitioners, we find it hard to experiment with, compare, and reproduce data-intensive operations, because copying large-scale data just to experiment gets expensive. On top of the cost, it is difficult to enforce data best practices such as schema validation, since the schema can change on the fly when ingesting data from outside sources. Finally, it is hard to guarantee high data quality.

To start working on solutions to these problems, we must acknowledge that our systems are made up of both data and code. We already have tools such as Git and CI/CD to manage code, so why not apply the same logic to data? With open-source tools like lakeFS, it is possible to manage data at scale using Git-like capabilities.
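To make the Git-like workflow concrete, here is a minimal sketch using the lakeFS CLI (`lakectl`). The repository, branch, and commit-message names below are hypothetical, and the commands assume a running lakeFS installation with `lakectl` already configured against it:

```shell
# Create an isolated branch to experiment on. No data is copied:
# the branch initially references the same underlying objects.
# (Repository and branch names here are hypothetical.)
lakectl branch create lakefs://example-repo/experiment \
    --source lakefs://example-repo/main

# Ingest or transform data on the experiment branch, then snapshot it.
lakectl commit lakefs://example-repo/experiment \
    -m "Test new ingestion job against production-scale data"

# If validation passes, promote the changes to main; otherwise simply
# delete the branch and the production data is never touched.
lakectl merge lakefs://example-repo/experiment lakefs://example-repo/main
```

Because branching is a metadata operation, experimenting against a full-scale view of the lake costs little, and merging or discarding a branch gives data the same reproducibility and rollback story that Git gives code.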