Building Petabyte Scale ML Models with Python - DevConf.CZ 2022

Speaker: Vaibhav Srivastav

Although building ML models on small/toy datasets is easy, most production-grade problems involve massive datasets that current ML practices don’t scale to. In this talk, we cover how you can drastically increase the amount of data your models can learn from by using distributed data/ML pipelines.

It can be difficult to figure out how to work with large datasets (which do not fit in your RAM), even if you’re already comfortable with ML libraries/APIs within Python. Many questions immediately come up: Which library should I use, and why? What’s the difference between a “map-reduce” and a “task graph”? What’s a partial_fit function, and what format does it expect the data in? Is it okay for my training data to have more features than observations? What’s the appropriate machine learning model to use? And so on…
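The partial_fit question has a concrete answer in scikit-learn: it is the hook that incremental (out-of-core) estimators expose, letting a model be updated one mini-batch of arrays at a time instead of seeing the whole dataset at once. A minimal sketch, assuming scikit-learn is installed and using a hypothetical iter_batches() generator in place of a real chunked data reader:

import numpy as np
from sklearn.linear_model import SGDClassifier

def iter_batches(n_batches=10, batch_size=1000, n_features=20):
    # Hypothetical stand-in for reading chunks of a dataset that does not fit in RAM.
    rng = np.random.default_rng(0)
    for _ in range(n_batches):
        X = rng.normal(size=(batch_size, n_features))
        y = (X[:, 0] > 0).astype(int)
        yield X, y

clf = SGDClassifier()
classes = np.array([0, 1])  # partial_fit needs the full set of class labels up front

for X_batch, y_batch in iter_batches():
    # Each call updates the model with one mini-batch rather than the whole dataset.
    clf.partial_fit(X_batch, y_batch, classes=classes)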

In this talk, we’ll answer all those questions, and more!

We’ll start by walking through the current distributed analytics (out-of-core learning) landscape to understand the pain points and some of the existing solutions.

Here is a sketch of a system designed to achieve this goal (of building scalable ML models); a minimal code sketch follows the list:

1. a way to stream instances
2. a way to extract features from instances
3. an incremental algorithm
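
A minimal sketch of how those three pieces could fit together, using scikit-learn’s HashingVectorizer (stateless feature extraction, so no full pass over the data is needed) and SGDClassifier (an incremental learner); the stream_instances() generator and its synthetic labels are hypothetical placeholders for a real data source:

import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

def stream_instances():
    # 1. A way to stream instances: hypothetical generator yielding one batch of
    #    (texts, labels) at a time, e.g. read from files on disk or object storage.
    rng = np.random.default_rng(0)
    for _ in range(100):
        texts = ["example document %d" % i for i in range(1000)]
        labels = rng.integers(0, 2, size=1000)
        yield texts, labels

# 2. A way to extract features: HashingVectorizer is stateless, so it can
#    transform each batch without ever seeing the full dataset.
vectorizer = HashingVectorizer(n_features=2**18)

# 3. An incremental algorithm: SGDClassifier supports partial_fit.
clf = SGDClassifier()
classes = np.array([0, 1])

for texts, labels in stream_instances():
    X = vectorizer.transform(texts)
    clf.partial_fit(X, labels, classes=classes)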

Detailed Outline

1. Intro to out-of-core learning
2. Representing large datasets as instances
3. Transforming data (in batches) – live code [3-5]
4. Feature Engineering & Scaling
5. Building and evaluating a model (on entire datasets)
6. Practicing this workflow on another dataset
7. Benchmarking other libraries for OOC learning (see the Dask sketch after this outline)
8. Questions and Answers
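
For the out-of-core and benchmarking parts of the outline, Dask is one of the libraries commonly compared: it represents a large dataset as many pandas-like partitions tied together by a task graph, and only materialises results when compute() is called. A hedged sketch, assuming Dask is installed and that the data lives under a hypothetical data/*.parquet path:

import dask.dataframe as dd

# read_parquet builds a lazy task graph over many files; nothing is loaded yet.
df = dd.read_parquet("data/*.parquet")

# Aggregations stay lazy and are executed partition by partition,
# so the full dataset never has to fit in RAM.
means = df.select_dtypes(include="number").mean()

# compute() runs the task graph, on a single machine or across a cluster.
print(means.compute())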

Key takeaway

By the end of the talk, participants will know how to build petabyte-scale ML models beyond the limits of conventional Python libraries.

Participants will also have benchmarks and best practices for building such ML models at scale.
