Building Petabyte Scale ML Models with Python - DevConf.CZ 2022

Speaker: Vaibhav Srivastav

Although building ML models on small/toy datasets is easy, most production-grade problems involve massive datasets that current ML practices don’t scale to. In this talk, we cover how you can drastically increase the amount of data your models can learn from by using distributed data/ML pipelines.

It can be difficult to figure out how to work with large datasets (which do not fit in your RAM), even if you’re already comfortable with ML libraries/APIs within Python. Many questions immediately come up: Which library should I use, and why? What’s the difference between a “map-reduce” and a “task graph”? What’s a partial_fit function, and what format does it expect the data in? Is it okay for my training data to have more features than observations? What’s the appropriate machine learning model to use? And so on…
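The partial_fit question has a concrete answer in scikit-learn: it is the hook that incremental (out-of-core) estimators expose, letting a model be updated one mini-batch of arrays at a time instead of seeing the whole dataset at once. A minimal sketch, assuming scikit-learn is installed and using a hypothetical iter_batches() generator in place of a real chunked data reader:

import numpy as np
from sklearn.linear_model import SGDClassifier

def iter_batches(n_batches=10, batch_size=1000, n_features=20):
    # Hypothetical stand-in for reading chunks of a dataset that does not fit in RAM.
    rng = np.random.default_rng(0)
    for _ in range(n_batches):
        X = rng.normal(size=(batch_size, n_features))
        y = (X[:, 0] > 0).astype(int)
        yield X, y

clf = SGDClassifier()
classes = np.array([0, 1])  # partial_fit needs the full set of class labels up front

for X_batch, y_batch in iter_batches():
    # Each call updates the model with one mini-batch rather than the whole dataset.
    clf.partial_fit(X_batch, y_batch, classes=classes)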

In this talk, we’ll answer all those questions, and more!

We’ll start by walking through the current distributed analytics (out-of-core learning) landscape to understand the pain points and some of the existing solutions.

Here is a sketch of a system designed to achieve this goal (of building scalable ML models); a minimal code sketch follows the list:

1. a way to stream instances
2. a way to extract features from instances
3. an incremental algorithm
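
A minimal sketch of how those three pieces could fit together, using scikit-learn’s HashingVectorizer (stateless feature extraction, so no full pass over the data is needed) and SGDClassifier (an incremental learner); the stream_instances() generator and its synthetic labels are hypothetical placeholders for a real data source:

import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

def stream_instances():
    # 1. A way to stream instances: hypothetical generator yielding one batch of
    #    (texts, labels) at a time, e.g. read from files on disk or object storage.
    rng = np.random.default_rng(0)
    for _ in range(100):
        texts = ["example document %d" % i for i in range(1000)]
        labels = rng.integers(0, 2, size=1000)
        yield texts, labels

# 2. A way to extract features: HashingVectorizer is stateless, so it can
#    transform each batch without ever seeing the full dataset.
vectorizer = HashingVectorizer(n_features=2**18)

# 3. An incremental algorithm: SGDClassifier supports partial_fit.
clf = SGDClassifier()
classes = np.array([0, 1])

for texts, labels in stream_instances():
    X = vectorizer.transform(texts)
    clf.partial_fit(X, labels, classes=classes)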

Detailed Outline

1. Intro to out-of-core learning
2. Representing large datasets as instances
3. Transforming data (in batches) – live code [3-5]
4. Feature Engineering & Scaling
5. Building and evaluating a model (on entire datasets)
6. Practicing this workflow on another dataset
7. Benchmarking other libraries for OOC learning (see the Dask sketch after this outline)
8. Questions and Answers
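
For the out-of-core and benchmarking parts of the outline, Dask is one of the libraries commonly compared: it represents a large dataset as many pandas-like partitions tied together by a task graph, and only materialises results when compute() is called. A hedged sketch, assuming Dask is installed and that the data lives under a hypothetical data/*.parquet path:

import dask.dataframe as dd

# read_parquet builds a lazy task graph over many files; nothing is loaded yet.
df = dd.read_parquet("data/*.parquet")

# Aggregations stay lazy and are executed partition by partition,
# so the full dataset never has to fit in RAM.
means = df.select_dtypes(include="number").mean()

# compute() runs the task graph, on a single machine or across a cluster.
print(means.compute())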

Key takeaway

By the end of the talk, participants will know how to build petabyte-scale ML models beyond the limits of conventional Python libraries.

Participants will also have benchmarks and best practices for building such ML models at scale.
