Large Scale Data Loading and Data Preprocessing with Ray

(Wei Chen, NVIDIA)
Data loading is one of the most crucial steps in the deep learning (DL) pipeline. It needs to be designed and implemented to be both flexible and performant, so that (1) it can be reused to support different DNN models, (2) it can match the speed of GPU compute, and (3) it can scale to multiple cores and even multiple nodes. However, achieving these design goals is not trivial, especially given that the most commonly used language in DL is Python, which has no good support for parallel programming.
In this talk, we will show how we can use Ray to implement our data loading pipeline. Powered by Ray actors, we are able to reuse most of our Python modules and run our data loading pipeline in parallel without worrying about the overhead of managing it at scale. We will also talk about the experience and lessons we learned during our implementation and production deployment.