Large Scale Data Loading and Data Preprocessing with Ray


(Wei Chen, NVIDIA)

Data loading is one of the most crucial steps in the DL pipeline. It needs to be designed and implemented in a way that is both flexible and performant, so that (1) it can be reused to support different DNN models, (2) it can match the speed of GPU compute, and (3) it can scale to multiple cores and even multiple nodes. However, achieving these design goals is not trivial, especially given that the most commonly used language in DL is Python, which lacks good support for parallel programming.

In this talk, we will show how we use Ray to implement our data loading pipeline. Powered by the Ray actor, we are able to reuse most of our Python modules and run our data loading pipeline in parallel without worrying about the overhead of managing it at scale. We will also talk about the experience and lessons we learned during our implementation and production deployment.
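To make the actor idea concrete, here is a minimal sketch and not the speaker's actual pipeline. It assumes Ray is installed; `ShardLoader`, `load_in_parallel`, the sharding scheme, and the toy preprocessing step are all hypothetical names and choices made for illustration. It shows an ordinary Python class being wrapped unchanged as a Ray actor, with one actor per data shard preparing batches in parallel.

```python
from typing import List


class ShardLoader:
    """Hypothetical loader: owns one shard of sample ids and preprocesses them."""

    def __init__(self, sample_ids: List[int], batch_size: int) -> None:
        self.sample_ids = sample_ids
        self.batch_size = batch_size

    def num_batches(self) -> int:
        # Ceiling division so a final partial batch is still counted.
        return (len(self.sample_ids) + self.batch_size - 1) // self.batch_size

    def load_batch(self, index: int) -> List[float]:
        # Stand-in for real decoding/augmentation: scale each id by 1/256.
        start = index * self.batch_size
        ids = self.sample_ids[start:start + self.batch_size]
        return [i / 256.0 for i in ids]


def load_in_parallel(num_samples: int = 32,
                     num_actors: int = 4,
                     batch_size: int = 4) -> List[List[float]]:
    """Run one ShardLoader actor per shard and gather every batch."""
    # Imported lazily so ShardLoader stays usable without Ray installed.
    import ray

    ray.init(ignore_reinit_error=True)

    # Wrap the unmodified class as an actor; each instance runs in its own process.
    RemoteLoader = ray.remote(ShardLoader)

    # Assumption: a simple strided split of sample ids across actors.
    shards = [list(range(k, num_samples, num_actors)) for k in range(num_actors)]
    actors = [RemoteLoader.remote(shard, batch_size) for shard in shards]

    # Method calls return futures immediately; ray.get blocks until they resolve.
    counts = ray.get([a.num_batches.remote() for a in actors])
    futures = [a.load_batch.remote(i)
               for a, n in zip(actors, counts)
               for i in range(n)]
    batches = ray.get(futures)

    ray.shutdown()
    return batches
```

Calling `load_in_parallel()` starts a local Ray instance and returns the preprocessed batches. Because `ShardLoader` carries no Ray-specific code, the same class can be exercised directly in a single process, which is one way to read the abstract's point about reusing existing Python modules.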

Anyscale


Comments

My wish is that there were a TPC-DS or similar testing suite for load performance testing.

AlbertWong-zo
