Heterogeneous Training Cluster with Ray at Netflix

At Netflix, machine learning algorithms are at the heart of use cases such as recommendations, content understanding, content demand modeling, trailer and artwork generation, and other content creation work. Scaling these use cases to entertain our members can benefit significantly from deep learning techniques. The Machine Learning Platform team at Netflix is tasked with building the infrastructure and tools that make machine learning practitioners across the company more effective. We constantly strive to ensure that our machine learning models are trained and deployed in a reliable, scalable, and robust way.

Deep learning models have grown in complexity and require significantly more computational resources to train. In this talk, we explore the benefits of using Ray to build a heterogeneous training cluster and discuss the steps involved in setting one up. We demonstrate how to run distributed training jobs on a cluster that mixes CPU instances and GPU instances, and show how Ray's automatic resource allocation and management can facilitate the scheduling of different types of workers. Additionally, we discuss the challenges and considerations that come with building and managing persistent clusters using Ray, and provide best practices for effective cluster configuration and management.
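
To make the idea of scheduling different worker types on a mixed CPU/GPU cluster concrete, here is a minimal toy sketch in plain Python. It is not Ray's scheduler and does not use Ray's API; the `Node`, `Worker`, and `place` names, the node sizes, and the first-fit policy are all assumptions made for illustration. It only shows the general principle that each worker declares the logical resources it needs and is placed on a node that has them free.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """A cluster node with a pool of free logical resources (hypothetical model)."""
    name: str
    free: dict  # e.g. {"CPU": 8, "GPU": 0}


@dataclass
class Worker:
    """A worker and the logical resources it requests (hypothetical model)."""
    name: str
    needs: dict  # e.g. {"CPU": 4} or {"CPU": 1, "GPU": 1}


def place(workers, nodes):
    """First-fit placement: assign each worker to the first node that can hold it.

    Returns a dict mapping worker name -> node name, or None if no node fits.
    """
    placement = {}
    for w in workers:
        placement[w.name] = None
        for n in nodes:
            fits = all(n.free.get(res, 0) >= amt for res, amt in w.needs.items())
            if fits:
                # Reserve the requested resources on this node.
                for res, amt in w.needs.items():
                    n.free[res] -= amt
                placement[w.name] = n.name
                break
    return placement


def demo():
    # One CPU-only node and one GPU node (sizes are made up for the example).
    nodes = [
        Node("cpu-node", {"CPU": 8, "GPU": 0}),
        Node("gpu-node", {"CPU": 4, "GPU": 2}),
    ]
    # Data-loading workers need only CPUs; training workers also need a GPU.
    workers = [
        Worker("loader-0", {"CPU": 4}),
        Worker("loader-1", {"CPU": 4}),
        Worker("trainer-0", {"CPU": 1, "GPU": 1}),
        Worker("trainer-1", {"CPU": 1, "GPU": 1}),
        Worker("trainer-2", {"CPU": 1, "GPU": 1}),
    ]
    return place(workers, nodes)


if __name__ == "__main__":
    for worker, node in demo().items():
        print(f"{worker} -> {node or 'pending (no node has enough free resources)'}")
```

In this toy run the two loaders fill the CPU node, two trainers take the GPU node's two GPUs, and the third trainer stays pending until capacity is added. In Ray itself, comparable per-task or per-actor requirements are declared through options such as `num_cpus` and `num_gpus` on `@ray.remote`.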

About Anyscale
---
Anyscale is the AI Application Platform for developing, running, and scaling AI.

If you're interested in a managed Ray service, check out:

About Ray
---
Ray is the most popular open source framework for scaling and productionizing AI workloads. From Generative AI and LLMs to computer vision, Ray powers the world’s most ambitious AI workloads.

#llm #machinelearning #ray #deeplearning #distributedsystems #python #genai