Parallel inferencing with KServe Ray integration
KServe is an open-source, production-ready model inference framework for Kubernetes that builds on many of Knative's features, such as canary traffic routing and payload logging. However, its one-model-per-container paradigm limits concurrency and throughput when handling multiple inference requests. With the Ray Serve integration, a model can be deployed as a set of individual Python workers, so concurrent inference requests are processed in parallel and overall efficiency improves. In this talk, we will share how you can configure, run, and scale machine learning models on Kubernetes using KServe and Ray.
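To make the idea concrete, here is a minimal sketch of a KServe custom model wrapped in a Ray Serve deployment, roughly following the pattern in the KServe custom-model docs. The class name, replica count, and the placeholder model logic are hypothetical, and the exact ModelServer/Ray Serve wiring may differ between KServe and Ray versions; treat this as an illustration, not the talk's reference implementation.

```python
# Sketch only: names and model logic are placeholders; API details may vary by version.
from typing import Dict

import kserve
from ray import serve


@serve.deployment(name="custom-model", num_replicas=2)  # two parallel Python workers
class CustomModel(kserve.Model):
    def __init__(self):
        super().__init__("custom-model")
        self.model = None
        self.load()

    def load(self):
        # Load real weights here; an identity function stands in for illustration.
        self.model = lambda x: x
        self.ready = True

    def predict(self, payload: Dict, headers: Dict[str, str] = None) -> Dict:
        # Each Ray Serve replica handles requests concurrently.
        instances = payload["instances"]
        return {"predictions": [self.model(i) for i in instances]}


if __name__ == "__main__":
    # Assumption: ModelServer can start a Ray Serve deployment keyed by model name,
    # as shown in the KServe custom-model documentation.
    kserve.ModelServer().start({"custom-model": CustomModel})
```

The resulting server would then be packaged into a container image and referenced from an InferenceService, the same way any KServe custom predictor is deployed; the Ray Serve replicas provide the parallelism inside that single container.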
About Anyscale
---
Anyscale is the AI Application Platform for developing, running, and scaling AI.
If you're interested in a managed Ray service, check out Anyscale.
About Ray
---
Ray is the most popular open source framework for scaling and productionizing AI workloads. From Generative AI and LLMs to computer vision, Ray powers the world’s most ambitious AI workloads.
#llm #machinelearning #ray #deeplearning #distributedsystems #python #genai
Parallel inferencing with KServe Ray integration
Enabling Cost-Efficient LLM Serving with Ray Serve
Serverless Machine Learning Model Inference on Kubernetes with KServe by Stavros Kontopoulos
Exploring ML Model Serving with KServe (with fun drawings) - Alexa Nicole Griffith, Bloomberg
Accelerate Federated Learning Model Deployment with KServe (KFServing) - Fangchi Wang & Jiahao C...
How We Built an ML inference Platform with Knative - Dan Sun, Bloomberg LP & Animesh Singh, IBM
Ray Serve: Tutorial for Building Real Time Inference Pipelines
Open-source Chassis.ml - Deploy Model to KServe
Fast LLM Serving with vLLM and PagedAttention
Go Production: ⚡️ Super FAST LLM (API) Serving with vLLM !!!
Deploying Many Models Efficiently with Ray Serve
Faster and Cheaper Offline Batch Inference with Ray
Kubeflow Essentials 7-2. Kserve (Architecture Concepts)
How to Create a Custom Serving Runtime in KServe ModelMesh to S... Rafael Vasquez & Christian Ka...
Serving Machine Learning Models at Scale Using KServe - Animesh Singh, IBM - KubeCon North America
modelcar-demo
Introducing Ray Serve: Scalable and Programmable ML Serving Framework - Simon Mo, Anyscale
What's New, ModelMesh? Model Serving at Scale - Rafael Vasquez, IBM
Serving Machine Learning Models at Scale Using KServe - Yuzhui Liu, Bloomberg
Accelerating LLM Inference with vLLM
Ray Serve: Patterns of ML Models in Production
Inference Graphs at LinkedIn Using Ray-Serve
KServe: The State and Future of Cloud Native Model Serving (Kubeflow Summit 2022)
Custom Code Deployment with KServe and Seldon Core