vLLM Office Hours - Deep Dive into Mistral on vLLM - October 17, 2024

During this special topic deep dive, we were joined by Mistral AI's research engineer Patrick von Platen, who shared insights into Mistral's architecture choices and how to efficiently deploy Mistral models on vLLM.
During the Q&A, we tackled audience questions on topics such as architecture redesign strategies, rotary position embeddings, vLLM support for ARM architecture, OpenAI Whisper, Seq2Seq support in v0.6.3, and more.
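
For context on what such a deployment looks like, here is a minimal sketch of offline inference with a Mistral model using vLLM's Python API. The model id, the tokenizer_mode setting, and the sampling parameters are illustrative choices, not details confirmed in the session.

```python
# Minimal sketch: offline inference with a Mistral model on vLLM.
# Model id and sampling parameters are illustrative assumptions.
from vllm import LLM, SamplingParams

# Recent vLLM releases accept tokenizer_mode="mistral" to use
# Mistral's native tokenizer format for these models.
llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    tokenizer_mode="mistral",
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(
    ["Explain PagedAttention in one paragraph."],
    params,
)
print(outputs[0].outputs[0].text)
```

The same model can instead be exposed over an OpenAI-compatible endpoint with vLLM's server CLI (e.g. `vllm serve mistralai/Mistral-7B-Instruct-v0.3 --tokenizer-mode mistral`), which is the more common path for production deployments.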
vLLM Office Hours - FP8 Quantization Deep Dive - July 9, 2024
vLLM Office Hours - vLLM on AMD GPUs and Google TPUs - August 21, 2024
vLLM Office Hours - SOTA Tool-Calling Implementation in vLLM - November 7, 2024
vLLM Office Hours - Advanced Techniques for Maximizing vLLM Performance - September 19, 2024
Accelerating LLM Inference with vLLM
Fast LLM Serving with vLLM and PagedAttention
vLLM Office Hours - Model Quantization for Efficient vLLM Inference - July 25, 2024
vLLM: Easy, Fast, and Cheap LLM Serving for Everyone - Woosuk Kwon & Xiaoxuan Liu, UC Berkeley
vLLM Office Hours - June 20, 2024
Boost Your AI Predictions: Maximize Speed with vLLM Library for Large Language Model Inference
vLLM and PagedAttention is the best for fast Large Language Models (LLMs) inference | Let's see WHY
CUDA Mode Keynote | Lily Liu | vLLM
E07 | Fast LLM Serving with vLLM and PagedAttention
Llama 3.2 Deep Dive - Tiny LM & NEW VLM Unleashed By Meta
But what is DeepSpeed? DeepSpeed vs vLLM
Exploring the Latency/Throughput & Cost Space for LLM Inference // Timothée Lacroix // CTO Mist...
Deploy LLMs More Efficiently with vLLM and Neural Magic
Video Comprehension using GenAI | Qwen 2 VL 2B #llm #imageprocessing #imagerecognition #vlm #qwen
Unlocking LLM Efficiency: PagedAttention & vLLM Revolutionize Memory Management
vLLM on Kubernetes in Production
All You Need To Know About Running LLMs Locally
What is Retrieval-Augmented Generation (RAG)?
Accelerate Big Model Inference: How Does it Work?