vLLM - Turbo Charge your LLM Inference


For more tutorials on using LLMs and building Agents, check out my Patreon:

My Links:

Github:

Timestamps:
00:00 Intro
01:17 vLLM Blog
04:27 vLLM Github
05:40 Code Time
Comments

As always, you are one of the few people covering this topic on YouTube.

rajivmehtapy

Thanks Sam for this video. It would be interesting to dedicate a video to comparing OpenAI API emulation options such as LocalAI, Oobabooga, and vLLM.

spyke

Talked to its core developer; they don't have plans to support quantized models yet, so you really need powerful GPU(s) to run it.

MultiSunix

A very good test for this, and one you could make a video about, would be to use the OpenAI-compatible server functionality with a well-performing local model that is strong at coding, and try it with new tools like GPT-Engineer or Aider to see how it compares to GPT-4 in real-world scenarios of writing applications.

mayorc
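
A minimal sketch of that setup, with an illustrative model name and port (any model vLLM supports could be substituted; assumes the pre-1.0 openai Python client):

```python
# Launch vLLM's OpenAI-compatible server first (shell), e.g.:
#   python -m vllm.entrypoints.openai.api_server --model lmsys/vicuna-7b-v1.3
# Then point any OpenAI-style client at the local endpoint:
import openai

openai.api_key = "EMPTY"                      # the local server does not check keys
openai.api_base = "http://localhost:8000/v1"  # assumed default host/port

completion = openai.Completion.create(
    model="lmsys/vicuna-7b-v1.3",             # must match the model the server loaded
    prompt="Write a Python function that reverses a string.",
    max_tokens=128,
    temperature=0.2,
)
print(completion.choices[0].text)
```

Tools like Aider and GPT-Engineer use the openai client under the hood, so in principle they can be pointed at the same local endpoint via the OPENAI_API_BASE environment variable.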

Finally, AI models that don't take a year to give a response.
Cheers for sharing this, Sam.

g-program-it

Hmm, I did not know that Red Bull and Verstappen were in the race for turbocharging LLMs 😉 Thanks for demonstrating vLLM in combination with an open-source model 👍

henkhbit

Sam, I love your videos, but this one takes the cake. Thank you!!!

TailorJohnson-ly

Thanks mate, I am going to try to add that to LangChain so it can integrate seamlessly into my product.

Rems

This is very interesting. Thanks for sharing this. It would be nicer, I guess, if LangChain could do the same.

guanjwcn
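
Until a native integration lands, one hedged way to do this is to run vLLM's OpenAI-compatible server and point LangChain's existing OpenAI wrapper at it; the endpoint, key, and model name below are placeholders:

```python
# Sketch only: LangChain talks to a locally running vLLM api_server
# through its standard OpenAI wrapper.
from langchain.llms import OpenAI

llm = OpenAI(
    openai_api_key="EMPTY",                      # the local server ignores the key
    openai_api_base="http://localhost:8000/v1",  # vLLM OpenAI-compatible endpoint
    model_name="lmsys/vicuna-7b-v1.3",           # must match the served model
    temperature=0.2,
)

print(llm("Summarise what PagedAttention does in one sentence."))
```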

Thank you for this nugget. Very useful information for speeding up inference. Does it support the bitsandbytes library for loading in 8-bit or 4-bit?

Edit: Noticed there is no Falcon support.

MeanGeneHacks
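
vLLM itself has no bitsandbytes path, but for comparison, the plain transformers route to an 8-bit load looks roughly like this (model name is only an example; assumes bitsandbytes and accelerate are installed):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-7b-instruct"  # example model, chosen since vLLM couldn't load Falcon yet
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,       # bitsandbytes int8 weights
    device_map="auto",       # let accelerate place layers across available GPUs
    trust_remote_code=True,  # Falcon required this at the time
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```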

Finally we can achieve fast responses.

wilfredomartel

My question is: does it increase throughput by freeing up memory to hold more batches? And if so, how does it achieve the speed-up in latency?

rakeshramesh
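
A rough back-of-the-envelope answer: the memory PagedAttention reclaims mostly goes into packing more sequences per batch (throughput), while per-request latency mainly benefits from continuous batching rather than the cache layout itself. The arithmetic below uses the commonly published LLaMA-13B configuration and is only an estimate:

```python
# KV-cache cost per generated token for a LLaMA-13B-class model (fp16 = 2 bytes).
num_layers, num_heads, head_dim, bytes_per_val = 40, 40, 128, 2

kv_bytes_per_token = 2 * num_layers * num_heads * head_dim * bytes_per_val  # K and V
print(kv_bytes_per_token / 1024)            # ~800 KB per token

# At a 2048-token context that is ~1.6 GB of cache per request, so how tightly
# this memory is packed decides how many requests fit on the GPU at once.
print(2048 * kv_bytes_per_token / 1024**3)  # ~1.6 GB
```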

vLLM is great but lacks support for some models (and some are still buggy, like MPT-30B with streaming, but MPT was added about 2 days ago, so expect that to be fixed soon). For example, there is little chance it will support Falcon-40B soon. In that case use which can load Falcon-40B in 8-bit flawlessly!

MariuszWoloszyn

It looks like vLLM itself is CUDA-only, but I wonder if these techniques could apply to CPU-based runtimes like llama.cpp? Presumably any improvements wouldn't be as dramatic if the bottleneck is processing cycles rather than memory.

NickAubert

Can you run this on AWS SageMaker too? Does it also work with the Llama 2 models with 7 and 13 billion parameters?

Gerald-xgrq
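
vLLM's LLaMA support should cover the Llama 2 7B/13B checkpoints (they are gated on Hugging Face, so access is assumed), and on SageMaker it runs like any other Python library inside a GPU notebook or endpoint container. A minimal offline-inference sketch, roughly the shape of the code shown in the video:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")  # any supported 7B/13B model name works here
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

prompts = ["The capital of France is", "Explain KV caching in one sentence:"]
for output in llm.generate(prompts, params):
    print(output.prompt, "->", output.outputs[0].text)
```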

Not sure, since they compared against HF Transformers, and HF doesn't use FlashAttention by default to my knowledge, so it is quite slow out of the box.

frazuppi

Where is the model comparison against Hugging Face made, in terms of execution time?

shishirsinha

I'm wondering how vLLM compares against conversion to ONNX (e.g. with Optimum) in terms of speed and ease of use. I'm struggling a bit with ONNX 😅

io
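
For a side-by-side comparison, the Optimum/ONNX Runtime route usually looks something like the sketch below (model name is just an example; depending on the Optimum version the export flag is export=True or the older from_transformers=True):

```python
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

model_id = "gpt2"  # example; swap in the model you actually want to benchmark
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = ORTModelForCausalLM.from_pretrained(model_id, export=True)  # convert to ONNX on load

inputs = tokenizer("ONNX Runtime makes inference", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```

Timing the same prompts through both this and vLLM's LLM.generate is probably the most direct way to settle the speed question for a given model.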

It should be noted that for whatever reason it does not work with CUDA 12.x (yet).

clray

Can this package be used with quantized 4-bit models? I don't see any support for them in the docs.

asmacnolastname