vLLM - Turbo Charge your LLM Inference


For more tutorials on using LLMs and building Agents, check out my Patreon:

My Links:

Github:

Timestamps:
00:00 Intro
01:17 vLLM Blog
04:27 vLLM Github
05:40 Code Time
Comments

As always, you are one of the few people covering this topic on YouTube.

rajivmehtapy

Thanks Sam for this video. It would be interesting to dedicate a video to comparing OpenAI API emulation options such as LocalAI, Oobabooga, and vLLM.

spyke

Talked to its core developer; they don't have plans to support quantized models yet, so you really need powerful GPU(s) to run it.

MultiSunix

A very good test for this, and one you could make a video about, would be to use the OpenAI-compatible server functionality with a well-performing local model that is strong at coding, and try it with new tools like GPT-Engineer or Aider to see how it compares to GPT-4 in real-world scenarios of writing applications.

mayorc
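
A minimal sketch of that setup, with an illustrative model name and port (any model vLLM supports could be substituted; assumes the pre-1.0 openai Python client):

```python
# Launch vLLM's OpenAI-compatible server first (shell), e.g.:
#   python -m vllm.entrypoints.openai.api_server --model lmsys/vicuna-7b-v1.3
# Then point any OpenAI-style client at the local endpoint:
import openai

openai.api_key = "EMPTY"                      # the local server does not check keys
openai.api_base = "http://localhost:8000/v1"  # assumed default host/port

completion = openai.Completion.create(
    model="lmsys/vicuna-7b-v1.3",             # must match the model the server loaded
    prompt="Write a Python function that reverses a string.",
    max_tokens=128,
    temperature=0.2,
)
print(completion.choices[0].text)
```

Tools like Aider and GPT-Engineer use the openai client under the hood, so in principle they can be pointed at the same local endpoint via the OPENAI_API_BASE environment variable.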

Finally, AI models that don't take a year to give a response.
Cheers for sharing this, Sam.

g-program-it

Hmm, I did not know that Red Bull and Verstappen were in the race for turbocharging LLMs 😉 Thanks for demonstrating vLLM in combination with an open-source model 👍

henkhbit

Sam, I love your videos, but this one takes the cake. Thank you!!!

TailorJohnson-ly

Thanks mate, I am going to try to add that to LangChain so it can integrate seamlessly into my product.

Rems

This is very interesting. Thanks for sharing this. It would be nicer, I guess, if LangChain could do the same.

guanjwcn
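
Until a native integration lands, one hedged way to do this is to run vLLM's OpenAI-compatible server and point LangChain's existing OpenAI wrapper at it; the endpoint, key, and model name below are placeholders:

```python
# Sketch only: LangChain talks to a locally running vLLM api_server
# through its standard OpenAI wrapper.
from langchain.llms import OpenAI

llm = OpenAI(
    openai_api_key="EMPTY",                      # the local server ignores the key
    openai_api_base="http://localhost:8000/v1",  # vLLM OpenAI-compatible endpoint
    model_name="lmsys/vicuna-7b-v1.3",           # must match the served model
    temperature=0.2,
)

print(llm("Summarise what PagedAttention does in one sentence."))
```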

Thank you for this nugget. Very useful information for speeding up inference. Does it support the bitsandbytes library for loading in 8-bit or 4-bit?

Edit: Noticed there is no Falcon support.

MeanGeneHacks
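
vLLM itself has no bitsandbytes path, but for comparison, the plain transformers route to an 8-bit load looks roughly like this (model name is only an example; assumes bitsandbytes and accelerate are installed):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-7b-instruct"  # example model, chosen since vLLM couldn't load Falcon yet
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,       # bitsandbytes int8 weights
    device_map="auto",       # let accelerate place layers across available GPUs
    trust_remote_code=True,  # Falcon required this at the time
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```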

Finally we can achieve fast responses.

wilfredomartel

My question is: does it increase throughput by freeing up memory to hold more batches? And if so, how does it achieve the speed-up in latency?

rakeshramesh
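
A rough back-of-the-envelope answer: the memory PagedAttention reclaims mostly goes into packing more sequences per batch (throughput), while per-request latency mainly benefits from continuous batching rather than the cache layout itself. The arithmetic below uses the commonly published LLaMA-13B configuration and is only an estimate:

```python
# KV-cache cost per generated token for a LLaMA-13B-class model (fp16 = 2 bytes).
num_layers, num_heads, head_dim, bytes_per_val = 40, 40, 128, 2

kv_bytes_per_token = 2 * num_layers * num_heads * head_dim * bytes_per_val  # K and V
print(kv_bytes_per_token / 1024)            # ~800 KB per token

# At a 2048-token context that is ~1.6 GB of cache per request, so how tightly
# this memory is packed decides how many requests fit on the GPU at once.
print(2048 * kv_bytes_per_token / 1024**3)  # ~1.6 GB
```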

vLLM is great but lacks support for some models (and some are still buggy, like MPT-30B with streaming, but MPT was added about 2 days ago, so expect that to be fixed soon). For example, there is little chance it will support Falcon-40B soon. In that case use which can load Falcon-40B in 8-bit flawlessly!

MariuszWoloszyn

It looks like vLLM itself is CUDA-only, but I wonder if these techniques could apply to CPU-based runtimes like llama.cpp? Presumably any improvements wouldn't be as dramatic if the bottleneck is processing cycles rather than memory.

NickAubert

Can you run this on AWS SageMaker too? Does it also work with the Llama 2 models with 7 and 13 billion parameters?

Gerald-xgrq
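
vLLM's LLaMA support should cover the Llama 2 7B/13B checkpoints (they are gated on Hugging Face, so access is assumed), and on SageMaker it runs like any other Python library inside a GPU notebook or endpoint container. A minimal offline-inference sketch, roughly the shape of the code shown in the video:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")  # any supported 7B/13B model name works here
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

prompts = ["The capital of France is", "Explain KV caching in one sentence:"]
for output in llm.generate(prompts, params):
    print(output.prompt, "->", output.outputs[0].text)
```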

Not sure, since they compared against HF Transformers, and HF doesn't use FlashAttention by default to my knowledge, so it is quite slow out of the box.

frazuppi

Where is the model comparison against Hugging Face made, in terms of execution time?

shishirsinha

I'm wondering how vLLM compares against conversion to ONNX (e.g. with Optimum) in terms of speed and ease of use. I'm struggling a bit with ONNX 😅

io
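
For a side-by-side comparison, the Optimum/ONNX Runtime route usually looks something like the sketch below (model name is just an example; depending on the Optimum version the export flag is export=True or the older from_transformers=True):

```python
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

model_id = "gpt2"  # example; swap in the model you actually want to benchmark
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = ORTModelForCausalLM.from_pretrained(model_id, export=True)  # convert to ONNX on load

inputs = tokenizer("ONNX Runtime makes inference", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```

Timing the same prompts through both this and vLLM's LLM.generate is probably the most direct way to settle the speed question for a given model.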

It should be noted that for whatever reason it does not work with CUDA 12.x (yet).

clray

Can this package be used with quantized 4-bit models? I don't see any support for them in the docs.

asmacnolastname