Go Production: ⚡️ Super FAST LLM (API) Serving with vLLM!!!

vLLM is a fast and easy-to-use library for LLM inference and serving (a minimal usage sketch appears below the feature list).

vLLM is fast with:

State-of-the-art serving throughput
Efficient management of attention key and value memory with PagedAttention
Continuous batching of incoming requests
Optimized CUDA kernels
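
For reference, a minimal offline-inference sketch with vLLM's Python API; the model name and sampling values are illustrative placeholders, and any HuggingFace model that vLLM supports works the same way:

# Minimal offline-inference sketch; facebook/opt-125m is just a small placeholder model.
from vllm import LLM, SamplingParams

prompts = [
    "The capital of France is",
    "Continuous batching means that",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# vLLM pages the KV cache with PagedAttention and batches requests internally,
# so simply passing several prompts at once is enough to benefit from it.
llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
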
vLLM is flexible and easy to use with:

Seamless integration with popular HuggingFace models
High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
Tensor parallelism support for distributed inference
Streaming outputs
OpenAI-compatible API server (see the server sketch below)
vLLM seamlessly supports many HuggingFace models.
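
A hedged sketch of querying the OpenAI-compatible server from Python; the launch command, port, and flags can differ between vLLM versions, and the model name is again a placeholder:

# Query vLLM's OpenAI-compatible endpoint.
# Assumes the server was started separately, e.g. (exact flags vary by vLLM version):
#   python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m
# Multi-GPU serving would add something like --tensor-parallel-size 2.
# Requires: pip install requests
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",  # default host/port assumed
    json={
        "model": "facebook/opt-125m",  # must match the model the server loaded
        "prompt": "vLLM is fast because",
        "max_tokens": 64,
        "temperature": 0.7,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])

Because the endpoint mirrors the OpenAI completions API, the same request can be sent from curl, Postman, or the official OpenAI client pointed at the local base URL.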

❤️ If you want to support the channel ❤️
Support here:
Comments

my dude you're the superhero of these tutorials! I was just thinking about how I'm annoyed these LLMs take so long to respond. And bam, you posted this wonderful video! Thank you!

seanmurphy

I ended up here after not being able to get OpenLLM working due to various issues on my PC locally. This is awesome. I got it working in a few minutes with different models. Thank you! Subscribing!

khorLDW

great work man...
can't thank you enough. thanks again. great to see more Indian AI tech talent out there :D

sujantkumarkv

Thank you, your videos are becoming a daily thing for me

mohegyux

Thanks, you are the vLLM of this space
love the speed of your videos. Colab lets more of us learn with less $$

deabyam

Inference was the main bottleneck of LLMs, this is amazing, thank you so much 🤩🤩🤩. Please make a video on the PagedAttention algorithm 🤩🤩

shamaldesilva

Fantastic video! Just what I wanted to see.

gpsb

Amazing tutorial my friend! Was looking for this. It would have been more helpful if you could explain how to deploy LLMs served with the vLLM inference engine.

maazshaikh

Wow. Thank you for always bringing news to us ;)

MarceloLimaXP

Great information, really appreciate it 🎉🎉🎉.
If possible, can you show how we can add our own data (an Excel report) alongside the model so that the LLM can answer from our data too?

arjunm

mad props to the creator for breaking down vLLM and its advantages over traditional LLM serving

that PagedAttention tech sounds lit, giving it those crazy throughput numbers

but, not gonna lie, using Google Colab as a production environment? kinda sus

still, respect the hustle for making it accessible to peeps without fancy GPUs

mad respect for that grind

moondevonyt

amazing 🔥🔥, btw any idea how we can plug this into Gradio? that way sharing and access would be much easier

anki

Thanks for this wonderful video. I want to know: can we do RAG over the model with vLLM? Also, can we run vLLM in a Kubernetes cluster?

santoshshetty

Thanks for the video; the only problem is that I'm getting a "torch.cuda.OutOfMemoryError: CUDA out of memory" error when trying to serve the LLM in Google Colab. Is there a way to change the batch size using vLLM parameters?

cmeneseslob

Hello, in your opinion, what's better to use in production, TGI or vLLM?

aliissa

Awesome, man! Does it work with LangChain?

pavanpraneeth

Does it support quantised models? Is it supported in Oobabooga already? Quite an interesting topic - I hear ppl are using it in production often

alx

Awesome video! But while inferencing and calling the endpoint from Postman, it shows that the Jupyter notebook server is running, not the answer from the LLM (/v1/completions).

sakshatkatyarmal

So if I had to choose, one of the best LLMs from that selection would be Falcon?

ilianos

How does it compare to 4-bit quantization?

aozynoob