Boost Your AI Predictions: Maximize Speed with vLLM Library for Large Language Model Inference

Discover vLLM, UC Berkeley's open-source library for fast LLM inference. Its PagedAttention algorithm delivers up to 24x higher throughput than HuggingFace Transformers. We compare vLLM and HuggingFace Transformers using the Llama 2 7B model and show how to easily integrate vLLM into your projects.
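
To give a sense of how little code the integration takes, here is a minimal offline-inference sketch using vLLM's LLM and SamplingParams classes. The Llama 2 checkpoint name, prompts, and sampling settings are assumptions for illustration (the weights are gated on Hugging Face), not the exact notebook from the video.

```python
# Minimal vLLM offline-inference sketch.
# Assumes: pip install vllm, a CUDA GPU, and access to the gated Llama 2 weights.
from vllm import LLM, SamplingParams

prompts = [
    "Explain what PagedAttention does in one sentence.",
    "Why does batching speed up LLM inference?",
]

# Nucleus sampling with a 128-token cap per completion.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

# Loading the model also allocates the paged KV cache on the GPU.
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")

# generate() batches the prompts internally and returns one RequestOutput per prompt.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt)
    print(output.outputs[0].text)
```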

Join this channel to get access to perks and support my work:

00:00 - What is vLLM?
03:27 - vLLM Quickstart
04:58 - Google Colab Setup (with Llama 2)
07:19 - Single Example Inference Comparison
08:57 - Batch Inference Comparison
10:29 - Conclusion
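
The single-example and batch comparisons in the video boil down to timing the same prompts through both back ends. Below is a rough sketch of the batch case; the model id, batch size, and token budget are assumptions, and in practice you would run the two halves in separate processes (or restart the Colab runtime) so that both models are not held in GPU memory at once.

```python
import time

from transformers import AutoModelForCausalLM, AutoTokenizer
from vllm import LLM, SamplingParams

MODEL_ID = "meta-llama/Llama-2-7b-chat-hf"          # assumed (gated) checkpoint
prompts = ["Write a short poem about GPUs."] * 32   # assumed batch size

# --- HuggingFace Transformers baseline ---
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto"  # device_map needs accelerate
)

start = time.perf_counter()
batch = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
model.generate(**batch, max_new_tokens=128)
print(f"Transformers batch time: {time.perf_counter() - start:.1f}s")

# --- vLLM ---
llm = LLM(model=MODEL_ID)
params = SamplingParams(max_tokens=128)

start = time.perf_counter()
llm.generate(prompts, params)
print(f"vLLM batch time: {time.perf_counter() - start:.1f}s")
```

The gap widens as the batch grows, because vLLM's continuous batching and paged KV cache keep the GPU saturated; that is the effect the comparison chapters illustrate.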

#artificialintelligence #llm #mlops #llama2 #chatbot #promptengineering #python
Comments

Is there a way to load quantized models using vLLM?

thevadimb

Awesome ❤. How can I run LLMs that are already downloaded on my disk?

AliAlias

Did you try TensorRT-LLM with the Triton backend?

Gerald-izmv