vLLM Faster LLM Inference || Gemma-2B and Camel-5B

vLLM is a fast and easy-to-use library for LLM inference and serving.

vLLM is fast with:

- State-of-the-art serving throughput
- Efficient management of attention key and value memory with PagedAttention
- Continuous batching of incoming requests
- Fast model execution with CUDA/HIP graph
- Quantization: GPTQ, AWQ, SqueezeLLM, FP8 KV Cache
- Optimized CUDA kernels
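
For a quick sense of the API, here is a minimal offline-inference sketch using vLLM's LLM and SamplingParams classes. The model ID, prompt, and sampling values are illustrative stand-ins, not an exact copy of what the video runs.

from vllm import LLM, SamplingParams

# Load Gemma-2B from Hugging Face (the repo is gated, so the Gemma license
# must be accepted and an HF token configured beforehand).
llm = LLM(model="google/gemma-2b")

# Example sampling settings; tune these for your own use case.
sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=256)

prompts = ["Explain what PagedAttention does in one short paragraph."]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)

# A pre-quantized checkpoint could be loaded instead, e.g.
# LLM(model="<awq-quantized-repo>", quantization="awq").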

Reach out to me:

Timestamps
00:00 Introduction
01:23 Code Implementation
03:40 Gemma-2B Inference
10:00 Camel-5B Inference
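The same script can be pointed at Camel-5B by swapping the model name. A small sketch under assumptions: the Hugging Face repo ID "Writer/camel-5b-hf" and the instruction-style prompt template are guesses at what the video uses, not taken from it.

from vllm import LLM, SamplingParams

# Assumed repo ID for Camel-5B; replace with the checkpoint the video loads.
llm = LLM(model="Writer/camel-5b-hf")

# Camel-5B is instruction-tuned, so an instruction-style prompt is assumed here.
prompt = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\nSummarize what vLLM is in two sentences.\n\n"
    "### Response:"
)

outputs = llm.generate([prompt], SamplingParams(temperature=0.2, max_tokens=200))
print(outputs[0].outputs[0].text)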

#llm #largelanguagemodels #ai #generativeai #vllm
Comments

Hi Tarun, what is the limit for Hugging Face embeddings? For Voyage AI it is 50 million free tokens!

Aditya_-nv