How to serve 10,000 fine-tuned LLMs from a single GPU
You can now swap LoRAs during LLM inference with TensorRT-LLM on Baseten. This means you can serve thousands of fine-tuned variants of an LLM from a single GPU while maintaining a low time to first token (TTFT) and high tokens per second (TPS).
Inference-time LoRA swapping with TensorRT-LLM supports in-flight batching and loads LoRA weights in 1-2 milliseconds, enabling each request to hit a different fine-tuned model in real time.
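To make the per-request flow concrete, here is a minimal client-side sketch of routing each request to a different fine-tuned adapter on the same deployment. The endpoint URL shape and `Api-Key` header follow Baseten's standard model-invocation pattern, but the `lora_id` field, the `output` response key, and the model ID placeholder are illustrative assumptions; the actual payload schema depends on how the deployed model's serving code exposes adapter selection.

```python
# Client-side sketch: two requests against the same GPU deployment,
# each pinned to a different fine-tuned LoRA adapter.
import requests

# Placeholders -- substitute your own model ID and API key.
BASETEN_MODEL_URL = "https://model-<model_id>.api.baseten.co/production/predict"
API_KEY = "<your_baseten_api_key>"

def generate(prompt: str, lora_id: str) -> str:
    """Send one request pinned to a specific fine-tuned adapter.

    The "lora_id" field is an assumed name for whatever parameter the
    serving code uses to select an adapter per request.
    """
    resp = requests.post(
        BASETEN_MODEL_URL,
        headers={"Authorization": f"Api-Key {API_KEY}"},
        json={"prompt": prompt, "lora_id": lora_id, "max_tokens": 256},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["output"]

# Concurrent callers can hit different fine-tunes: the server loads the
# requested LoRA weights in 1-2 ms, and in-flight batching keeps both
# requests in the same GPU batch.
print(generate("Summarize this support ticket:", lora_id="customer-support-v3"))
print(generate("Translate to German:", lora_id="translation-de-v1"))
```

Because adapter selection happens per request rather than per deployment, the same pattern scales from two adapters to thousands without adding GPUs.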