How to serve 10,000 fine-tuned LLMs from a single GPU
You can now swap LoRAs during LLM inference with TensorRT-LLM on Baseten. This means you can serve thousands of fine-tuned variants of an LLM from a single GPU while maintaining a low time to first token (TTFT) and high tokens per second (TPS).
Inference-time LoRA swapping with TensorRT-LLM supports in-flight batching and loads LoRA weights in 1-2 milliseconds, enabling each request to hit a different fine-tuned model in real time.
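To make the per-request flow concrete, here is a minimal client-side sketch of routing each request to a different fine-tuned adapter on the same deployment. The endpoint URL shape and `Api-Key` header follow Baseten's standard model-invocation pattern, but the `lora_id` field, the `output` response key, and the model ID placeholder are illustrative assumptions; the actual payload schema depends on how the deployed model's serving code exposes adapter selection.

```python
# Client-side sketch: two requests against the same GPU deployment,
# each pinned to a different fine-tuned LoRA adapter.
import requests

# Placeholders -- substitute your own model ID and API key.
BASETEN_MODEL_URL = "https://model-<model_id>.api.baseten.co/production/predict"
API_KEY = "<your_baseten_api_key>"

def generate(prompt: str, lora_id: str) -> str:
    """Send one request pinned to a specific fine-tuned adapter.

    The "lora_id" field is an assumed name for whatever parameter the
    serving code uses to select an adapter per request.
    """
    resp = requests.post(
        BASETEN_MODEL_URL,
        headers={"Authorization": f"Api-Key {API_KEY}"},
        json={"prompt": prompt, "lora_id": lora_id, "max_tokens": 256},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["output"]

# Concurrent callers can hit different fine-tunes: the server loads the
# requested LoRA weights in 1-2 ms, and in-flight batching keeps both
# requests in the same GPU batch.
print(generate("Summarize this support ticket:", lora_id="customer-support-v3"))
print(generate("Translate to German:", lora_id="translation-de-v1"))
```

Because adapter selection happens per request rather than per deployment, the same pattern scales from two adapters to thousands without adding GPUs.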