Herbie Bradley – EleutherAI – Speeding up inference of LLMs with Triton and FasterTransformer

Triton Inference Server and FasterTransformer are solutions from Nvidia for deploying Transformer language models for fast inference at scale. I will talk about my experience successfully deploying these libraries to speed up inference of the code generation models we use in research by up to 10x.
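As a rough illustration of what this setup looks like from the client side, here is a minimal Python sketch using the tritonclient package. The tensor names ("input_ids", "input_lengths", "request_output_len", "output_ids") follow Nvidia's fastertransformer_backend GPT examples, and the model name "fastertransformer" is an assumption; both must match the model's config.pbtxt in your deployment.

    # Sketch: querying a FasterTransformer model served by Triton over HTTP.
    # Tensor names and the model name are assumptions taken from the
    # fastertransformer_backend GPT examples; adjust to your config.pbtxt.
    import numpy as np
    import tritonclient.http as httpclient

    client = httpclient.InferenceServerClient(url="localhost:8000")

    # Token IDs for one prompt; in practice these come from your tokenizer.
    input_ids = np.array([[818, 262, 3726]], dtype=np.uint32)
    input_lengths = np.array([[input_ids.shape[1]]], dtype=np.uint32)
    request_output_len = np.array([[64]], dtype=np.uint32)  # tokens to generate

    inputs = []
    for name, arr in [("input_ids", input_ids),
                      ("input_lengths", input_lengths),
                      ("request_output_len", request_output_len)]:
        tensor = httpclient.InferInput(name, list(arr.shape), "UINT32")
        tensor.set_data_from_numpy(arr)
        inputs.append(tensor)

    result = client.infer(model_name="fastertransformer", inputs=inputs)
    print(result.as_numpy("output_ids"))  # generated token IDs; decode with your tokenizer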
Comments

Great video! It would be great if you could also share the code for porting/deploying FasterTransformer on Triton Inference Server, as publishing models there is slightly cumbersome and the tutorials don't do it justice.

stephennfernandes
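
A minimal sketch of the deployment step the comment above asks about, assuming the model repository layout from Nvidia's fastertransformer_backend examples (the directory names and the model name "fastertransformer" are assumptions, not code from the talk):

    # Sketch: verifying a FasterTransformer model repository loads in Triton.
    # Assumed layout, following the fastertransformer_backend examples:
    #   model_repository/
    #     fastertransformer/
    #       config.pbtxt   # declares backend: "fastertransformer", the I/O
    #                      # tensors, and parameters such as tensor_para_size
    #       1/             # version directory holding the converted FT checkpoint
    import tritonclient.http as httpclient

    client = httpclient.InferenceServerClient(url="localhost:8000")

    # After starting the server with
    #   tritonserver --model-repository=/path/to/model_repository
    # confirm the model came up before sending traffic.
    assert client.is_server_live()
    print(client.is_model_ready("fastertransformer"))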

Thank you very much! Do you know of any work porting LLaMA models to FasterTransformer?

SinanAkkoyun