Herbie Bradley – EleutherAI – Speeding up inference of LLMs with Triton and FasterTransformer

Triton Inference Server and FasterTransformer are solutions from Nvidia for deploying Transformer language models for fast inference at scale. I will talk about my experience successfully deploying these libraries to speed up inference of the code generation models we use in research by up to 10x.
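As a rough illustration of what this setup looks like from the client side, here is a minimal Python sketch using the tritonclient package. The tensor names ("input_ids", "input_lengths", "request_output_len", "output_ids") follow Nvidia's fastertransformer_backend GPT examples, and the model name "fastertransformer" is an assumption; both must match the model's config.pbtxt in your deployment.

    # Sketch: querying a FasterTransformer model served by Triton over HTTP.
    # Tensor names and the model name are assumptions taken from the
    # fastertransformer_backend GPT examples; adjust to your config.pbtxt.
    import numpy as np
    import tritonclient.http as httpclient

    client = httpclient.InferenceServerClient(url="localhost:8000")

    # Token IDs for one prompt; in practice these come from your tokenizer.
    input_ids = np.array([[818, 262, 3726]], dtype=np.uint32)
    input_lengths = np.array([[input_ids.shape[1]]], dtype=np.uint32)
    request_output_len = np.array([[64]], dtype=np.uint32)  # tokens to generate

    inputs = []
    for name, arr in [("input_ids", input_ids),
                      ("input_lengths", input_lengths),
                      ("request_output_len", request_output_len)]:
        tensor = httpclient.InferInput(name, list(arr.shape), "UINT32")
        tensor.set_data_from_numpy(arr)
        inputs.append(tensor)

    result = client.infer(model_name="fastertransformer", inputs=inputs)
    print(result.as_numpy("output_ids"))  # generated token IDs; decode with your tokenizer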
Comments

Great video! It would be great if you could also share the code for porting/deploying FasterTransformer on Triton Inference Server, as publishing models there is slightly cumbersome and the tutorials don't do it justice.

stephennfernandes
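
A minimal sketch of the deployment step the comment above asks about, assuming the model repository layout from Nvidia's fastertransformer_backend examples (the directory names and the model name "fastertransformer" are assumptions, not code from the talk):

    # Sketch: verifying a FasterTransformer model repository loads in Triton.
    # Assumed layout, following the fastertransformer_backend examples:
    #   model_repository/
    #     fastertransformer/
    #       config.pbtxt   # declares backend: "fastertransformer", the I/O
    #                      # tensors, and parameters such as tensor_para_size
    #       1/             # version directory holding the converted FT checkpoint
    import tritonclient.http as httpclient

    client = httpclient.InferenceServerClient(url="localhost:8000")

    # After starting the server with
    #   tritonserver --model-repository=/path/to/model_repository
    # confirm the model came up before sending traffic.
    assert client.is_server_live()
    print(client.is_model_ready("fastertransformer"))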

Thank you very much! Do you know of any work porting LLaMA models to FasterTransformer?

SinanAkkoyun