Efficient Inference of Extremely Large Transformer Models

The rise of transformer-based language models has driven a boom in model sizes, since these models' performance scales remarkably well with size. This growth brings the challenge of making inference on such models more efficient. We'll show how these behemoth multi-billion-parameter models are optimized for production and how the inference tech stack is established. We'll cover the key ingredients in making these models faster, smaller, and more cost-effective, including model compression, efficient attention, and optimal model parallelism.
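
The sketch below is an illustrative example (not Cohere's implementation) of one ingredient named in the abstract, model compression: post-training int8 weight quantization of a linear layer with per-output-channel scales, which cuts weight storage roughly 4x relative to fp32 while keeping inference error small. All function names are hypothetical.

```python
# Minimal sketch of per-channel int8 weight quantization for a linear layer.
# Not Cohere's production code; names and shapes are illustrative assumptions.
import numpy as np

def quantize_per_channel(weight: np.ndarray):
    """Quantize a float32 weight matrix [out, in] to int8 with one scale per output channel."""
    scales = np.abs(weight).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(weight / scales), -127, 127).astype(np.int8)
    return q, scales.astype(np.float32)

def int8_linear(x: np.ndarray, q_weight: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Compute y = x @ W^T from int8 weights; the per-channel scales restore the float range."""
    return (x @ q_weight.T.astype(np.float32)) * scales.T

# Usage: a 4096x4096 layer shrinks from ~64 MB in fp32 to ~16 MB in int8 (plus scales).
rng = np.random.default_rng(0)
w = rng.standard_normal((4096, 4096)).astype(np.float32)
q, s = quantize_per_channel(w)
x = rng.standard_normal((2, 4096)).astype(np.float32)
print(np.max(np.abs(int8_linear(x, q, s) - x @ w.T)))  # small quantization error
```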

Bharat Venkitesh, Senior Machine Learning Engineer, Cohere