Efficient Inference of Extremely Large Transformer Models

The rise of transformer-based language models has driven a boom in model sizes, since these models' performance scales remarkably well with size. This growth brings the challenge of making inference on such models more efficient. We'll show how these behemoth multi-billion-parameter models are optimized for production and how the inference tech stack is established. We'll cover the key ingredients in making these models faster, smaller, and more cost-effective, including model compression, efficient attention, and optimal model parallelism.
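
The sketch below is an illustrative example (not Cohere's implementation) of one ingredient named in the abstract, model compression: post-training int8 weight quantization of a linear layer with per-output-channel scales, which cuts weight storage roughly 4x relative to fp32 while keeping inference error small. All function names are hypothetical.

```python
# Minimal sketch of per-channel int8 weight quantization for a linear layer.
# Not Cohere's production code; names and shapes are illustrative assumptions.
import numpy as np

def quantize_per_channel(weight: np.ndarray):
    """Quantize a float32 weight matrix [out, in] to int8 with one scale per output channel."""
    scales = np.abs(weight).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(weight / scales), -127, 127).astype(np.int8)
    return q, scales.astype(np.float32)

def int8_linear(x: np.ndarray, q_weight: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Compute y = x @ W^T from int8 weights; the per-channel scales restore the float range."""
    return (x @ q_weight.T.astype(np.float32)) * scales.T

# Usage: a 4096x4096 layer shrinks from ~64 MB in fp32 to ~16 MB in int8 (plus scales).
rng = np.random.default_rng(0)
w = rng.standard_normal((4096, 4096)).astype(np.float32)
q, s = quantize_per_channel(w)
x = rng.standard_normal((2, 4096)).astype(np.float32)
print(np.max(np.abs(int8_linear(x, q, s) - x @ w.T)))  # small quantization error
```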

Bharat Venkitesh, Senior Machine Learning Engineer, Cohere