Fast Inference of Mixture-of-Experts Language Models with Offloading

In this video we review a recent important paper titled "Fast Inference of Mixture-of-Experts Language Models with Offloading".
Mixture of Experts (MoE) is an important strategy for improving the efficiency of transformer-based large language models (LLMs).
However, MoE models usually have a large memory footprint, since the weights of all experts must be loaded. This makes it hard to run MoE models on low-tier GPUs.
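To make the routing idea concrete, here is a minimal, hypothetical sketch of an MoE feed-forward layer with top-2 routing in PyTorch. The layer sizes, expert count, and top-k choice are illustrative assumptions, not the exact Mixtral-8x7B configuration.

# Minimal sketch of a Mixture-of-Experts feed-forward layer with top-k routing.
# Sizes and top_k are illustrative assumptions, not Mixtral-8x7B's real config.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model=64, d_hidden=256, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)  # gating network scores each expert
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):  # x: (num_tokens, d_model)
        logits = self.router(x)                                # (num_tokens, num_experts)
        weights, idx = torch.topk(logits, self.top_k, dim=-1)  # pick top-k experts per token
        weights = F.softmax(weights, dim=-1)                   # normalize the selected scores
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                          # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, k:k+1] * expert(x[mask])
        return out

tokens = torch.randn(4, 64)
print(MoELayer()(tokens).shape)  # torch.Size([4, 64])

Only the selected experts run for each token, which is what makes MoE cheap at inference time even though the total parameter count (and hence the memory footprint) is large.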
This paper introduces a method for efficiently running transformer-based MoE LLMs in limited-memory environments using offloading techniques. Specifically, the researchers are able to run Mixtral-8x7B on the free tier of Google Colab.
In the video, we provide a reminder of how mixture of experts works, and then dive into the offloading method presented in the paper.
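The core offloading trick discussed in the video is to keep only a small per-layer working set of experts on the GPU and pull the others in from CPU RAM on demand, evicting the least-recently-used expert when the cache is full. Below is a hedged, simplified sketch of such an LRU expert cache; the ExpertCache class, cache size, and device handling are illustrative assumptions, not the authors' exact implementation.

# Simplified sketch of LRU expert offloading: keep at most `gpu_capacity` experts
# on the accelerator, move the least-recently-used one back to CPU RAM when needed.
from collections import OrderedDict
import torch
import torch.nn as nn

class ExpertCache:
    def __init__(self, experts, gpu_capacity=2, device="cuda"):
        self.experts = experts            # all expert modules, initially resident in CPU RAM
        self.capacity = gpu_capacity      # how many experts may live on the GPU at once
        self.device = device
        self.on_gpu = OrderedDict()       # expert_id -> module currently on the GPU (LRU order)

    def get(self, expert_id):
        if expert_id in self.on_gpu:                  # cache hit: mark as most recently used
            self.on_gpu.move_to_end(expert_id)
            return self.on_gpu[expert_id]
        if len(self.on_gpu) >= self.capacity:         # cache full: offload the LRU expert to CPU
            _, evicted = self.on_gpu.popitem(last=False)
            evicted.to("cpu")
        expert = self.experts[expert_id].to(self.device)  # host-to-device copy of the needed expert
        self.on_gpu[expert_id] = expert
        return expert

# Usage: fetch only the experts the router actually selected for a token.
device = "cuda" if torch.cuda.is_available() else "cpu"
experts = [nn.Linear(64, 64) for _ in range(8)]
cache = ExpertCache(experts, gpu_capacity=2, device=device)
x = torch.randn(1, 64, device=device)
for expert_id in (3, 5, 3, 1):                        # ids as a router might produce them
    y = cache.get(expert_id)(x)

Because consecutive tokens often reuse the same experts, an LRU-style cache avoids many host-to-GPU transfers; the paper combines this with prefetching and mixed quantization to reach usable generation speeds on consumer hardware.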
-----------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------
👍 Please like & subscribe if you enjoy this content
-----------------------------------------------------------------------------------------------
Chapters:
0:00 Paper Introduction
1:34 Mixture of Experts
3:44 MoE Offloading
10:29 Mixed MoE Quantization
11:13 Inference Speed