Fast Inference of Mixture-of-Experts Language Models with Offloading

In this video, we review a recent and important paper titled "Fast Inference of Mixture-of-Experts Language Models with Offloading".
Mixture of Experts (MoE) is an important strategy for improving the efficiency of transformer-based large language models (LLMs).
However, MoE models usually have a large memory footprint, since the weights of all experts must be loaded. This makes it hard to run MoE models on low-tier GPUs.
This paper introduces a method to efficiently run transformer-based MoE LLMs in a limited-memory environment using offloading techniques. Specifically, the researchers are able to run Mixtral-8x7B on the free tier of Google Colab.
In the video, we provide a reminder of how mixture of experts works, and then dive into the offloading method presented in the paper.
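To make the offloading idea concrete, here is a minimal, hypothetical sketch of the kind of LRU expert caching discussed in the video: all expert weights stay in CPU RAM, and only the experts the router actually selects for the current token are moved to the GPU, with the least recently used expert evicted when space runs out. The names used here (ExpertCache, cpu_experts, capacity) are illustrative and are not taken from the authors' code.

# Minimal sketch of LRU expert offloading (assumed names, not the authors' implementation).

from collections import OrderedDict
import torch
import torch.nn as nn


class ExpertCache:
    """Keeps at most `capacity` experts on the GPU, evicting the least recently used."""

    def __init__(self, cpu_experts, capacity=2, device="cuda"):
        self.cpu_experts = cpu_experts      # expert_id -> nn.Module kept in CPU RAM
        self.capacity = capacity            # how many experts fit in GPU memory
        self.device = device
        self.gpu_experts = OrderedDict()    # expert_id -> nn.Module currently on GPU

    def get(self, expert_id):
        # Cache hit: mark this expert as most recently used and return it.
        if expert_id in self.gpu_experts:
            self.gpu_experts.move_to_end(expert_id)
            return self.gpu_experts[expert_id]
        # Cache miss: evict the least recently used expert back to CPU RAM if full.
        if len(self.gpu_experts) >= self.capacity:
            _, evicted = self.gpu_experts.popitem(last=False)
            evicted.to("cpu")
        # Move the requested expert's weights to the GPU and track it in the cache.
        expert = self.cpu_experts[expert_id].to(self.device)
        self.gpu_experts[expert_id] = expert
        return expert


# Toy usage: 8 small "experts"; the router's choice decides which ones get loaded.
if torch.cuda.is_available():
    experts = {i: nn.Linear(16, 16) for i in range(8)}
    cache = ExpertCache(experts, capacity=2)
    hidden = torch.randn(1, 16, device="cuda")
    out = cache.get(3)(hidden)      # loads expert 3 to the GPU, then runs it

In this sketch, loading and eviction are synchronous; the method covered in the MoE Offloading and Mixed MoE Quantization chapters additionally overlaps transfers with computation and shrinks the expert weights with quantization.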

-----------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------

👍 Please like & subscribe if you enjoy this content

-----------------------------------------------------------------------------------------------
Chapters:
0:00 Paper Introduction
1:34 Mixture of Experts
3:44 MoE Offloading
10:29 Mixed MoE Quantization
11:13 Inference Speed
Comments

Very exciting work! The speeds the paper reports won't break any land speed records (2-3 tokens per second), but in my experience one of the most productive and practical applications of LLMs is prompting them with multiple-choice questions, which only require a single output token.

This paper (and the provided code!), bringing GPT-3.5 levels of inference to local consumer hardware, is a huge breakthrough, and I'm excited to give it a try!

winterclimber

I have been looking for a channel like this for ages, as I hate reading.

jacksonmatysik

I just found your channel. It is amazing. Congratulations. Your numbers will grow soon, I am sure. Great quality and great content.

fernandos-bs

I believe this is applicable only to a single request? If the experts change, you will most likely have many experts active across different requests. Is my understanding correct? Thank you.

ameynaik