Fast Inference of Mixture-of-Experts Language Models with Offloading

In this video, we review a recent and important paper titled "Fast Inference of Mixture-of-Experts Language Models with Offloading".
Mixture of Experts (MoE) is an important strategy for improving the efficiency of transformer-based large language models (LLMs).
However, MoE models usually have a large memory footprint, since the weights of all experts must be loaded. This makes it hard to run MoE models on low-tier GPUs.
This paper introduces a method to efficiently run transformer-based MoE LLMs in a limited-memory environment using offloading techniques. Specifically, the researchers are able to run Mixtral-8x7B on the free tier of Google Colab.
In the video, we provide a reminder of how mixture of experts works, and then dive into the offloading method presented in the paper.
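To make the offloading idea concrete, here is a minimal, hypothetical sketch of the kind of LRU expert caching discussed in the video: all expert weights stay in CPU RAM, and only the experts the router actually selects for the current token are moved to the GPU, with the least recently used expert evicted when space runs out. The names used here (ExpertCache, cpu_experts, capacity) are illustrative and are not taken from the authors' code.

# Minimal sketch of LRU expert offloading (assumed names, not the authors' implementation).

from collections import OrderedDict
import torch
import torch.nn as nn


class ExpertCache:
    """Keeps at most `capacity` experts on the GPU, evicting the least recently used."""

    def __init__(self, cpu_experts, capacity=2, device="cuda"):
        self.cpu_experts = cpu_experts      # expert_id -> nn.Module kept in CPU RAM
        self.capacity = capacity            # how many experts fit in GPU memory
        self.device = device
        self.gpu_experts = OrderedDict()    # expert_id -> nn.Module currently on GPU

    def get(self, expert_id):
        # Cache hit: mark this expert as most recently used and return it.
        if expert_id in self.gpu_experts:
            self.gpu_experts.move_to_end(expert_id)
            return self.gpu_experts[expert_id]
        # Cache miss: evict the least recently used expert back to CPU RAM if full.
        if len(self.gpu_experts) >= self.capacity:
            _, evicted = self.gpu_experts.popitem(last=False)
            evicted.to("cpu")
        # Move the requested expert's weights to the GPU and track it in the cache.
        expert = self.cpu_experts[expert_id].to(self.device)
        self.gpu_experts[expert_id] = expert
        return expert


# Toy usage: 8 small "experts"; the router's choice decides which ones get loaded.
if torch.cuda.is_available():
    experts = {i: nn.Linear(16, 16) for i in range(8)}
    cache = ExpertCache(experts, capacity=2)
    hidden = torch.randn(1, 16, device="cuda")
    out = cache.get(3)(hidden)      # loads expert 3 to the GPU, then runs it

In this sketch, loading and eviction are synchronous; the method covered in the MoE Offloading and Mixed MoE Quantization chapters additionally overlaps transfers with computation and shrinks the expert weights with quantization.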

-----------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------

👍 Please like & subscribe if you enjoy this content

-----------------------------------------------------------------------------------------------
Chapters:
0:00 Paper Introduction
1:34 Mixture of Experts
3:44 MoE Offloading
10:29 Mixed MoE Quantization
11:13 Inference Speed
Comments

Very exciting work! The speeds the paper reports won't break any land speed records (2-3 tokens per second), but in my experience one of the most productive and practical applications of LLMs is prompting them with multiple-choice questions, which only require a single output token.

This paper (and the provided code!), bringing GPT-3.5 levels of inference to local consumer hardware, is a huge breakthrough, and I'm excited to give it a try!

winterclimber

I have been looking for a channel like this for ages, as I hate reading.

jacksonmatysik

I just found your channel. It is amazing. Congratulations. Your numbers will grow soon, I am sure. Great quality and great content.

fernandos-bs

I believe this is applicable only to a single request? If the experts change, you will most likely have many experts active across different requests. Is my understanding correct? Thank you.

ameynaik