Understanding Mixture of Experts

Chapters
0:00 GPT-3, GPT-4 and Mixture of Experts
0:55 Why Mixture of Experts?
2:35 The idea behind Mixture of Experts
3:59 How to train MoE
5:41 Problems training MoE
7:54 Adding noise during training
9:06 Adjusting the loss function for router evenness
10:56 Is MoE useful for LLMs on laptops?
12:37 How might MoE help big companies like OpenAI?
14:22 Disadvantages of MoE
15:42 Binary tree MoE (fast feed forward)
18:15 Data on GPT vs MoE vs FFF
21:55 Inference speed up with binary tree MoE
23:48 Recap - Does MoE make sense?
25:05 Why might big companies use MoE?
Comments

This is one of the best explanations of MoE. It goes into enough depth to give a good idea of the internal workings, problems, and evaluation results. Great work!

TripUnico-personalizedTrip

You made a complex topic appear simple by giving just the right insight at the right time, hitting the sweet spot between indigestible and way too simplified. I was really wondering about the training process, and you gave invaluable insight into that. It is not made clear in the paper, and the code was also somewhat confusing. So, thanks for that, buddy.

HuxleyCrimson

One of the more approachable videos on the concept on YouTube.

maybmb_

Thank you for this accessible explanation of a somewhat complex subject.

pilotgfx

12:20 I heard that there is a minimum size for an expert to become reasonably functional.

It worked for GPT-4 because it had 1,800B parameters, which was more than it needed considering the size of the data set used.

However, splitting a 7B-parameter LLM like Mistral into 8 would make each expert less than 1B parameters. As a result it may have ~8x faster inference, but the performance of even the best expert chosen by the router would be much worse than the original 7B-parameter Mistral, or even a half-sized 3.5B Mistral. Even at 70B parameters (Llama 2), a mixture of experts would perform significantly worse on every prompt than the original 70B LLM, or even a half-sized 35B Llama 2.

It's not until the parameter count starts to exceed what is ideally required for the size of the input corpus that an MoE becomes reasonable. And even then, a 1,800B-parameter non-MoE GPT-4 would perform ~10% better than an MoE, but such a small bump in performance isn't worth the ~8x inference cost. And using a 225B non-MoE GPT-4 would perform much worse than the ideally chosen 225B expert. So in the end you get a notable bump in performance at the same inference cost.

Yet at 180B or less, a corpus capturing a web dump, thousands of books, etc. is too big for the model to be reasonably split into an MoE. Each expert needs to be larger than a minimum size (~100B or more) to capture the nuances of language and knowledge every expert requires as a base in order to respond as reasonably and articulately as GPT-4 does.
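
For anyone who wants to sanity-check this sizing argument, here is a rough back-of-envelope sketch in Python; moe_split is a hypothetical helper, it ignores the attention and embedding weights that experts share, and it assumes only the routed experts run per token.

def moe_split(total_params_b, n_experts, top_k=1):
    # Split a dense model's parameters evenly across n_experts and run top_k of them per token.
    per_expert = total_params_b / n_experts
    active = per_expert * top_k
    return per_expert, active, total_params_b / active

print(moe_split(7, 8))        # Mistral-7B split 8 ways -> ~0.9B per expert, ~8x fewer active params
print(moe_split(1800, 8, 2))  # a 1,800B model, 8 experts, top-2 -> 225B experts, 450B active per token

The exact ratios shift once shared weights are counted, but the arithmetic shows why each expert of a small model ends up far below the ~100B floor suggested above.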

brandon

Great video and a really clear description. Thanks a lot!

troy_neilson

Incredibly well made video. Thank you.

Shaunmcdonogh-shaunsurfing

I like how you think; you found a new sub.

keeganpenney

Matrices represent weights, not neurons. The biases of the neurons are represented as vectors that are added after multiplying by the matrix.
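
A minimal NumPy sketch of that point, with illustrative shapes:

import numpy as np

x = np.random.randn(4)        # input activations
W = np.random.randn(3, 4)     # weight matrix: 3 output neurons, 4 inputs each
b = np.random.randn(3)        # one bias per output neuron
y = np.maximum(0, W @ x + b)  # multiply by the matrix, add the bias vector, apply ReLU
print(y.shape)                # (3,)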

keypey

Loved your presentation... Mixtral mentions using TopK() for routing. How can such a method work if they use fast feed forward (where all routing decisions are binary)?
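
As far as the Mixtral paper describes it, the routing is ordinary top-k gating over a flat set of experts (a single linear router), not the binary-tree fast feed forward routing covered later in the video; the two are alternative schemes rather than something Mixtral combines. A minimal sketch of top-k gating, with illustrative names and shapes:

import numpy as np

def top_k_route(hidden, router_W, experts, k=2):
    logits = router_W @ hidden            # one logit per expert
    chosen = np.argsort(logits)[-k:]      # indices of the k highest-scoring experts
    gate = np.exp(logits[chosen])
    gate = gate / gate.sum()              # softmax over the chosen experts only
    # weighted sum of the chosen experts' outputs
    return sum(g * experts[i](hidden) for g, i in zip(gate, chosen))

# eight toy "experts", each just a random linear map for illustration
experts = [(lambda h, W=np.random.randn(16, 16): W @ h) for _ in range(8)]
out = top_k_route(np.random.randn(16), np.random.randn(8, 16), experts, k=2)
print(out.shape)                          # (16,)

A fast feed forward router instead makes a sequence of left/right decisions down a tree to reach a single leaf expert, so it never scores all experts at once.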

AkshayKumar-sdmx

Isn't MoE good at multi-task learning and multi-objective scenarios? Isn't that one of the main reasons to employ MoE? That was my understanding; it would be great to get your thoughts.

sampathkj

Very interesting! Would it not be worth testing with an introductory sentence that points to the subject of the chat vs. no such leading sentence?

konstantinlozev

Where does the router sit? Does it sit with every expert on a GPU, or does it sit on the CPU?

nishkarve

GPT-3 came out in the summer of 2020. Maybe you meant that ChatGPT came out in November of '22?

ResIpsa-pkih

The last time I had to deal with tokens, I was putting them in the skeeball machine at Chuck E. Cheese, lol. That was the last time. Oh no, there's macros. nm.

I came to learn about MoE, but got some interesting training on fast feed forward networks. Pretty cool. Might have to watch this again.

From what I'm learning, this can't use things like ControlNet or LoRA adapters, right?

Seems like MoE is only for the big boys - only someone able to afford a Blackwell or another recent big-dog GPU.

jeffg

Isn't a mixture of experts similar to a GAN, in that it has two networks that use each other to improve?

franktfrisby

Why 8 experts? Is there any structural consideration behind the choice?

JunjieCao-qu

Why not intentionally train each expert on a topic, to make it an expert in something?

ernststravoblofeld