Mixtral of Experts (Paper Explained)

We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference. Mixtral was trained with a context size of 32k tokens and it outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks. In particular, Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and multilingual benchmarks. We also provide a model fine-tuned to follow instructions, Mixtral 8x7B - Instruct, that surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and Llama 2 70B - chat model on human benchmarks. Both the base and instruct models are released under the Apache 2.0 license.
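
To make the routing step concrete, here is a minimal NumPy sketch of top-2 expert selection for a single token. It is an illustration under stated assumptions, not Mistral's implementation: the names `top2_moe_layer` and `w_gate` are made up for this example, the experts are plain linear maps standing in for the real feedforward blocks, and the softmax is assumed to be taken over the two selected router logits only.

```python
import numpy as np

def top2_moe_layer(x, w_gate, experts):
    """Route one token vector x through its two highest-scoring experts."""
    logits = x @ w_gate                       # one router score per expert
    top2 = np.argsort(logits)[-2:]            # indices of the two largest scores
    weights = np.exp(logits[top2] - logits[top2].max())
    weights /= weights.sum()                  # softmax over the selected two only (assumption)
    # Weighted sum of the two selected experts' outputs; the other six never run.
    out = sum(w * experts[i](x) for w, i in zip(weights, top2))
    return out, top2, weights

rng = np.random.default_rng(0)
d_model, n_experts = 16, 8                    # toy sizes, not Mixtral's
w_gate = rng.normal(size=(d_model, n_experts))
# Hypothetical stand-in experts: linear maps instead of real feedforward blocks.
expert_mats = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]
experts = [lambda v, m=m: v @ m for m in expert_mats]

x = rng.normal(size=d_model)
y, chosen, gate = top2_moe_layer(x, w_gate, experts)
print("experts used:", chosen, "gate weights:", gate)
```

Each token gets its own pair of experts, so two neighbouring tokens in the same layer can be processed by entirely different feedforward blocks.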

Authors: Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed


You know what they say. We only use 10% of our MOE.


~1:30, ~29:30 Mistral has been secretive about its training data from the start; even when asked directly on Discord, their answer was a non-answer.
~4:00 Yeah, Mixtral has 46.7B params (according to the HF param counter, which counts the data in the safetensors files).
~19:30 Yeah, G(x) is n-dimensional, so G(x)_i is a scalar. It would be "fun" if each value inside the input token vector were passed through every possible pair of experts, and the output vector consisted of values that were each routed through their own TOPK(2) experts (so o11 = e1(x1), o12 = e2(x1); y = [o11[0], o12[1]], where eN is expert N, x1 is the input, and y is the output).
~31:30 If they evaluate on the validation set of the Pile, it's safe to assume they trained on it too.
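
The 46.7B figure can be sanity-checked with some arithmetic. The sketch below assumes the commonly published Mixtral 8x7B configuration (hidden size 4096, feedforward size 14336, 32 layers, 32 query heads, 8 key/value heads of dimension 128, vocabulary 32000, untied output head, three weight matrices per expert). None of these values are stated in this thread, so treat them as assumptions.

```python
# Assumed Mixtral 8x7B hyperparameters (not stated in this thread).
D_MODEL = 4096
D_FF = 14336
N_LAYERS = 32
N_EXPERTS = 8
N_HEADS = 32
N_KV_HEADS = 8
HEAD_DIM = 128
VOCAB = 32000

def param_count(experts_counted):
    """Parameter total when `experts_counted` experts per layer are included."""
    # Query and output projections are full size; key and value use fewer heads.
    attn = 2 * D_MODEL * N_HEADS * HEAD_DIM + 2 * D_MODEL * N_KV_HEADS * HEAD_DIM
    expert = 3 * D_MODEL * D_FF            # gate, up and down projections
    router = D_MODEL * N_EXPERTS           # one score per expert
    norms = 2 * D_MODEL                    # two normalisation weight vectors per layer
    per_layer = attn + experts_counted * expert + router + norms
    embeddings = 2 * VOCAB * D_MODEL       # input embedding plus untied output head
    final_norm = D_MODEL
    return N_LAYERS * per_layer + embeddings + final_norm

total = param_count(N_EXPERTS)   # everything that must sit in memory
active = param_count(2)          # what a single token actually touches
print(f"total:  {total / 1e9:.1f}B")
print(f"active: {active / 1e9:.1f}B")
```

Under these assumptions the totals come out at roughly 46.7B stored and 12.9B active per token, in line with the "47B" and "13B" in the abstract. It also shows why "8x7B" is not 56B: attention, embeddings and norms are shared, and only the feedforward blocks are replicated.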


I think the concept of "clown car of experts" that hobbyists came up with from this might have some potential. It's about merging different feed-forward networks from existing pre-trained models together as experts, and just training the routing network to adapt to the experts. I played around with some, and it seems to work pretty well, much better than old-school merges.


Thanks for your service Yannic! Any chance you'll do a video on DPO? Seems promising, would love to see your explanation/take.


Yannic paper review = automatic like. Keep 'em coming !


I recently subscribed to your channel and was amazed by your explanation. I had seen a video before yours and understood maybe 10% or 20% of what MoE is. After watching your video, I understand not only MoE but Sparse MoE as well. Thank you. Keep it going.


I love the reutilization of older modes of modeling in novel ways.


What do you think the training of the router might look like?


Thanks. I really like this format and the length was perfect too 😎


I was expecting to see some patterns arising where different "experts" would gain expertise in different tasks. What if, instead of using the router, the "expert" were chosen at random? That would clearly demonstrate whether any expertise is truly emerging.


How much farther down the road can we go with splitting up the processing of a model? MoE allows me to run in system RAM on CPU, 4x faster than if the processing were monolithic. I'm wondering if one could run a gargantuan model in 1 TB of RAM on an ancient surplus server, with the model split into a couple hundred parts, only a couple of which run at once.


The video is really great. I learned a lot. You mentioned that "I've made videos in the past on mixture of experts, expert routing, etc." Could you please paste the link to those videos so that I can learn more? Thanks sooo much.



Really sad that data is not only getting gated but is now also being omitted. This will not lead to much progress outside of closed industry shops.


Thanks for the overview of this paper! I read it at one point because I was impressed with the results of the model itself, but it's always good to get a second look at it. Would that second look be a "Mixture of Experts"...?

Regardless, for anyone looking to do some homework and followup reading:
"Approximating Two-Layer Feedforward Networks for Efficient Transformers" has some pretty interesting findings about routing and scaling of MoE overall (apparently softmax is basically evil)
"Efficient shallow learning mechanism as an alternative to deep learning" argues that the brain isn't really comparable to a "deep" neural network and that a "wider" network may be more ideal for complex ideas. Depending on how far one was willing to stretch it, one could argue that MoE is an extension of or step in that direction.

I'd be really curious to see an overview of "Inferring neural activity before plasticity as a foundation for learning beyond backpropagation", though (unrelated to MoE). It's apparently an alternative to backpropagation that is more tolerant of higher learning rates and deeper networks, while giving less overshoot and catastrophic forgetting. I'm having some difficulty getting through the paper because it's quite dense, but I think it's a really different look at training.


Can you do the MoE-Mamba paper?

MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts


29:00, I think the input dataset is as important as the model design. Without good input, even the best models would fail to be interesting. With good data, even a poor model can perform in interesting ways. I totally agree that these researchers need to release their dataset since they're not actually providing their methods, needed for reproducibility.


Having played with sparse transformers using my own chess bot: a large model only partially performs like a smaller model. You still have to keep all the weights in your (GPU) memory, and that limits your batch sizes for training and context lengths for inference.


I find the "8 experts" terminology and "8x7B" notation quite confusing and misleading. I cannot say how many professional practitioners think it's "8 expert models" collaborating like an ensemble. It's actually 8 expert modules *per layer*, and there are 32 layers, so there are 32x8 = 256 independent "experts" in total, not 8. Plus, if you think of an end-to-end processing/activation path as one "expert", each token has (8 choose 2) = 28 possible expert pairs per layer, and with 32 layers the total number of expert paths a single token can take is 28^32. So all in all, it's either 8x32 = 256 experts or 28^32 end-to-end "expert" paths, but definitely not 8 experts in this model.
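
The counts in the comment above can be checked in a few lines. This is a minimal sketch using only the numbers given there (32 layers, 8 experts per layer, 2 chosen per token); the variable names are just for illustration.

```python
from math import comb

N_LAYERS = 32    # layers, as stated in the comment
N_EXPERTS = 8    # expert modules per layer
TOP_K = 2        # experts selected per token per layer

expert_modules = N_LAYERS * N_EXPERTS        # independent feedforward blocks
pairs_per_layer = comb(N_EXPERTS, TOP_K)     # unordered pairs of experts
paths = pairs_per_layer ** N_LAYERS          # end-to-end combinations for one token

print(expert_modules)        # 256
print(pairs_per_layer)       # 28
print(f"{paths:.1e}")        # on the order of 10^46
```

Note that this treats each pair as unordered and ignores the gate weights, which vary continuously, so 28^32 counts which blocks fire rather than every distinct computation.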
