Mixtral of Experts (Paper Explained)

#mixtral #mistral #chatgpt

OUTLINE:
0:00 - Introduction
3:00 - Mixture of Experts
6:00 - Classic Transformer Blocks
11:15 - Expert Routing
17:00 - Sparse Expert Routing
22:00 - Expert Parallelism
25:00 - Experimental Results
31:30 - Routing Analysis
33:20 - Conclusion

Abstract:
We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference. Mixtral was trained with a context size of 32k tokens and it outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks. In particular, Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and multilingual benchmarks. We also provide a model fine-tuned to follow instructions, Mixtral 8x7B - Instruct, that surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and Llama 2 70B - chat model on human benchmarks. Both the base and instruct models are released under the Apache 2.0 license.
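To make the mechanism in the abstract concrete, here is a minimal sketch of a sparse top-2 MoE feed-forward layer in plain Python/NumPy. It is not the official implementation: the dimensions are toy values rather than Mixtral's real ones, and a simple ReLU feed-forward block stands in for Mixtral's SwiGLU experts; it only illustrates the routing idea.

import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_experts, top_k = 16, 64, 8, 2    # toy sizes, not Mixtral's real dimensions

# One feed-forward block per expert (ReLU FFN standing in for SwiGLU).
W1 = rng.standard_normal((n_experts, d_model, d_ff)) * 0.02
W2 = rng.standard_normal((n_experts, d_ff, d_model)) * 0.02
W_gate = rng.standard_normal((d_model, n_experts)) * 0.02   # the router is a single linear layer

def moe_layer(x):
    # x: (d_model,) hidden state of one token at one layer
    logits = x @ W_gate                         # router scores for all 8 experts
    top = np.argsort(logits)[-top_k:]           # indices of the 2 selected experts
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()                                # softmax over the selected experts only
    out = np.zeros(d_model)
    for weight, e in zip(w, top):
        h = np.maximum(x @ W1[e], 0.0)          # expert e's feed-forward block
        out += weight * (h @ W2[e])             # weighted sum of the two expert outputs
    return out

print(moe_layer(rng.standard_normal(d_model)).shape)   # (16,) - same shape as the input state

Every token at every layer goes through this selection independently, which is why the total parameter count is much larger than the number of parameters actually used per token.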

Authors: Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed

Links:

If you want to support me, the best thing to do is to share out the content :)

If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
Comments


YannicKilcher

You know what they say. We only use 10% of our MOE.

theosalmon

~1:30 ~29:30 Mistral was secretive about training data from the start; even when asked directly on Discord, their answer was a non-answer.
~4:00 Yeah, Mixtral has 46.7B params (according to the HF param counter, which counts the tensors in the safetensors files) - a rough back-of-the-envelope count is sketched below.
~19:30 Yeah, G(x) is n-dimensional, so G(x)_i is a scalar. It would be "fun" if each value inside the input token vector were passed through every possible pair of experts and the output vector consisted of values where each value got routed through its own top-2 experts (so o11 = e1(x1), o12 = e2(x1); y = [o11[0], o12[1]], where eN is expert N, x1 the input, y the output).
~31:30 If they report validation results on The Pile, it's safe to assume they trained on it too.
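A rough back-of-the-envelope check of that 46.7B figure (my own arithmetic, using the publicly documented Mixtral configuration: dim 4096, FFN hidden size 14336, 32 layers, 8 KV heads with head dim 128, vocab 32000; norm parameters ignored):

dim, hidden, layers, vocab = 4096, 14336, 32, 32000
n_kv_heads, head_dim = 8, 128
experts, active_experts = 8, 2

attn = dim * dim + 2 * dim * (n_kv_heads * head_dim) + dim * dim   # Wq, Wk, Wv, Wo
expert_ffn = 3 * dim * hidden                                      # SwiGLU: w1, w2, w3
router = dim * experts

per_layer_total = attn + experts * expert_ffn + router
per_layer_active = attn + active_experts * expert_ffn + router
embeddings = 2 * vocab * dim                                       # token embedding + LM head

print((layers * per_layer_total + embeddings) / 1e9)    # ~46.7B total parameters
print((layers * per_layer_active + embeddings) / 1e9)   # ~12.9B active parameters per token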

AM-ykyd

I think the concept of a "clown car of experts" that hobbyists came up with from this might have some potential. It's about merging different feed-forward networks from existing pre-trained models together as experts, and just training the routing network to adapt to the experts. I played around with some, and it seems to work pretty well, much better than old-school merges.

stephaneduhamel

Thanks for your service Yannic! Any chance you'll do a video on DPO? Seems promising, would love to see your explanation/take.

siddharth-gandhi

Yannic paper review = automatic like. Keep 'em coming!

MultiMojo

I recently subscribed to your channel and was amazed by your explanation. I had seen a video before yours and understood maybe 10% or 20% of what MoE is. After watching your video I understand not only MoE but sparse MoE as well. Thank you. Keep it going.

bobaktadjalli

I love the reutilization of older modes of modeling in novel ways.

axe

What do you think the training of the router might look like?

tenaciousscaler

Thanks. I really like this format and the length was perfect too 😎

erikdahlen

I was expecting to see some patterns arise where different "experts" gain expertise in different tasks. What if, instead of using the router, the "expert" were chosen randomly? That would clearly demonstrate whether any expertise is truly emerging.

snoosnoo

How much farther down the road can we go with splitting up the processing of a model? MoE allows me to run it in system RAM on CPU, 4x faster than if the processing were monolithic. I'm wondering if one should run a gargantuan model in 1 TB of RAM on an ancient surplus server, but have the model split into a couple hundred parts, only a couple of which run at once.

theosalmon

The video is really great. I learned a lot. You mentioned that "I've made videos in the past on mixture of experts, expert routing, etc." Could you please paste the links to those videos so that I can learn more? Thanks sooo much.

YufeiWang-zt

*Summary*

*Introduction to Mixtral of Experts Model*
- 0:00 Discussion about the Mixtral of Experts model, built on the Mistral 7B architecture.
- 0:30 The paper is nicknamed "Don't Say Data" due to its lack of information on training data sources.

*Analysis of Data Source Disclosure Trends*
- 0:49 Observation of trends in professional criticism regarding AI training data sources.
- 1:40 Introduction of Mistral AI, a startup with an open-source approach, and its comparison to other AI startups.

*Overview of Mixtral Model and Its Features*
- 2:42 Explanation that Mixtral 8x7B is a Transformer with a mixture of experts architecture.
- 3:04 Mixtral model's performance, parameter count, and comparison with other models like Llama 2 70B and GPT-3.5.
- 4:16 Description of expert routing in the model, allowing the use of a subset of parameters per token for optimization.
- 5:02 Details of the model's decoder-only architecture and feature of picking from distinct parameter groups.

*Training Data and Multilingual Pre-training*
- 5:25 Mention of multilingual data used in pre-training the Mixtral model, without specific details on the data sources.

*Understanding Mixture of Experts in Transformer Models*
- 5:58 Explanation of the core components of classic Transformer models, focusing on attention and feed-forward layers.
- 8:17 Insight into the feed-forward network's role and parameter distribution in Transformer models.
- 11:15 Introduction to the concept of mixture of experts and its transformative effect on feed-forward networks.
- 12:56 Explanation of sparse mixture of experts and the role of a routing neural network in the process.

*Routing Mechanism in Mixtral Model*
- 15:03 Explanation of the weighted sum process in routing tokens to experts.
- 15:41 The routing network uses the input signal to determine the computation path for each token.
- 16:03 Analogy of distributing people to jobs based on their attributes to explain the routing process.
- 16:32 The routing function (F) is a small neural network determining the routing of tokens.
- 17:00 Discussion of the sparse expert routing mechanism and its computational efficiency.

*Details of the Mixture of Experts Mechanism*
- 17:37 Absence of entropy regularization in routing, which is often found in initial mixture of experts papers.
- 18:02 E_i denotes the output of each expert, with n representing the number of experts (the gating formula is sketched after this summary).
- 18:35 Clarification on a potential error in the paper regarding the output of the gating network.
- 19:43 The gating network involves a linear feed-forward layer.

*Model Parameterization and Efficiency*
- 20:11 Distinction between total and active parameter count in the model.
- 20:57 Explanation of processing each token individually through the feed-forward stage.
- 21:45 Discussion on the active parameter count and its dependence on the number of experts considered per token.
- 22:25 Description of expert parallelism for high throughput, involving different GPUs for each expert.

*Experimental Results and Performance Analysis*
- 25:03 Overview of experimental results comparing the Mixtral model with other models like Llama 2 and GPT-3.5.
- 26:08 Discussion on dynamic selection of active parameters for each token.
- 26:47 Results showing the model's capability in reasoning and retrieval tasks.
- 27:24 Analysis of perplexity decrease in relation to context length, emphasizing the importance of smart context selection.
- 28:46 Skepticism about the usefulness of bias benchmarks.
- 28:57 Mention of supervised fine-tuning on an instruction dataset and paired feedback dataset.
- 29:39 Commentary on the model's release under Apache License and its impact on the community.

*Reflections on the Release and Impact of the Model*
- 30:17 Discussion on the significance of releasing the model under a fully open license.
- 30:39 Appreciation for the model's release strategy, highlighting its impact on the community.

*Speculations on Business Strategy and Data Set Disclosure*
- 30:45 Speculation on the business value or risk related to the lack of disclosure about the training dataset.
- 31:05 Possibility that withholding dataset details might be a strategy to provoke critics or simply a choice to not disclose obvious sources.

*Routing Analysis in Mixtral Model*
- 31:31 Analysis of how tokens are routed to different experts in the Mixtral model.
- 31:45 Observation that there are no obvious patterns in expert assignments based on topics.
- 32:02 Notable regularities like consecutive tokens being assigned to the same expert and certain patterns in Python code token routing.
- 32:15 Consideration that the routing patterns might be either non-semantic or too complex for human interpretation.

*Conclusion and Future Outlook*
- 33:23 Mention of additional analysis available in the paper's appendix.
- 33:36 Positive outlook on the model's open-source release and its potential for new applications.
- 33:55 Discussion on the non-disclosure of the training data as a potentially smart but non-scientific approach.
- 34:12 Invitation for feedback on the best applications of Mixtral and anticipation for future open-source AI developments.

Disclaimer: I used chatgpt4 to summarize the video transcript. This
method may make mistakes in recognizing words.
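For reference, the gating discussed in the summary around 17:00-19:43 can be written compactly. This is my transcription of the formula as presented, so treat it as a sketch rather than a verbatim quote from the paper:

y = \sum_{i=0}^{n-1} \operatorname{Softmax}\bigl(\operatorname{TopK}(x \cdot W_g)\bigr)_i \cdot \operatorname{SwiGLU}_i(x), \qquad n = 8,\; K = 2 \text{ for Mixtral 8x7B}

where TopK keeps the K largest router logits and sets the rest to -infinity, so the softmax puts all of its mass on the two selected experts.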

wolpumba

Really sad that data is not only getting gated but now also being omitted. This will not lead to much progress outside of closed industry shops.

BobaQueenPanda

Thanks for the overview of this paper! I read it at one point because I was impressed with the results of the model itself, but it's always good to get a second look at it. Would that second look be a "Mixture of Experts"...?

Regardless, for anyone looking to do some homework and followup reading:
"Approximating Two-Layer Feedforward Networks for Efficient Transformers" has some pretty interesting findings about routing and scaling of MoE overall (apparently softmax is basically evil)
"Efficient shallow learning mechanism as an alternative to deep learning" argues that the brain isn't really comparable to a "deep" neural network and that a "wider" network may be more ideal for complex ideas. Depending on how far one was willing to stretch it, one could argue that MoE is an extension of or step in that direction.

I'd be really curious to see an overview of "Inferring neural activity before plasticity as a foundation for learning beyond backpropagation", though (unrelated to MoE). It's apparently an alternative to backpropagation that is more tolerant of higher learning rates and deeper networks while giving less overshoot and less catastrophic forgetting. I'm having some difficulty getting through the paper because it's quite dense, but I think it's a really different look at training.

IsaiahGossner

Can you do the MoE-Mamba paper?

MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts

dibbidydoo

29:00, I think the input dataset is as important as the model design. Without good input, even the best models would fail to be interesting; with good data, even a poor model can perform in interesting ways. I totally agree that these researchers need to release their dataset, since without it they're not actually providing the methods needed for reproducibility.

dennisestenson

Having played with sparse transformers in my own chess bot: even if a large model that is only partially active computes like a smaller model, you still have to keep all the weights in your (GPU) memory, and that does limit your batch sizes for training and your context lengths for inference.

vladimirtchuiev

I find the "8 experts" terminology and the "8x7B" notation quite confusing and misleading. I can't count how many professional practitioners think it's 8 expert models collaborating in an ensemble-like way. It's actually 8 expert modules *per layer*, and there are 32 layers, so a total of 32x8 = 256 independent "experts", not 8. Plus, if you really think of an end-to-end processing/activation path as one "expert", each token has (8 choose 2) = 28 possible expert pairs per layer, and there are 32 layers, so the number of expert paths a single token can take is 28^32. So all in all, it's either 8x32 = 256 experts or 28^32 experts, but definitely not 8 experts in this model.
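The counting in that comment checks out; a quick standard-library check of the same arithmetic:

import math
print(math.comb(8, 2))   # 28 possible pairs of experts per layer
print(8 * 32)            # 256 independent expert FFN modules across 32 layers
print(28 ** 32)          # distinct per-token expert-pair paths through all 32 layers (~2e46)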

bajdoub