1 Million Tiny Experts in an AI? Fine-Grained MoE Explained


Mixture of Experts explained, well, re-explained. We are in the fine-grained era of Mixture of Experts, and it's about to get even more interesting as we scale it up further.
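For anyone who wants the core idea in code: the sketch below shows a minimal fine-grained MoE layer, i.e. a router that activates only the top-k of many small experts per token. The class name and sizes (FineGrainedMoE, d_expert, top_k, and so on) are made up for illustration and don't reproduce the exact formulation of any paper linked below.

```python
# Illustrative sketch only: many small experts, a router that keeps the top-k per token.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FineGrainedMoE(nn.Module):
    def __init__(self, d_model=512, n_experts=256, d_expert=64, top_k=8):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)                     # scores every expert
        self.w_in = nn.Parameter(torch.randn(n_experts, d_model, d_expert) * 0.02)
        self.w_out = nn.Parameter(torch.randn(n_experts, d_expert, d_model) * 0.02)

    def forward(self, x):                                               # x: (tokens, d_model)
        scores = self.router(x)                                         # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)                  # keep only top-k experts
        weights = F.softmax(weights, dim=-1)                            # gate weights for those k
        out = torch.zeros_like(x)
        for t in range(x.shape[0]):                                     # naive loop for clarity
            for w, e in zip(weights[t], idx[t]):
                h = F.gelu(x[t] @ self.w_in[e])                         # tiny expert FFN
                out[t] += w * (h @ self.w_out[e])
        return out

moe = FineGrainedMoE()
tokens = torch.randn(4, 512)
print(moe(tokens).shape)  # torch.Size([4, 512])
```

The naive per-token loop is only for readability; real implementations batch the expert computation and usually add a load-balancing loss so the router doesn't collapse onto a handful of experts.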

This video was sponsored by Brilliant

Check out my newsletter:

Special thanks to LDJ for helping me with this video

Mixtral 8x7B Paper

Sparse MoE (2017)

Adaptive Mixtures of Local Experts (1991)

GShard

Branch-Train-MiX

DeepSeekMoE

MoWE (from the meme at 7:51)

Mixture of A Million Experts

This video is supported by the kind Patrons & YouTube Members:
🙏Andrew Lescelius, alex j, Chris LeDoux, Alex Maurice, Miguilim, Deagan, FiFaŁ, Robert Zawiasa, Daddy Wen, Tony Jimenez, Panther Modern, Jake Disco, Demilson Quintao, Shuhong Chen, Hongbo Men, happi nyuu nyaa, Carol Lo, Mose Sakashita, Miguel, Bandera, Gennaro Schiano, gunwoo, Ravid Freedman, Mert Seftali, Mrityunjay, Richárd Nagyfi, Timo Steiner, Henrik G Sundt, projectAnthony, Brigham Hall, Kyle Hudson, Kalila, Jef Come, Jvari Williams, Tien Tien, BIll Mangrum, owned, Janne Kytölä, SO, Hector, Drexon, Claxvii 177th, Inferencer, Michael Brenner, Akkusativ, Oleg Wock, FantomBloth

[Music] massobeats - daydream
[Video Editor] @Askejm
Comments

Like this comment if you wanna see more MoE-related content, I have quite a good list for a video ;)

bycloudAI

Imagine assembling 1 million PhD students together to discuss someone's request like "write a poem about cooking eggs with C++". That's MoE irl

progameryt

The only thing in my mind is "MoE moe"

maickelvieira

to some extent this seems closer to how brains work

gemstone

i see what you did there with "catastrophic forgetting" lmao 🤣

randomlettersqzkebkw

1991... We are standing on the shoulders of giants.

ChristophBackhaus

It's crazy how Meta's 8B-parameter Llama 3 model has nearly the same performance as the original GPT-4 with its rumored 1.8T parameters.

That's a 225x reduction in parameter count in just 2 years.

GeoMeridium

I watch your videos yet I have no idea what you are explaining 99% of the time. 🙃

Saphir__

Now I'm really excited for an 800B fine-grained MoE model to surface that I can run on basically any device.

Quantum_Nebula

This video format is GOLD 🏆 such specific and nerdy topics produced as memes 😄

AkysChannel

3:37 wasn't it just yesterday that they released their model? 😭

lazyalpaca

Thank u for linking the papers in the description ❤

farechildd

I watch you so that I feel smart, it really works!

cdkw

Damn... You blew my mind with the 1 million experts and forever-learning thing

simeonnnnn

In a very real sense, the MoME concept is similar to diffusion networks. On their own, the tiny expert units are but grains of noise in an ocean, and the routing itself is the thing being trained. Whether or not it's more efficient than having a monolithic neural net with simpler computation units, I dunno. I suspect, like most things in ML, there is probably a point of diminishing returns.
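For anyone wondering what "the routing itself is the thing being trained" could look like concretely, here is a rough, hypothetical sketch: a huge pool of single-neuron experts selected by a learned dot-product lookup over expert keys. It is only the bare idea, not the product-key retrieval used in the actual Mixture of A Million Experts paper, and all sizes are invented for illustration.

```python
# Rough sketch: a big pool of single-neuron "experts"; the learned routing
# (a dot-product lookup over expert keys) does the heavy lifting.
import torch
import torch.nn.functional as F

d_model, n_experts, top_k = 256, 100_000, 16

keys  = torch.randn(n_experts, d_model) * 0.02   # learned routing keys
w_in  = torch.randn(n_experts, d_model) * 0.02   # each expert: one input neuron...
w_out = torch.randn(n_experts, d_model) * 0.02   # ...and one output neuron

def tiny_expert_layer(x):                        # x: (d_model,)
    scores = keys @ x                            # score every tiny expert
    top_scores, idx = scores.topk(top_k)         # keep only the top-k of them
    gates = F.softmax(top_scores, dim=-1)
    acts = F.gelu(w_in[idx] @ x)                 # (top_k,) single-neuron activations
    return (gates * acts) @ w_out[idx]           # weighted sum of output neurons

x = torch.randn(d_model)
print(tiny_expert_layer(x).shape)  # torch.Size([256])
```

With experts this small, almost all of the interesting learned structure ends up in the keys, which is the point the comment above is making.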

KCMNJL

Was hoping someone would make a video on this! Thank you! Would love to see you cover Google's new Diffusion Augmented Agents paper.

pathaleyguitar

Yo dog, I heard you liked AI so we put an AI inside your AI which has an AI in the AI which can AI another AI so that you can AI while you AI.

shApYT

Mixture of a million experts just sounds like a sarcastic description of Reddit

NIkolla

Bro, did you read about Lory? It merges experts with soft merging, building on several papers. Lory is a fresh coat of paint on a method developed for vision AI, making soft merging possible for LLMs. ❤
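To sketch what "soft merging" is gesturing at: instead of hard-routing each token to a few experts, blend the expert weight matrices themselves with the router's probabilities, so the whole layer stays differentiable. The snippet below is only that bare idea with made-up sizes, not Lory's actual segment-level merging procedure.

```python
# Generic soft-merging sketch: the router's probabilities blend expert weights
# into one temporary FFN, so no hard, non-differentiable expert selection happens.
import torch
import torch.nn.functional as F

d_model, d_hidden, n_experts = 128, 512, 4

router = torch.randn(n_experts, d_model) * 0.02
w_in   = torch.randn(n_experts, d_model, d_hidden) * 0.02
w_out  = torch.randn(n_experts, d_hidden, d_model) * 0.02

def soft_merged_ffn(x):                                     # x: (d_model,)
    probs = F.softmax(router @ x, dim=-1)                   # soft routing weights
    merged_in  = torch.einsum("e,eij->ij", probs, w_in)     # one blended expert
    merged_out = torch.einsum("e,eij->ij", probs, w_out)
    return F.gelu(x @ merged_in) @ merged_out

x = torch.randn(d_model)
print(soft_merged_ffn(x).shape)  # torch.Size([128])
```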

ickorling

Actually a really cool idea, I liked the DeepSeekMoE version too, it's so clever

soraygoularssm