Mixture-of-Depths

Mixture-of-Depths: Dynamically allocating compute in transformer-based language models
Comments

10:09 I think you mentioned earlier that you didn't make this channel to explain the basics, since there are many videos out there that already do. I wouldn't force you to make any, but I think your presentation style, even without the visuals, would catapult your explanations of the basics to something much better than most other videos (I've watched a lot of them). You have a knack for giving a well-rounded, complete explanation for everything, including the big picture. That shit has a LOT of value even in the sea of existing content. That said, it'd be for a different audience than your usual one, and if you don't feel this is something you want to do, then you 100% shouldn't.

Elikatie

I had a similar idea, but instead of layers stacked on top of each other I had in mind a bag of layers on the same level. Before this huge bag is a router that decides which layers to route each token to, and at the end of the bag is another router that decides whether the result is the final output or whether we need another iteration.
This architecture would automatically decide how many layers it needs for each level of abstraction (the architecture in the paper seems like it can also do that, though). My motivation was that sometimes we ask LLMs questions that don't need much thinking, like "Hello. Who are you?". To answer these simple questions one layer would be enough, and for more complicated questions we could add layers to the bag and train them along with the routers.

KennethFeur
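
A minimal sketch of the "bag of layers" idea described in the comment above, assuming PyTorch. The class and parameter names (BagOfLayers, max_iters, halt_threshold) are made up for illustration and are not from the Mixture-of-Depths paper; the hard argmax routing is shown only to make the data flow visible, since a real version would need a differentiable or otherwise trainable routing scheme.

import torch
import torch.nn as nn

class BagOfLayers(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_layers=4,
                 max_iters=3, halt_threshold=0.5):
        super().__init__()
        # The "bag": layers sit side by side instead of being stacked.
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        ])
        self.layer_router = nn.Linear(d_model, n_layers)  # which layer gets each token
        self.halt_router = nn.Linear(d_model, 1)          # stop, or do another pass?
        self.max_iters = max_iters
        self.halt_threshold = halt_threshold

    def forward(self, x):  # x: (batch, seq, d_model)
        for _ in range(self.max_iters):
            # Route every token to its highest-scoring layer (hard argmax,
            # non-differentiable; shown only to illustrate the idea).
            choice = self.layer_router(x).argmax(dim=-1)   # (batch, seq)
            out = torch.zeros_like(x)
            for i, layer in enumerate(self.layers):
                mask = (choice == i).unsqueeze(-1)         # (batch, seq, 1)
                out = torch.where(mask, layer(x), out)
            x = out
            # Second router: stop iterating once the mean halt score is high enough.
            if torch.sigmoid(self.halt_router(x)).mean() > self.halt_threshold:
                break
        return x

A toy forward pass would be BagOfLayers()(torch.randn(2, 16, 256)). The part the sketch glosses over is exactly what the comment raises: making the routing and halting decisions trainable, so that easy prompts really do exit after a single pass.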

I think FLOPs refers to the total number of floating-point operations performed by a network, while FLOPS refers to the computational throughput of a GPU (floating-point operations per second).

ShaohuaDong
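
A back-of-the-envelope illustration of that distinction: FLOPs measure total work, FLOPS measure a rate, so dividing one by the other gives time. The numbers below are made up for the example and not taken from the paper or the video.

# Hypothetical numbers, only to show FLOPs (work) vs FLOPS (throughput).
forward_pass_flops = 2 * 7e9 * 1024    # ~2 * params * tokens for one forward pass
gpu_flops_per_second = 300e12          # a GPU sustaining 300 TFLOPS
seconds = forward_pass_flops / gpu_flops_per_second
print(f"{seconds * 1000:.1f} ms")      # work / rate = time (~47.8 ms here)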

I wonder if propagating data back using a similar mechanism could result in better models that can loop and reason deeply about stuff using loops between internal layers. Nowadays models can do a crude version of loops, like iterating through a math problem step by step, but that way they are constrained to "quantized" representations of the world in the form of tokens. Maybe letting models iterate in the hidden space could improve their quality; it also introduces a whole new set of problems, like "how to deal with infinite loops", but I'm sure it's worth a try.

BHBalast
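
A minimal sketch of "iterating in the hidden space" with a guard against infinite loops, assuming PyTorch. The names (LoopedBlock, max_loops, stop_threshold) are hypothetical; the halting rule (a learned stop probability plus a hard cap on iterations) is just one simple way to handle the infinite-loop problem the comment mentions, in the spirit of adaptive-computation-time approaches.

import torch
import torch.nn as nn

class LoopedBlock(nn.Module):
    def __init__(self, d_model=256, n_heads=4, max_loops=8, stop_threshold=0.5):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.stop_head = nn.Linear(d_model, 1)   # learned "am I done thinking?" signal
        self.max_loops = max_loops
        self.stop_threshold = stop_threshold

    def forward(self, x):  # x: (batch, seq, d_model)
        # Re-apply the same block to the hidden states instead of emitting tokens,
        # stopping when the stop signal fires or when the hard cap is reached.
        for _ in range(self.max_loops):
            x = self.block(x)
            if torch.sigmoid(self.stop_head(x)).mean() > self.stop_threshold:
                break
        return x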