Mamba: Linear-Time Sequence Modeling with Selective State Spaces (Paper Explained)

#mamba #s4 #ssm

OUTLINE:
0:00 - Introduction
0:45 - Transformers vs RNNs vs S4
6:10 - What are state space models?
12:30 - Selective State Space Models
17:55 - The Mamba architecture
22:20 - The SSM layer and forward propagation
31:15 - Utilizing GPU memory hierarchy
34:05 - Efficient computation via prefix sums / parallel scans
36:01 - Experimental results and comments
38:00 - A brief look at the code

Abstract:
Foundation models, now powering most of the exciting applications in deep learning, are almost universally based on the Transformer architecture and its core attention module. Many subquadratic-time architectures such as linear attention, gated convolution and recurrent models, and structured state space models (SSMs) have been developed to address Transformers' computational inefficiency on long sequences, but they have not performed as well as attention on important modalities such as language. We identify that a key weakness of such models is their inability to perform content-based reasoning, and make several improvements. First, simply letting the SSM parameters be functions of the input addresses their weakness with discrete modalities, allowing the model to selectively propagate or forget information along the sequence length dimension depending on the current token. Second, even though this change prevents the use of efficient convolutions, we design a hardware-aware parallel algorithm in recurrent mode. We integrate these selective SSMs into a simplified end-to-end neural network architecture without attention or even MLP blocks (Mamba). Mamba enjoys fast inference (5× higher throughput than Transformers) and linear scaling in sequence length, and its performance improves on real data up to million-length sequences. As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics. On language modeling, our Mamba-3B model outperforms Transformers of the same size and matches Transformers twice its size, both in pretraining and downstream evaluation.
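For intuition, here is a minimal sequential sketch (in NumPy) of the selective recurrence the abstract describes: the projections that produce B, C and the step size delta all read the current input, while A stays a learned per-channel parameter. The names W_B, W_C, W_dt, b_dt and the shapes are illustrative only; the actual layer replaces this token-by-token loop with a fused, hardware-aware parallel scan.

import numpy as np

def selective_ssm(x, A, W_B, W_C, W_dt, b_dt):
    """Sequential reference for a selective SSM layer (simplified).
    x:    (L, D) input sequence, L tokens with D channels
    A:    (D, N) learned state matrix (diagonal per channel)
    W_B:  (D, N) projection producing the input-dependent B_t
    W_C:  (D, N) projection producing the input-dependent C_t
    W_dt: (D,)   projection (plus bias b_dt) producing the step size delta_t
    """
    L, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))                              # recurrent state: O(D*N) memory
    y = np.zeros((L, D))
    for t in range(L):
        # Selection: B_t, C_t and delta_t are functions of the current input x_t.
        B_t = x[t] @ W_B                              # (N,)
        C_t = x[t] @ W_C                              # (N,)
        delta = np.log1p(np.exp(x[t] * W_dt + b_dt))  # softplus, (D,)
        # Discretize the continuous-time parameters with step delta.
        A_bar = np.exp(delta[:, None] * A)            # (D, N)
        B_bar = delta[:, None] * B_t[None, :]         # (D, N)
        # Linear recurrence: one constant-size state update per token.
        h = A_bar * h + B_bar * x[t][:, None]
        y[t] = h @ C_t                                # (D,)
    return y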

Authors: Albert Gu, Tri Dao

Links:

If you want to support me, the best thing to do is to share out the content :)

If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
Comments

I have to say, your return to making more frequent videos is making me very happy. I used to watch your videos before reading the papers.

HoriaCristescu

This is a great video with a great rundown of Mamba. I was traveling when the Mamba paper came out and coincidentally stumbled upon this video today. It was a big time-saver to catch me up on the gist of it. I'll make sure to watch more of your videos in the future. Big thumbs up!

SebastianRaschka

These kinds of videos are great on an early Christmas morning. You know you are still not really awake. You won't get it anyway. But it kickstarts your brain into work mode.

OperationDarkside

11:15 They did experiments up to 3 billion parameters, IIRC. There is a Mamba-3B model available on Hugging Face, at least.

stephaneduhamel

This is definitely the best explanation video for Mamba I've seen. Thank you!

YangLi-gwnb

This is gonna be crazy if you think about it. It's like you could initialize an "Assistant" or agent with a huge prompt, but rather than including that information every time, you "save" that state space to save on compute when generating the next tokens, because the prompt doesn't need to be re-processed every time. This also means that agents could all have their own different personalities and behaviors without significant fine-tuning requirements.
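A rough sketch of that caching idea, assuming a hypothetical step-wise interface (model.initial_state, model.step and encode are made-up names, not a real API): the system prompt is processed once, the fixed-size state is snapshotted, and every later request starts from that snapshot instead of re-reading the prompt.

def build_agent_state(model, system_prompt):
    # Run the long prompt through the model once; the state stays fixed-size.
    state = model.initial_state()
    for token in encode(system_prompt):
        _, state = model.step(token, state)
    return state                                   # reusable "personality" snapshot

def generate(model, saved_state, user_tokens, n_new):
    # Start from the snapshot rather than re-processing the system prompt.
    state = saved_state.copy()
    logits = None
    for token in user_tokens:                      # assumes at least one user token
        logits, state = model.step(token, state)
    out = []
    for _ in range(n_new):
        token = int(logits.argmax())               # greedy decoding, for simplicity
        out.append(token)
        logits, state = model.step(token, state)
    return out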

albertmashy

Papers! TBH, I don’t even watch any other vids on this channel.

vzxvzvcxasd

I could see this as a "memory" architecture for an actual transformer, remembering distinctive contexts for a long time, while using transformers for the much more complicated and sophisticated logical reasoning where directed focus and attention are needed.

jonatani

I'd love another video diving deeper!

barni_

I think this is very similar to "Retentive Network", which Yannic covered a few months ago. The state transition model reminds me of a linear Kalman filter. Anyway, I cannot believe a single memory vector can carry all the necessary information for every token, one size fits all.

kimchi_taco

I work with state space models as a control/optimization engineer on a daily basis. But that diagram of the state space model has got to be the most confusing thing I’ve seen in my life lol

josephle

Finally, I knew I could count on you!

Summersault

Thank you for the paper review, it always helps!! Happy holidays to everyone 🍾

pladselsker

More than the merits and demerits of transformers, the best part is how it spans modalities: text, audio, and voice clips.

bibhabasumohapatra

This looks a lot like the state-space representation from control theory. What they are presenting is basically a learnable dynamical system with a linear state transition matrix A and an input-dependent input matrix B, which makes the overall map non-linear, and the same goes for the observation matrix C. Looks like a massive upgrade over transformers for stuff like music generation, maybe even ViT-based models. What isn't clear to me is how they learn the A matrix; it seems that the farther away the context is, the more severe the vanishing gradient problem, and the nearest elements in the sequence are by far the most significant.
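For reference, a tiny numerical illustration of that control-theory view (all values made up): a discrete linear state-space system h_t = A h_{t-1} + B u_t, y_t = C h_t. In the paper, A itself stays input-independent and is learned in continuous form, then discretized with an input-dependent step size; with its eigenvalues inside the unit circle, powers of A decay geometrically, so distant inputs fade rather than blow up.

import numpy as np

N = 4                                              # state dimension (illustrative)
A = np.diag(np.exp(-np.linspace(0.1, 1.0, N)))     # stable: eigenvalues in (0, 1)
B = 0.5 * np.ones(N)                               # input matrix
C = np.ones(N) / N                                 # observation matrix

h = np.zeros(N)
u = [1.0, 0.0, 0.0, 0.0, 0.0]                      # impulse input
for t, u_t in enumerate(u):
    h = A @ h + B * u_t                            # state update
    y_t = C @ h                                    # observation
    print(t, round(float(y_t), 4))                 # response decays like A^t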

vladimirtchuiev

Very happy transformers aren't the only game in town.

jsalsman

Interesting that A^n actually works for long sequences. I would have expected a severe degradation of performance as sequences get longer...

TheEbbemonster

I think something is off about your explanation of the A_t prefix products around 35min. The dimensions given in Algorithm 2 imply that A remains constant across timesteps, since it has no L component.
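On that point: in the paper A itself has no length dimension, but the discretized A_bar_t = exp(Delta_t * A) does vary per timestep through Delta_t, and those per-step factors are what the scan composes. The recurrence h_t = a_t * h_{t-1} + b_t becomes a prefix "sum" under an associative operator on (a, b) pairs; here is a scalar sketch of that operator (not the paper's fused kernel):

import numpy as np

def combine(left, right):
    # Compose two affine maps h -> a*h + b (left applied first); associative.
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)

def inclusive_scan(pairs):
    # Sequential reference; a real kernel applies `combine` in a parallel tree.
    out = [pairs[0]]
    for p in pairs[1:]:
        out.append(combine(out[-1], p))
    return out

rng = np.random.default_rng(0)
a = rng.uniform(0.5, 0.99, size=8)                 # per-step decay (discretized A_t)
b = rng.normal(size=8)                             # per-step input term (B_t * x_t)

h_scan = [hb for _, hb in inclusive_scan(list(zip(a, b)))]

h, h_seq = 0.0, []                                 # direct recurrence for comparison
for a_t, b_t in zip(a, b):
    h = a_t * h + b_t
    h_seq.append(h)

print(np.allclose(h_scan, h_seq))                  # True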

jonsimonatwork

Interesting topic and research explained in this paper. I recommend watching!

РудаковАртем