MAMBA and State Space Models explained | SSM explained

We simply explain and illustrate Mamba, State Space Models (SSMs) and Selective SSMs.
SSMs match the performance of transformers but are faster and more memory-efficient. This is crucial for long sequences!

Thanks to our Patrons who support us in Tier 2, 3, 4: 🙏
Dres. Trost GbR, Siltax, Vignesh Valliappan, Michael

Outline:
00:00 Mamba to replace Transformers!?
02:04 State Space Models (SSMs) – high level
03:09 State Space Models (SSMs) – more detail
05:45 Discretization step in SSMs
08:14 SSMs are fast! Here is why.
09:55 SSM training: Convolution trick
12:01 Selective SSMs
15:44 MAMBA Architecture
17:57 Mamba results
20:15 Building on Mamba
21:00 Do RNNs have a comeback?
21:42 AICoffeeBreak Merch

Great resources to learn about Mamba:

▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
🔥 Optionally, pay us a coffee to help with our Coffee Bean production! ☕
Join this channel to get access to perks:
▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀

🔗 Links:

#AICoffeeBreak #MsCoffeeBean #MachineLearning #AI #research

Scientific advising by Mara Popescu
Video editing: Nils Trost
Music 🎵 : Sunny Days – Anno Domini Beats
Comments

I have a question. Given that SSMs are entirely linear, how do they conform to the universal approximation theorem? I mean, the lack of a non-linear activation should imply that they are particularly bad at approximating functions, but they are not.
Am I missing something?

Also really loved the video!

drummatick
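
For context on the question above: the discretized SSM recurrence on its own is indeed linear in the input,

h_t = A_bar·h_{t-1} + B_bar·x_t,    y_t = C·h_t,

but a Mamba block is not purely linear: it wraps the SSM with SiLU/Swish activations and a multiplicative gating branch, and in the selective SSM the parameters B, C and the step size Δ are themselves computed from the input x_t. A stack of such blocks therefore contains the non-linearities needed for expressive function approximation.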

Thanks! Looking forward to a Hyena video :)

partywen

I have to give a presentation on Mamba next week and I've been waiting for this video so I could finally learn what the hell I need to talk about

ShadowHarborer

Thank you for the shoutout to my repo!
I later realized it was an application of a known idea, the "heisen sequence", which is a pretty cool way to do certain associative scan operations via cumsum.

peabrane
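
As a generic illustration of that idea (not the code from the repo mentioned above): a first-order linear recurrence h_t = a_t·h_{t-1} + b_t can be evaluated for all t at once with cumulative products and sums instead of a sequential loop. A minimal NumPy sketch, assuming positive decay factors a_t:

import numpy as np

def scan_sequential(a, b, h0=0.0):
    # Plain loop: h_t = a_t * h_{t-1} + b_t
    h, out = h0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return np.array(out)

def scan_cumulative(a, b, h0=0.0):
    # Closed form: h_t = A_t * (h0 + sum_{j<=t} b_j / A_j) with A_t = prod_{i<=t} a_i,
    # computed with cumprod/cumsum only, i.e. without a step-by-step dependency.
    A = np.cumprod(a)
    return A * (h0 + np.cumsum(b / A))

a = np.random.uniform(0.5, 1.0, size=8)
b = np.random.randn(8)
assert np.allclose(scan_sequential(a, b), scan_cumulative(a, b))

In practice this is done in log space (cumulative sums of logarithms) for numerical stability over long sequences, which is the kind of cumsum-based associative-scan trick the comment refers to.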

This is exactly the level of detail I needed right now. Thank you so much!

jamescunningham

Thank you very much for this thorough, well-curated, and comprehensive review of MAMBA.

OlgaIvina

Hats off to you for this amazing video! Best explanation of Mamba I have seen.

cosmic_reef_

I was waiting for exactly this topic! Thanks so much!

DerPylz

Nice video and a good overview, which is exactly what I was searching for.

faysoufox

A big thanks for a comprehensive explanation of the Mamba Architecture & computations, @AICoffeeBreak!

ruchiradhar

Nice T-shirt! So excited to hear about new models!

harumambaru

Awesome video! I especially like the simple explanation and the visuals.

Emresessa

Thank you! This is by far the easiest-to-understand and most concise video that teaches the concepts of SSMs

李洛克-mu

This explanation was excellent. Thank you very much :)

hannes

Thank you so much!! You really simplified it so that any beginner-level deep learner can understand.

kumarivin

Thanks for the MAMBA video!

I always appreciate your insight on these new, influential papers! Your thoughts always pair well with a good cup of coffee. 😁☕️

MaJetiGizzle

Great.
There are a lot of failed explanations and completely wrong approaches to SSMs and Mamba on the internet, but I finally found exactly what I wanted.
Thank you for the video.

고준성-mg

@AICoffeeBreak, thank you for the awesome video. One very small pet peeve that had me re-check all the math: at 11:20, the explanation would be much easier to follow if you kept x 0-indexed, since that is the notation you had been using from the beginning. Also, maybe make it explicit that you're taking t = L, although this is kind of obvious. This was an awesome lecture, thank you again.

rodrigomeireles
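
For reference, with 0-indexed inputs x_0, ..., x_L, a zero initial state, and discretized parameters A_bar, B_bar, unrolling h_t = A_bar·h_{t-1} + B_bar·x_t up to the last step t = L gives

y_L = C·A_bar^L·B_bar·x_0 + C·A_bar^(L-1)·B_bar·x_1 + ... + C·B_bar·x_L,

which is presumably the sum being discussed around 11:20.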

Thank you for the great Mamba explanation

serta

I'm not entirely sure how SSMs differ from RNNs, especially regarding how attention is being used. There's still the bottleneck of h_t to h_{t+1} between time steps, which was one of the motivations for the attention layer: so that information in one part of the sequence doesn't have to be squeezed through that bottleneck before being combined with information from another part of the sequence.
Is the main innovation from RNN to SSM the fixed delta, A, B, C formulation, such that training can be done in parallel for all time steps?

darkswordsmith
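
To make the parallel-training point concrete: with fixed (non-selective) delta, A, B, C, the same outputs can be computed either step by step like an RNN or, equivalently, as one convolution over the input, so training does not need a sequential loop. A minimal NumPy sketch with a scalar state, scalar parameters, and zero initial state (a simplification for illustration):

import numpy as np

L_seq = 6
A_bar, B_bar, C = 0.9, 0.5, 2.0   # fixed, discretized scalar parameters
x = np.random.randn(L_seq)

# Recurrent view (sequential): h_t = A_bar * h_{t-1} + B_bar * x_t,  y_t = C * h_t
h, y_rec = 0.0, []
for t in range(L_seq):
    h = A_bar * h + B_bar * x[t]
    y_rec.append(C * h)

# Convolutional view (parallelizable): y = x * K with kernel K_k = C * A_bar**k * B_bar
K = C * (A_bar ** np.arange(L_seq)) * B_bar
y_conv = [np.dot(K[:t + 1][::-1], x[:t + 1]) for t in range(L_seq)]

assert np.allclose(y_rec, y_conv)

The fixed-size state bottleneck between steps is indeed still there; that is exactly why Mamba makes delta, B, C input-dependent (selective), so the model can decide what to keep in the state. Selectivity breaks the fixed convolution kernel above, so Mamba trains with a hardware-aware parallel (associative) scan instead, but the contrast with a classic RNN remains: the state update is linear in h_t, with no non-linearity between time steps, and that linearity is what makes these parallel formulations possible.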