Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Foundation models, now powering most of the exciting applications in deep learning, are almost universally based on the Transformer architecture and its core attention module. Many subquadratic-time architectures such as linear attention, gated convolution and recurrent models, and structured state space models (SSMs) have been developed to address Transformers’ computational inefficiency on long sequences, but they have not performed as well as attention on important modalities such as language. A key weakness of such models is their inability to perform content-based reasoning, and this paper makes several improvements to address it. First, simply letting the SSM parameters be functions of the input addresses their weakness with discrete modalities, allowing the model to selectively propagate or forget information along the sequence length dimension depending on the current token. Second, even though this change prevents the use of efficient convolutions, the authors design a hardware-aware parallel algorithm in recurrent mode. They integrate these selective SSMs into a simplified end-to-end neural network architecture without attention or even MLP blocks (Mamba). Mamba enjoys fast inference (5× higher throughput than Transformers) and linear scaling in sequence length, and its performance improves on real data up to million-length sequences. As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics. On language modeling, the Mamba-3B model outperforms Transformers of the same size and matches Transformers twice its size, both in pretraining and downstream evaluation.
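The key idea above, letting the SSM parameters depend on the current input, can be illustrated with a short sketch. The following NumPy code is a minimal, illustrative version of the selective recurrence only: the shapes, projection weights, and simplified discretization are assumptions made for clarity, and the paper's actual implementation computes the same recurrence as a fused, hardware-aware parallel scan rather than a Python loop.

```python
# Minimal sketch of a selective SSM (S6-style) recurrence.
# All shapes, the per-channel diagonal A, and the simple exponential/Euler
# discretization are illustrative assumptions, not the paper's kernel.
import numpy as np

def selective_ssm(x, A, W_delta, W_B, W_C):
    """x: (L, D) input sequence; A: (D, N) fixed diagonal state matrix per channel.
    W_delta: (D, D), W_B / W_C: (D, N) projections that make Delta, B, C input-dependent."""
    L, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))                            # hidden state per channel
    y = np.zeros((L, D))
    for t in range(L):
        xt = x[t]                                   # (D,)
        delta = np.log1p(np.exp(xt @ W_delta))      # softplus: input-dependent step size, (D,)
        B_t = xt @ W_B                              # input-dependent B, (N,), shared across channels
        C_t = xt @ W_C                              # input-dependent C, (N,)
        A_bar = np.exp(delta[:, None] * A)          # discretized state transition, (D, N)
        B_bar = delta[:, None] * B_t[None, :]       # simplified (Euler) discretization of B, (D, N)
        h = A_bar * h + B_bar * xt[:, None]         # selective recurrence
        y[t] = (h * C_t[None, :]).sum(axis=1)       # readout
    return y

# Tiny usage example with random weights.
rng = np.random.default_rng(0)
L, D, N = 16, 4, 8
x = rng.standard_normal((L, D))
A = -np.abs(rng.standard_normal((D, N)))            # negative entries keep the state stable
y = selective_ssm(x, A, rng.standard_normal((D, D)) * 0.1,
                  rng.standard_normal((D, N)) * 0.1, rng.standard_normal((D, N)) * 0.1)
print(y.shape)  # (16, 4)
```

Because Delta, B, and C change at every step, the convolution kernel of an ordinary LTI SSM can no longer be precomputed, which is why the paper falls back to a recurrent (scan) formulation and optimizes it for the hardware.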

In this video, I talk about the following: What is a linear state-space system? What is a linear state-space layer (LSSL)? What is a Structured State Space sequence (S4) model? What is the problem with S4 models? What are S6 models? What is Mamba’s architecture? How does Mamba perform?
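As background for the S4-versus-S6 part of that list: a linear time-invariant SSM with fixed A, B, C can be computed either as a recurrence or, equivalently, as a single convolution with a precomputed kernel, and that convolutional shortcut is exactly what input-dependent (selective) parameters give up. The sketch below checks this equivalence on a toy diagonal SSM; all shapes and the discretization step are illustrative assumptions, not the paper's implementation.

```python
# Toy check that an LTI (S4-style) SSM can run in convolutional or recurrent mode.
# Diagonal A and a fixed step size delta are simplifying assumptions.
import numpy as np

def lti_ssm_kernel(A, B, C, delta, L):
    """Materialize the convolution kernel K_k = C . (A_bar^k B_bar) for k = 0..L-1."""
    A_bar = np.exp(delta * A)                           # (N,) discretized diagonal transition
    B_bar = delta * B                                   # (N,)
    powers = A_bar[None, :] ** np.arange(L)[:, None]    # (L, N)
    return (powers * B_bar * C).sum(axis=1)             # (L,)

rng = np.random.default_rng(0)
N, L = 8, 32
A = -np.abs(rng.standard_normal(N))
B, C = rng.standard_normal(N), rng.standard_normal(N)
x = rng.standard_normal(L)

# Convolutional mode: one precomputed kernel applied to the whole sequence.
K = lti_ssm_kernel(A, B, C, delta=0.1, L=L)
y_conv = np.convolve(x, K)[:L]

# Recurrent mode: step-by-step state update gives the same output.
h, y_rec = np.zeros(N), np.zeros(L)
A_bar, B_bar = np.exp(0.1 * A), 0.1 * B
for t in range(L):
    h = A_bar * h + B_bar * x[t]
    y_rec[t] = (C * h).sum()

print(np.allclose(y_conv, y_rec))  # True
```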

Gu, Albert, and Tri Dao. "Mamba: Linear-Time Sequence Modeling with Selective State Spaces." arXiv preprint arXiv:2312.00752 (2023).
Comments

astaragmohapatra: Great explanation. Looking forward to the Mamba 2 paper.

vini: Excellent explanation! Please make a video on Mamba 2 as well.