Efficient Streaming Language Models with Attention Sinks

This video covers StreamingLLM, an efficient framework that lets large language models trained with a finite attention window generalize to infinite sequence length in streaming applications without fine-tuning. By keeping the key/value states of a few initial "attention sink" tokens alongside a rolling cache of the most recent tokens, it avoids both the memory blow-up of caching every past token and the performance collapse of plain window attention, achieving stable and efficient language modeling on effectively unbounded text.
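For intuition on the rolling KV cache with attention sinks covered at 11:31, here is a minimal Python sketch; the class and parameter names are illustrative assumptions, not the paper's actual implementation:

```python
from collections import deque

class RollingKVCache:
    """Sketch of a rolling KV cache: the first few "attention sink"
    tokens are kept forever, while later tokens roll through a
    fixed-size window (oldest non-sink entry is evicted first).
    All names here are hypothetical, for illustration only."""

    def __init__(self, num_sinks=4, window_size=1020):
        self.num_sinks = num_sinks                # initial tokens kept permanently
        self.sinks = []                           # KV entries for the sink tokens
        self.window = deque(maxlen=window_size)   # recent tokens, FIFO eviction

    def append(self, kv_entry):
        # First `num_sinks` entries become permanent attention sinks;
        # everything after that flows through the sliding window.
        if len(self.sinks) < self.num_sinks:
            self.sinks.append(kv_entry)
        else:
            self.window.append(kv_entry)          # deque drops the oldest entry

    def current(self):
        # Cache visible to the model: sinks + most recent window tokens.
        return self.sinks + list(self.window)
```

Under these assumptions the cache never grows past num_sinks + window_size entries, so memory stays constant no matter how long the stream runs.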

00:00 Section 1: Introduction
08:14 Section 3: StreamingLLM
11:31 Section 3.2: Rolling KV Cache with Attention Sinks
17:35 Section 4.2: Results of Pre-Training with a Sink Token
22:37 Cache Sizes