Efficient Streaming Language Models with Attention Sinks

This video covers StreamingLLM, an efficient framework that lets large language models trained with a finite attention window generalize to infinite sequence length in streaming applications without fine-tuning. By keeping the key/value states of a few initial "attention sink" tokens alongside a rolling cache of the most recent tokens, it avoids both the memory blow-up of caching every past token and the performance collapse of plain window attention, achieving stable and efficient language modeling on effectively unbounded text.
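For intuition on the rolling KV cache with attention sinks covered at 11:31, here is a minimal Python sketch; the class and parameter names are illustrative assumptions, not the paper's actual implementation:

```python
from collections import deque

class RollingKVCache:
    """Sketch of a rolling KV cache: the first few "attention sink"
    tokens are kept forever, while later tokens roll through a
    fixed-size window (oldest non-sink entry is evicted first).
    All names here are hypothetical, for illustration only."""

    def __init__(self, num_sinks=4, window_size=1020):
        self.num_sinks = num_sinks                # initial tokens kept permanently
        self.sinks = []                           # KV entries for the sink tokens
        self.window = deque(maxlen=window_size)   # recent tokens, FIFO eviction

    def append(self, kv_entry):
        # First `num_sinks` entries become permanent attention sinks;
        # everything after that flows through the sliding window.
        if len(self.sinks) < self.num_sinks:
            self.sinks.append(kv_entry)
        else:
            self.window.append(kv_entry)          # deque drops the oldest entry

    def current(self):
        # Cache visible to the model: sinks + most recent window tokens.
        return self.sinks + list(self.window)
```

Under these assumptions the cache never grows past num_sinks + window_size entries, so memory stays constant no matter how long the stream runs.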

00:00 Section 1: Introduction
08:14 Section 3: StreamingLLM
11:31 Section 3.2: Rolling KV Cache with Attention Sinks
17:35 Section 4.2: Results of Pre-Training with a Sink Token
22:37 Cache Sizes