[QA] Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs

The paper investigates extreme-token phenomena in transformer-based LLMs, tracing attention sinks to an active-dormant mechanism in attention heads and proposing pretraining strategies that mitigate their impact.
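As a minimal illustrative sketch (not taken from the paper), the snippet below measures one symptom of an attention sink: the fraction of attention mass each head assigns to the first token of a sequence. The model name, example text, and reporting format are assumptions for demonstration only.

```python
# Sketch: measure how much attention each head places on the first ("sink") token.
# Assumptions: a HuggingFace causal LM (here "gpt2") with output_attentions enabled.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative choice; any causal LM exposing attentions works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_attentions=True)
model.eval()

text = "Summer is warm. Winter is cold."
inputs = tok(text, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs)

# out.attentions is a tuple over layers, each of shape (batch, heads, query_len, key_len).
for layer_idx, attn in enumerate(out.attentions):
    # Average, over all query positions, the attention mass placed on the first token.
    sink_mass = attn[0, :, :, 0].mean(dim=-1)  # shape: (num_heads,)
    print(f"layer {layer_idx}: mean sink mass per head = {sink_mass.mean():.3f}")
```

Heads whose sink mass stays near 1 on most inputs behave like the dormant heads the title refers to, though confirming that requires the paper's fuller analysis.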
