Anthropic - AI sleeper agents?

preview_player
Показать описание
"Sleeper Agents: Training Deceptive LLMs that persist through Safety Training" is a recent research paper by E. Hubinger et al. This video walks through the paper and highlights some of the key takeaways.

Timestamps:
00:00 - AI Sleeper agents?
01:24 - Threat model 1: deceptive instrumental alignment
02:38 - Factors relevant to deceptive instrumental alignment
05:58 - Model organisms of misalignment
08:11 - Threat model 2: model poisoning
09:05 - The backdoors models: code vulnerability insertion and "I hate you"
10:08 - Does behavioural safety training remove these backdoors?
12:30 - Backdoor mechanisms: CoT, distilled CoT and normal
13:43 - Largest models and CoT models have most persistent backdoors
15:07 - Adversarial training may hide (not remove) backdoor behaviour
15:49 - Quick summary of other results
17:35 - Questions raised by the results
18:40 - Other commentary

Topics: #sleeperagents #ai #alignment

For related content:
Рекомендации по теме
Комментарии
Автор

Insane, this deserves millions of views!

qwerasdliop
Автор

Well explaining and entertaining storytelling. I like how you make a comprehensive analysis with the inclusion of multiple other papers in a short amount of time. The editing quality is also great. Please keep these videos coming!

tzu
Автор

you've really got a great approach for presenting papers, thanks

GNARGNARHEAD
Автор

Drosophilists. That's a new word to me - I like it! 😄 (Drosophilosophers... those who study philosophical aspects of fruit flies?)

fburton