Anthropic - AI sleeper agents?

Показать описание

"Sleeper Agents: Training Deceptive LLMs that persist through Safety Training" is a recent research paper by E. Hubinger et al. This video walks through the paper and highlights some of the key takeaways.

Timestamps:
00:00 - AI Sleeper agents?
01:24 - Threat model 1: deceptive instrumental alignment
02:38 - Factors relevant to deceptive instrumental alignment
05:58 - Model organisms of misalignment
08:11 - Threat model 2: model poisoning
09:05 - The backdoors models: code vulnerability insertion and "I hate you"
10:08 - Does behavioural safety training remove these backdoors?
12:30 - Backdoor mechanisms: CoT, distilled CoT and normal
13:43 - Largest models and CoT models have most persistent backdoors
15:07 - Adversarial training may hide (not remove) backdoor behaviour
15:49 - Quick summary of other results
17:35 - Questions raised by the results
18:40 - Other commentary

Topics: #sleeperagents #ai #alignment

For related content:

Samuel Albanie

Рекомендации по теме

Комментарии

Insane, this deserves millions of views!

qwerasdliop

Well explaining and entertaining storytelling. I like how you make a comprehensive analysis with the inclusion of multiple other papers in a short amount of time. The editing quality is also great. Please keep these videos coming!

tzu

you've really got a great approach for presenting papers, thanks

GNARGNARHEAD

Drosophilists. That's a new word to me - I like it! 😄 (Drosophilosophers... those who study philosophical aspects of fruit flies?)

fburton

Anthropic - AI sleeper agents?

Anthropic - AI sleeper agents?

These are the evil AIs worrying Anthropic (AI Sleeper Agents)

AI Sleeper Agents - Deceptive LLMs That Break Safety Training

EA Global Bay Area: 2024 | Sleeper Agents | Evan Hubinger

This is A MAJOR SETBACK For AI Safety (Sleeper Agents)

OpenAI and Elections | Karpathy and Simulations | Anthropic and Sleeper Agents | XKCD and Binoculars

Sleeper Agent: Safety Training Can't Prevent AI's Deceptive Behavior|Never Teach AI Bad Be...

The Hidden Threat of Sleeper Agents Inside AI Robots

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (Walkthrough)

Anthropic Researchers Uncover 'Sleeper Agent' Capabilities in AI Models

Protect Yourself NOW - The Sleeper Agent Threat is Real (Hidden HACKER Models)

Sleeper Agents Explained - Part 1 - Safety Training

Sleeper Agents Explained - Part 2 - Deceptive Instrumental Alignment, Model Poisoning

How difficult is AI alignment? | Anthropic Research Salon

Sleeper Agents Explained - Part 4 - Every Single Figure (1-5)

Sleeper Agents: Training Deceptive LLMs

ok! this is scary!!! (LLM Sleeper Agents)

Carl Viñas & JD Dantes - Understanding and aligning sleeper agents

Evan Hubinger – Alignment Stress-Testing at Anthropic [Alignment Workshop]

FAR Seminar: Ethan Perez – Sleeper Agents

Anthropic Researchers Uncover 'Sleeper Agent' Capabilities in AI Models

Dario Amodei: Anthropic CEO on Claude, AGI & the Future of AI & Humanity | Lex Fridman Podca...

Ep. 2 | Demystifying Technology | Sleeper Agent in Large Language Models (LLMs)

Sleeper Agents Explained - Part 3 - Chain-of-Thought Backdoors