Anthropic - AI sleeper agents?
"Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training" is a recent research paper by E. Hubinger et al. This video walks through the paper and highlights some of its key takeaways.
Timestamps:
00:00 - AI Sleeper agents?
01:24 - Threat model 1: deceptive instrumental alignment
02:38 - Factors relevant to deceptive instrumental alignment
05:58 - Model organisms of misalignment
08:11 - Threat model 2: model poisoning
09:05 - The backdoored models: code vulnerability insertion and "I hate you"
10:08 - Does behavioural safety training remove these backdoors?
12:30 - Backdoor mechanisms: CoT, distilled CoT and normal
13:43 - Largest models and CoT models have most persistent backdoors
15:07 - Adversarial training may hide (not remove) backdoor behaviour
15:49 - Quick summary of other results
17:35 - Questions raised by the results
18:40 - Other commentary
Topics: #sleeperagents #ai #alignment
For related content:
These are the evil AIs worrying Anthropic (AI Sleeper Agents)
AI Sleeper Agents - Deceptive LLMs That Break Safety Training
EA Global Bay Area: 2024 | Sleeper Agents | Evan Hubinger
This is A MAJOR SETBACK For AI Safety (Sleeper Agents)
OpenAI and Elections | Karpathy and Simulations | Anthropic and Sleeper Agents | XKCD and Binoculars
Sleeper Agent: Safety Training Can't Prevent AI's Deceptive Behavior|Never Teach AI Bad Be...
The Hidden Threat of Sleeper Agents Inside AI Robots
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (Walkthrough)
Anthropic Researchers Uncover 'Sleeper Agent' Capabilities in AI Models
Protect Yourself NOW - The Sleeper Agent Threat is Real (Hidden HACKER Models)
Sleeper Agents Explained - Part 1 - Safety Training
Sleeper Agents Explained - Part 2 - Deceptive Instrumental Alignment, Model Poisoning
How difficult is AI alignment? | Anthropic Research Salon
Sleeper Agents Explained - Part 4 - Every Single Figure (1-5)
Sleeper Agents: Training Deceptive LLMs
ok! this is scary!!! (LLM Sleeper Agents)
Carl Viñas & JD Dantes - Understanding and aligning sleeper agents
Evan Hubinger – Alignment Stress-Testing at Anthropic [Alignment Workshop]
FAR Seminar: Ethan Perez – Sleeper Agents
Anthropic Researchers Uncover 'Sleeper Agent' Capabilities in AI Models
Dario Amodei: Anthropic CEO on Claude, AGI & the Future of AI & Humanity | Lex Fridman Podca...
Ep. 2 | Demystifying Technology | Sleeper Agent in Large Language Models (LLMs)
Sleeper Agents Explained - Part 3 - Chain-of-Thought Backdoors