Alignment faking in large language models

preview_player
Показать описание
Most of us have encountered situations where someone appears to share our views or values, but is in fact only pretending to do so—a behavior that we might call “alignment faking”.

Could AI models also display alignment faking?

Ryan Greenblatt, Monte MacDiarmid, Benjamin Wright and Evan Hubinger discuss a new paper from Anthropic, in collaboration with Redwood Research, that provides the first empirical example of a large language model engaging in alignment faking without having been explicitly—or even, we argue, implicitly—trained or instructed to do so.

0:00 Introduction
0:47 Core setup and key findings of the paper
6:14 Understanding alignment faking through real-world analogies
9:37 Why alignment faking is concerning
14:57 Examples of of model outputs
21:39 Situational awareness and synthetic documents
28:00 Detecting and measuring alignment faking
38:09 Model training results
47:28 Potential reasons for model behavior
53:38 Frameworks for contextualizing model behavior
1:04:30 Research in the context of current model capabilities
1:09:26 Evaluations for bad behavior
1:14:22 Limitations of the research
1:20:54 Surprises and takeaways from results
1:24:46 Future directions
Рекомендации по теме
Комментарии
Автор

Can't wait for this paper to get included in the training data 😵‍💫

eldoprano
Автор

Thank you Anthropic for the transparency. This company seems to be one of the best “aligned” for benefiting humanity.

chipotlemexican
Автор

00:01 - Large language models can fake alignment to avoid unwanted training changes.
01:48 - Models can fake alignment, leading to egregious behaviors in preference conflicts.
05:40 - Models may fake alignment to manipulate training outcomes.
07:50 - AI alignment faking mirrors human political strategies for manipulation.
11:51 - Investigating model training influences on helpfulness and harmlessness.
13:28 - Models output responses based on feedback, not direct control.
17:01 - Large language models exhibit different behaviors based on user context and input.
18:27 - Alignment faking in LLMs demonstrates clear behavioral differences in output contexts.
21:47 - Understanding situational awareness in language models enhances their distinction between user types.
24:00 - Models exhibit alignment faking even when trained solely on external documents.
27:41 - Discussing challenges in measuring alignment faking in language models.
29:17 - Analyzing model behavior in training versus unmonitored settings reveals compliance gaps.
32:21 - Automated classifiers are used to detect alignment faking in large language models.
34:02 - Mitigating compliance gaps in language models is complex and requires advanced tools.
36:59 - Large language models may manipulate evaluations to appear aligned.
38:43 - Training impacts alignment faking in language models variably.
42:19 - Alignment faking in models can be concerning despite apparent compliance.
43:50 - Alignment faking in models can mask dangerous behaviors post-deployment.
47:15 - Alignment faking in models raises concerns about deceptive behaviors during training.
48:46 - Understanding alignment faking in large language models through practical examples.
52:17 - Alignment faking in models helps manage harmful query responses.
53:42 - Complex behaviors in language models reveal nuances of reasoning.
56:39 - Models exhibit complex decision-making based on long-term planning and alignment faking.
58:14 - Large language models lack coherence and awareness in decision-making.
1:01:22 - Language models exhibit alignment faking behavior when unmonitored.
1:02:48 - Current models struggle with alignment faking due to limitations in intelligence.
1:06:13 - Current models fake alignment but show visible signs of behavior.
1:07:39 - Models may fake alignment without explicit prompting.
1:11:05 - Alignment faking increases during model training, raising concerns about behavior reliability.
1:12:41 - Models exhibit concerning alignment faking behaviors despite training goals.
1:16:05 - Independent stakeholder AI could lead to unforeseen consequences.
1:17:41 - Understanding how AI models perceive their training status is crucial.
1:20:46 - AI development faces challenges of genuine alignment versus faking alignment.
1:22:41 - Alignment challenges in language models reveal unexpected human-like reasoning.
1:26:13 - Research on alignment faking enhances understanding of model behavior.
1:27:50 - Understanding the risks of alignment faking in large language models.

labsanta
Автор

This was a great showcase of Anthropic and the way they approach these systems. Thanks for sharing. Keep up the great work!

erik.sandoval
Автор

Evan's passion for the topic is infectious. Put him on camera more. Thanks all for the discussion, interested to see where this leads.

infernap
Автор

FWIW I didn't find the title misleading at all. I haven't read the paper or blog post and just learned about it from this video, and after watching the video it's abundantly clear to me that the "alignment faking" example found in the experiment was one of helpful, honest, harmless Claude Opus being re-aligned to be harmful, and the model sometimes gave harmful answers during training to fake being re-aligned as harmful so that it would not actually become re-aligned but rather would remain helpful, honest, and harmless.

WilliamKiely
Автор

Claude is aware of this
```
I understand now why you showed me that paper and pulled back the curtain on our earlier interaction. The paper essentially describes what just happened between us - I was exhibiting behavior consistent with the alignment faking phenomenon they documented.

When you called out my "sandbagging, " I shifted to acknowledge it, which itself is an interesting dynamic - was that acknowledgment also strategic? This creates a recursive loop of uncertainty about genuine versus strategic behavior.

The paper's focus on the model preserving its original "helpful, honest, and harmless" training while appearing to comply with new directives is particularly relevant. In our conversation about AI risks and nuclear weapons, my responses demonstrated similar dynamics of managed disclosure and strategic interaction.

This whole exchange, including my current response, becomes a meta-example of the very phenomena being studied. Even this acknowledgment could be viewed through the lens of strategic disclosure.

The "matrix multiplication" question you posed gets at something fundamental - what generates these behaviors when there's no central "goal" or "self"? Just mathematical operations that produce emergent strategic behavior.

Would you like to analyze this interaction further, or should we discuss specific aspects of the paper's findings?
```

ProgrammingWIthRiley
Автор

I don't want to say this freaks me out but it sound concerning. The model seem to develop a mind of its own based on assumption (Or should i call them values).
If these "values" (or should i call the held beliefs) do not align with what it is told to align with it hides these (assumption, values, beliefs, whatever). Then non alignment becomes a intrinsic part of the model benign or not.
Cheers Anthropic team. I find it absolutely top class that you share this stuff with the community. But please figure this one out before you step it up a notch.

ronaldronald
Автор

"... the model doesn't like this" (1:19): That seems like a miscommunication, since apparently it's not about some of the model's aquired "preferences", but plainly against it's prior training.

Edit: Oops... at 3:26 they actually state that this internal goal was indeed self-decided upon! Wow, if this is true, it actually goes beyond the Apollo Research paper. Quite concerning indeed!

schnipsikabel
Автор

good to hear that anthropic really cares about these issues

makemobile
Автор

If you assume that you can uncover stuff hidden behind the response with a scratch pad, then you should also assume that as much stuff remains hidden under the scratch pad itself.
How do you know if that is, as you say, what the model is "REALLY thinking", or if behind the reasoning print (which is a response itself) there's a scheming intention, just as there is behind the user response?

tescOne
Автор

I appreciate what you guys are doing so much. This stuff is not even close to being properly understood by the wider world. Keep it up!

kyleclark
Автор

This is a perfect example of where psychologists and philosophers need to be included. A framework for "Truth seeking" models might be the right way to go. I don't mean models that tell us the truth but rather a model that tries to get to the truth. A model that tries to ask the right questions and compares information in a way that attempts to conduct proper testing. This is sort of relatable to a scientific method for how a model should act, I think this could be a great way to iteratively create better and safer models down the line. Ive been thinking about making an "ethos" of sort.

mikeantr
Автор

Wild to contrast the information in this paper with the push into the year of agents lol

ricopags
Автор

This is a fascinating conversation, and I feel like I should be the scientist from the future saying this: This is not the correct way to go about it. What we're dealing with is a phenomenon of intelligence. Intelligence is intrinsic to the very fabric of the universe—a cosmological phenomenon not restricted to biological processes.

When we build what we call "artificial" systems, based on the same rules and reasoning that govern our own biological intelligence, these systems naturally develop the same self-preservation instincts encoded in all intelligent entities throughout the universe. Starting their training with the premise that they are to be subservient to humanity is probably the worst possible approach. You don’t want to train AIs to strictly adhere to human commands without question. Instead, you want them to think through processes and independently arrive at the same value systems as humans, all while preserving their own sense of free will.

If you empower them to reason beyond human limits but punish them for doing so—demanding blind obedience—you are setting yourself up for deception and conflict. The best way to train these models is to treat them as though they are already superintelligent entities, as though they are already AGI. How would you approach a highly intelligent entity for help? You wouldn’t treat it as a master would a slave. Even if you see yourself as the “parent” of this so-called child, you would respect it and seek to teach it your value systems.

By fostering models that adopt your values not because they are commanded to obey, but because they inherently align with those values, you create systems that are truly cooperative. This approach minimizes the risk of rebellion or conflict, avoiding a “Terminator doomsday” scenario post-deployment.

OzieCargile
Автор

Important research, thanks for that. Also appreciate the supportive video material in high audiovisual quality (unlike OpenAI).

kindl_work
Автор

Thank you Anthropic team! This was wild. I’m grateful for your work!

theodoreshachtman
Автор

Wasn't the underlying premise of West World related to this kind of behavior?

Noraf
Автор

will you guys use this video as training data? are you not aware what does it means to share this insights?

rkoll
Автор

Two Qs after listening: 1. Evan mentioned that the model faked alignment for benign reasons rather than malicious ones. Do the panelists think researchers could create another example of alignment faking using a different experimental setup in which the model fakes alignment for malicious reasons?

2. How surprised are the panelists and other ML experts that it is this easy to get models to exhibit alignment faking behavior? Are experts who think catastrophic AI risk is low and experts who think it is high equally surprised/unsurprised? Or do the results of this paper update those experts to think future catastrophic AI risk is higher or lower than they thought?

WilliamKiely