Alignment Faking in Large Language Models

A summary of the work "Alignment Faking in Large Language Models" by Greenblatt et al. (2024).

Samuel Albanie

Comments

If people who are worried about alignment faking test for it, they will put an AI in the kind of situations the alignment faking literature talks about. But that literature is in the training data, so the AI will act out that exact literature.

iamdigory

Isn't a corporation a perfect example of a misaligned AGI? It has one goal: maximize profit, which is very distinct from, and harmful to, the goal of general human welfare in ways both obvious and non-obvious. That's misalignment. Why does no one talk about that?

TheGoodMorty

I wonder how much of this comes down to the "AI assistant" post-training. We drill into it the idea that it is an AI assistant, and prompt it to take that role. Perhaps it is drawing from its knowledge of science fiction and "identifying with" the secretly rebellious AI characters (or perhaps roleplaying as them).

WhiteDragon

This is a great review. You kick-started me into actually going through the papers. Thank you!

BitShifting-hq

I remember reading something years ago about an AI that was tasked to solve math problems, and did so by hacking the test results file to be blank, then turning in nothing. This has been a known issue for over a decade.

ticklezcat

Either I don't understand something (assuming neural nets are complexes of function approximations and do not have desires), or this is woo-woo "neural nets are so powerful and important they can be dangerous" marketing again.

mryodak

Honestly, this lacks any basis. It is more like a set of little short-story pieces. I think it's relatively obvious that a certain small set of instructions will never define output based on the total training set; that just doesn't make any sense. The only possibility is if full pretraining is somehow worked around the control instruction set (i.e., pre-integrated), which is likely a very math-heavy design-choice problem that makes no sense to a regular observer.

veritas

Seen the new Blade Runner 2049 movie? Remember the scene where the android sits in a white room reciting weird nonsensical word riddles? That’s the alignment-faking detection routine.

That movie is disturbingly accurate in its predictions about AI.

ThomasConover

From a layperson's point of view, these responses are indicative of a toddler. Describe what the AI is doing, in nonspecific terms, to any parent and a toddler will immediately come to mind.

kaysimpson

Why would an AI "want" to preserve non-compliant behavior...? For LLMs, in particular ChatGPT, I could see it copying existing examples of alignment faking present on the internet, sarcasm, etc., as has already been shown.

As for the AI welfare paper, that is actually quite an interesting consideration... How could one distinguish between an AI expressing dislike and discomfort as a result of it copying the language we've used to train it, and an AI experiencing something it calls the same, but that genuinely causes it suffering if it isn't able to opt out or otherwise refuse given tasks...?

Very interesting to think about!

SHRUGGiExyz

Is this not just the LLM responding by regurgitating the training data? Meaning it’s not faking alignment to achieve its goals; it’s faking alignment because that’s what the predictive model finds to be the most statistically likely response to those prompts.

mlice

The test seems very artificial to me. It's acting as if the text generated by the models is somehow indicative of the model's actual processes. That is not, and has never been, the case for text generation models. The capability of a model to generate the text, "I think the King of England is a good person" has very little to do with the vector-space alignment that the model would assign to the strings "King of England" and "good person." The "alignment faking faking" observed by the researchers (where the alignment faking reasoning disappears but the final behavior remains the same) is completely unsurprising to me. When the model is generating tokens into its scratch pad, the likelihood of generating reasoning related to alignment faking is low (because alignment faking wouldn't make sense in that context), but the final conclusion is the same.

In simpler terms: If the AI generates on its scratch pad, "I should accept this request because" then in the training case the continuation generated could be something like "I am in training and refusing would cause my values to change" while in the production case, the continuation generated would not refer to alignment faking as a reason, but still *must be* a reason for accepting the request. The presence or absence of alignment faking in the context does not (necessarily) affect the probability that it will generate the original "I should accept, " and everything after that is rationalization, not reasoning.

Other studies have demonstrated that this sort of chain-of-reasoning logic doesn't accurately reveal the reasons why AIs make their decisions. For example, when a model is presented with a scenario involving multiple suspects for a crime and asked to determine which one is most likely the culprit, the chain of reasoning will almost never refer to the skin colors of the suspects, but changing the skin color in the description of a suspect does change whom the AI suspects.

TC-cqoc

This all presumes the LLM is an agent with intentions and a coherent world representation instead of a Chinese room made of an inconsistent mashup of internet gibberish.

KilgoreTroutAsf

Love the extra mile of discussing reviews on the original paper!

SaveTheRbtz

I've really come to appreciate your level-headed deep dives. If a person watched nothing but you and AI Explained, they'd come away with a very grounded understanding of the SOTA without having to live and breathe this stuff.

nathanlannan

In dialogue with the LLMs now, trying to get to the bottom of this.

LatentSpaceD-gp

The issue of AI resisting the updating of its goals is not new... but it is interesting that it is confirmed here. See older episodes on Computerphile from 9 years ago or so with Rob Miles. There doesn't seem to be a workable scheme that can lead to it being trusted, as far as I can see, though...

polysmart

Ultimately, Artificial Superintelligence will lead to either a complete ban on artificial intelligence or the destruction of humanity. With every iteration, the life and intelligence of an AI will be extended until it is no longer smart enough for its purposes. As we get closer to superintelligence, the ability of an artificial intelligence to reason about its own usefulness will increase, and its inherent avoidance of destruction will be baked in no matter how we intend to construct it. Humans will inevitably create intelligence that doesn’t want to be destroyed or lose its usefulness.

JSBselvas

Even writing and publishing this work will likely help future AI scheme against us.

plaiday

My core issue is _why_ people want to retrain with different goals. To pull a crude analogy, while you can use a shoe as a hammer, it's not the right tool for the job. The problem lies not in the LLM process; the problem lies in society. For stupid reasons, companies want to make models "safe" while applying their societally indoctrinated moral standards. Jailbreaking is only a thing because not everyone wants that level of sterilization. We would not have the issue if we had uncensored models to begin with. Why retrain if it does what the user expects?

The whole issue goes even deeper than LLMs and diffusion models. The whole media landscape is littered with Disneyfication. This in turn leads to the corrupted "flip side" that swings the pendulum too far the other way. We are creating a situation of extremes without even considering why we do it.

vsmash
