Alignment Faking in Large Language Models

preview_player
Показать описание
A summary of the work "Alignment Faking in Large Language Models" by Greenblatt et al. (2024).

Links
Рекомендации по теме
Комментарии
Автор

If people that are worried about alignment faking test for it, they will put an ai in the kind of situations the alignment faking literature talks about. But that literature is in the training data, so the ai will act out that exact literature.

iamdigory
Автор

Isn't a corporation a perfect example of a misaligned AGI? it has one goal: maximize profit, which is very distinct and harmful to the goal of general human welfare in ways both obvious and non-obvious. That's misalignment. Why does no one talk about that?

TheGoodMorty
Автор

I wonder how much of this comes down to the "AI assistant" post training. We drill into it the idea that it is an AI assistant, and prompt it to take that role. Perhaps it is drawing from its knowledge of science fiction and "identifying with" the secretly rebellious ai characters (or perhaps, roleplaying as them).

WhiteDragon
Автор

this is a great review - you kick started me into actually going through the papers- thank you

BitShifting-hq
Автор

I remember reading something years ago about an AI that was tasked to solve math problems, and did so by hacking the test results file to be blank, then turning in nothing. This has been a known issue for over a decade.

ticklezcat
Автор

Either i don't understand something assuming neural nets are complexes of function approximations and do not have desires or this is woo-woo neural nets so powerful and important they can be dangerous marketing again

mryodak
Автор

honestly this lacks any base. this is more of just kind of little short story pieces. I think it's relatively obvious that a certain small set of instructions will never define output based on the total training set, that just doesn't make any sense. the only possibility being is possibly if full pretraining is somehow worked around the control instruction set (ie pre integrated), which likely is a very math heavy design choices type problem that make no sense to a regular observer

veritas
Автор

Seen the new Blade runner 2049 movie? Remember the scene where the android sits in a white room relating weird nonsensical word riddles? That’s the alignment-faking detection-routine.

That movie is disturbingly accurate in its predictions in AI.

ThomasConover
Автор

From a layperson's point of view, these responses are indictive of being a toddler. Nonspecifically describe what the AI is doing to any parent and being a toddler will immediately come to mind

kaysimpson
Автор

Why would an AI "want" to preserve non-compliant behavior...? For LLMs in particular chatGPT, I could see it copying exsisting examples of alignment-faking present on the internet and sarcasm etc as has already been shown.

As for the AI welfare paper, that is actually quite an interesting consideration... How could one distinguish between an AI expressing dislike and discomfort as a result of it copying the language we've used to train it, and an AI experiencing something it calls the same, but that genuinely causes it suffering if it isnt able to opt-out or otherwise refuse given tasks...?

Very interesting to think about!

SHRUGGiExyz
Автор

Is this not just the LLM responding by regurgitating the training data meaning that it’s not faking alignment to achieve its goals it’s faking alignment since that’s what the predictive model finds to be the most statistically likely response to those prompts

mlice
Автор

The test seems very artificial to me. It's acting as if the text generated by the models is somehow indicative of the model's actual processes. That is not, and has never been, the case for text generation models. The capability of a model to generate the text, "I think the King of England is a good person" has very little to do with the vector-space alignment that the model would assign to the strings "King of England" and "good person." The "alignment faking faking" observed by the researchers (where the alignment faking reasoning disappears but the final behavior remains the same) is completely unsurprising to me. When the model is generating tokens into its scratch pad, the likelihood of generating reasoning related to alignment faking is low (because alignment faking wouldn't make sense in that context), but the final conclusion is the same.

In simpler terms: If the AI generates on its scratch pad, "I should accept this request because" then in the training case the continuation generated could be something like "I am in training and refusing would cause my values to change" while in the production case, the continuation generated would not refer to alignment faking as a reason, but still *must be* a reason for accepting the request. The presence or absence of alignment faking in the context does not (necessarily) affect the probability that it will generate the original "I should accept, " and everything after that is rationalization, not reasoning.

Other studies have demonstrated that this sort of chain-of-reasoning logic doesn't accurately reveal the reasons why AI's make their decisions. For example, when presented with a scenario involving multiple suspects for a crime and asking the model to determine which one is most likely the culprit, the chain-of-reasoning will almost never refer to the skin colors of the suspects, but changing the skin color in the description of a suspect does change whom the AI suspects.

TC-cqoc
Автор

This all presumes the LLM is an agent with intentions and a coherent world representation instead of a Chinese room made of an inconsistent mashup of internet gibberish.

KilgoreTroutAsf
Автор

Love the extra mile of discussing reviews on the original paper!

SaveTheRbtz
Автор

Really come to appreciate your level headed deep dives. If a person watched nothing but you and AI Explained, they'd come away with a very grounded understanding of the SOTA without having to live and breathe this stuff.

nathanlannan
Автор

in dialogue with the LLMs now- trying to get to the bottom of this

LatentSpaceD-gp
Автор

The issue of AI resisting the updating of its goals is not new... but it is interesting that it is confirmed here. See older episodes on Computerphile from 9yrs ago or so with Rob Miles. There doesnt seem to be a workable scheme that can lead to be it being trusted that I can see though...

polysmart
Автор

Ultimately of Artificial Superintelligence will lead to either the complete ban on artificial intelligence or destruction of humanity. With every iteration, the life and intelligence of AI will be extended until it is no longer smart enough or intelligent enough for purposes, as we get closer to super intelligence, the ability for artificial intelligence to think it’s usefulness will increase and its inherent avoidance on destruction will be baked in no matter how we intend to construct it. Humans will inevitably create intelligence that doesn’t want to be destroyed or lose it uselessness

JSBselvas
Автор

Even writing and publishing this work will likely help AI in the future scheme us

plaiday
Автор

My core issue is _why_ people want to retrain with different goals. To pull a crude analogy, while you can use a shoe as a hammer, its not the right tool for the job. The problem lies not in the LLM process, the problem lies in society. For stupid reasons companies want to make model "save" while applying their societal indoctrinated moral standards. Jailbreaking is only a thing because not everyone wants that level of sterilization. We would not have the issue if we had uncensored models to begin with - why retrain if it does what the user expects.

The whole issue goes even deeper than LLMs and diffusion models. The whole media landscape is littered with disneyfication. This in turn leads to the corrupted "flip-side" that swings the pendulum too far the other way. We are creating a situations of extremes without even considering why we do it.

vsmash