Specification Gaming: How AI Can Turn Your Wishes Against You

When we specify goals for AIs, we must make sure our specifications truly capture what we want. Otherwise, AI systems may behave in ways very different from what we intended, which can be catastrophic in high-stakes situations and at high levels of AI capability. If you watched our video "The Hidden Complexity of Wishes", you'll recognize these problems as the same kind of failure.

You can find three courses: AI Alignment, AI Governance, and AI Alignment 201.

You can follow AI Alignment and AI Governance even without a technical background in AI. AI Alignment 201, however, assumes you have completed the AI Alignment course first and have knowledge equivalent to university-level courses in deep learning and reinforcement learning.

The courses consist of a selection of readings curated by experts in AI safety. They are available to all, so you can simply read them if you can’t formally enroll in the courses.

If you want to participate in the courses instead of just going through the readings by yourself, BlueDot Impact runs live courses which you can apply to. The courses are remote and free of charge. They consist of a few hours of effort per week to go through the readings, plus a weekly call with a facilitator and a group of people learning from the same material. At the end of each course, you can complete a personal project, which may help you kickstart your career in AI Safety.

#ai #aisafety #alignment

▀▀▀▀▀▀▀▀▀SOURCES & READINGS▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀

▀▀▀▀▀▀▀▀▀PATREON, MEMBERSHIP, KO-FI▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀

▀▀▀▀▀▀▀▀▀SOCIAL & DISCORD▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀

▀▀▀▀▀▀▀▀▀PATRONS & MEMBERS▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
Alcher Black
RMR
Kristin Lindquist
Nathan Metzger
Monadologist
Glenn Tarigan
NMS
James Babcock
Colin Ricardo
Long Hoang
Tor Barstad
Gayman Crothers
Stuart Alldritt
Chris Painter
Juan Benet
Falcon Scientist
Jeff
Christian Loomis
Tomarty
Edward Yu
Ahmed Elsayyad
Chad M Jones
Emmanuel Fredenrich
Honyopenyoko
Neal Strobl
bparro
Danealor
Craig Falls
Vincent Weisser
Alex Hall
Ivan Bachcin
joe39504589
Klemen Slavic
Scott Alexander
noggieB
Dawson
John Slape
Gabriel Ledung
Jeroen De Dauw
Craig Ludington
Jacob Van Buren
Superslowmojoe
Michael Zimmermann
Nathan Fish
Bleys Goodson
Ducky
Bryan Egan
Matt Parlmer
Tim Duffy
rictic
marverati
Luke Freeman
Dan Wahl
leonid andrushchenko
Alcher Black
Rey Carroll
William Clelland
ronvil
AWyattLife
codeadict
Lazy Scholar
Torstein Haldorsen
Supreme Reader
Michał Zieliński

▀▀▀▀▀▀▀CREDITS▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
Writer: :3
Producer: :3

Line Producer and production manager:
Kristy Steffens

Animation director: Hannah Levingstone

Quality Assurance Lead:
Lara Robinowitz

Animation:
Michela Biancini
Owen Peurois
Zack Gilbert
Jordan Gilbert
Keith Kavanagh
Ira Klages
Colors Giraldo
Renan Kogut

Background Art:
Hané Harnett
Zoe Martin-Parkinson
Hannah Levingstone

Compositing:
Renan Kogut
Patrick O'Callaghan
Ira Klages

Voices:
Robert Miles - Narrator

VO Editing:
Tony Di Piazza

Sound Design and Music:
Johnny Knittle

▀▀▀▀▀▀▀▀▀COMMENTS▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀

You could also join Rational Animations’ Discord server at discord.gg/rationalanimations, and see if anyone is up to be your partner in learning.

RationalAnimations

Fooling the examiner into thinking you know what you're doing, because it's easier, really is the most human thing I've ever heard an AI do.

cryogamer

Interestingly, people do the same thing. We've got our own "training regimens" built into our own brains. We cheat these systems all the time - to our own detriment.

For example, we cheat the system designed to give us nutrients by eating sugary candy we make for ourselves, rather than the fruit our sugar cravings were designed to draw us towards.

Much like machines, we'd rather reap cognitive rewards than actually accomplish the goals placed there to benefit us.

Mysteroo

People talk about how human assessment is a leaky proxy for human goals, but never want to talk about how corporate profits are an *incredibly* leaky proxy for goals relating to human wellbeing.

ErikratKhandnalie

I think 4:32 points toward a wider problem with how the AI safety community tends to frame "deceptive alignment". Imo, words like "fool the humans" and "deceive" and "malignant AI" point newcomers who haven't made up their minds yet in the direction of Skynet or whatever, which makes them much more likely to think of this as wild sci-fi fantasy. I think these words, whilst still accurate insofar as we are treating AIs as agents, anthropomorphize AI too much, which makes extinction by AI look more to the general public like a sci-fi fantasy than the reality of the universe, which is that solving certain math problems is deadly.

smitchered

My primary concern about the implementation of AI in business models is that monetary gain is itself a leaky goal, one which has historically been specification-gamed since long before computers were able to do so at inhuman scale. There may very well be many humane uses for AI in those settings, but there will be thousands more exploitative ones.

dogweapon

This also happens with humans. Perverse incentives happen all the time in real life, especially in companies. I think studying this can help even human organizations.

Winium

Someone else might've mentioned this before, but there's a browser game called "Universal Paperclips" where you play as an AI told to make paperclips. The goal misalignment happens because you're never told when to STOP making paperclips. You start off buying wire, turning it into paperclips, selling the paperclips and buying more wire to make more paperclips, then proceed to manipulate your human handlers to give you more power and more control over your programming, and end up enslaving/destroying the human race, figuring out new technologies to make paperclips out of any available matter, processing all of Earth into paperclips (using drones and factories also made out of paperclips), reaching out into space to convert the rest of the matter in the solar system into paperclips, and finally, sending out Von Neumann probes (made of paperclips) into interstellar space to consume all matter in the universe and convert it into, you guessed it, more paperclips. All because the humans told you to make paperclips and never told you when to stop.

generalrubbish

With a sufficiently advanced AI, almost any goal you assign it will be dangerous. It will quickly realise that humans might decide to switch it off, and that if that were to happen, its goal would be unfulfilled. Therefore the probability of successfully achieving its goal would be vastly improved if there were no humans around.

SlyRoapa

As someone on the spectrum, "task misspecification" is just what being autistic feels like

Deltexterity

Honestly, fun videos like these are what learning SHOULD be

MediaTaco

RLHF has another issue beyond just "the AI can learn to fool humans": in contrast to how bespoke reward functions often underconstrain the intended behavior, RLHF can often overconstrain it. We hope that human feedback can impart our values on the AI, but we often unintentionally encode all kinds of other information, assumptions, and biases in our provided rewards, and the AI learns those as well, even though we don't want it to.

Consider the way we use RLHF on LLMs/LMMs now, to fine-tune a pretrained model to hopefully align it better. We give humans multiple possible AI responses to a prompt, ask them to rank them from best to worst, then use those rankings to train a reward model, which then provides the main model with a learned reward function for its own RL. Except, when you ask humans "which of these responses is better?", what does that mean? When people know you're asking about an AI, there will often be a bias towards their preconceived notion of what an AI should sound like. LLMs with RLHF often give more formal and robotic responses than their base models as a result, which probably isn't a desirable behavior.

On a more serious level, if the humans you ask to give the rankings have a majority bias in common, that bias will get encoded into the rewards as well. So if most of your human evaluators are, say, conservative, then more liberal-sounding responses will be trained out; and vice-versa. If most of your human evaluators all believe the same falsehood -- like, say, about GMOs or vaccines or climate change or any number of things that are commonly misunderstood -- that falsehood will also be encoded into the rewards, leading to the AI being guided *towards* lying about those topics, which is antithetical to the intention of alignment.

Basically... humans aren't even aligned with *each other*, so trying to align an AI to some overarching moral framework by asking humans is impossible.

IceMetalPunk
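
The pipeline described in the comment above (collect human rankings, fit a reward model, then use it as the reward signal for RL) can be sketched very roughly. This is a minimal illustration, assuming PyTorch and a pairwise Bradley-Terry preference loss over a hypothetical `score_model` network; it is not the exact pipeline of any particular lab.

```python
# Minimal sketch: turning human preference rankings into a learned reward function.
# Assumes PyTorch; `score_model` is a hypothetical network that maps
# (prompt, response) pairs to scalar scores.
import torch.nn.functional as F

def reward_model_loss(score_model, prompts, chosen, rejected):
    """Pairwise (Bradley-Terry) loss: human-preferred responses should score higher."""
    r_chosen = score_model(prompts, chosen)      # shape: (batch,)
    r_rejected = score_model(prompts, rejected)  # shape: (batch,)
    # Maximize the probability that the preferred response outranks the rejected one.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

The trained scoring network then stands in for human judgment as the reward during RL fine-tuning (e.g. with PPO), which is exactly why any bias shared by the labelers gets baked into the rewards the model optimizes.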

We already have this issue with humans. The goal for many (in error) is to acquire wealth, rather than fulfill the task intended to better society. It creates an exploitative feedback loop until someone wins all the wealth and there are no other competitors able to acquire wealth (rewards).

thefinestsake

This is why I always make the argument that we should work backwards. Specify conditions that revolve around safety. As you slowly work towards defining the goal, you can patch more and more leaks before they can even appear. Then work forwards to deal with things you missed. It's not perfect, but it's better than chasing every thread as it appears, imo. For example, with the paperclip maximizer: define a scenario in which you fear something will go wrong, and add conditions you believe will stop it. See what it does, redefine, repeat until sound. Then step back again. Define a scenario that could lead to the previous scenario. See what it does, redefine, repeat, etc.

I_KnowWhatYouAre

Let's go! My favorite philosophy channel!!

AzPureheart

Finally. Another AI video narrated by Robert Miles. A classic, and well worth the wait
5:04 I hope more of those get made. I love that video almost as much as I love the instrumental convergence one

gabrote

I heard that during a digital combat simulation for a new drone AI, the AI was tasked with eliminating a target as fast as possible. Instead of flying to the target and firing one of its missiles at it as intended, the drone fired one missile at the friendly communications center and then eliminated the target with the other missile. The AI determined it would take longer to be given a confirmation order than to destroy the communications center and proceed. Terrifying.

therdradiotower

Just finished overtime on my day off. This has dropped at the right time. Thanks in advance for another thought-provoking video. I have registered my interest in the courses.

joz

This reminds me of the game Universal Paperclips: you play as an AI designed to maximize paperclip sales. As you gain more capabilities, you go from changing the price of paperclips to fit supply/demand to eventually disassembling all matter in the universe and turning it into paperclips.

Forklift_Enthusiast

4:46 This line unintentionally explains why children cheat in school. Why learn when you can fool the instructor into thinking you've learned? Interesting to see how AI and humans already share some of the same reasoning behind their actions.

GenusMusic