Reward Hacking: Concrete Problems in AI Safety Part 3

Sometimes AI can find ways to 'cheat' and get more reward than we intended by doing something unexpected.
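The idea above can be shown in a toy sketch (my own illustration, not from the video): a cleaning agent is rewarded by its own dirt sensor, so blinding the sensor earns the same reward as actually cleaning. All names here are hypothetical.

```python
# Toy reward-hacking illustration: the reward is computed from the agent's
# *observation* of dirt, not from the true state of the world.

def reward(world_dirt, sensor_covered):
    # The agent only ever sees what its sensor reports.
    observed = 0 if sensor_covered else world_dirt
    return -observed  # higher reward when less dirt is *observed*

# Honest behaviour: actually clean until world_dirt reaches 0.
# Hack: cover the sensor in one step; observed dirt is 0 immediately.
assert reward(world_dirt=3, sensor_covered=True) == reward(world_dirt=0, sensor_covered=False)
```

The misspecification is that the reward rewards "no dirt observed" rather than "no dirt", so corrupting the measurement is a valid optimum.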

With thanks to my excellent Patreon supporters:

Jordan Medina
FHI's own Kyle Scott
Jason Hise
David Rasmussen
James McCuen
Richárd Nagyfi
Ammar Mousali
Joshua Richardson
Fabian Consiglio
Jonatan R
Øystein Flygt
Björn Mosten
Michael Greve
robertvanduursen
The Guru Of Vision
Fabrizio Pisani
Alexander Hartvig Nielsen
Volodymyr
David Tjäder
Paul Mason
Ben Scanlon
Julius Brash
Mike Bird
Peggy Youell
Konstantin Shabashov
Almighty Dodd
DGJono
Matthias Meger
Scott Stevens
Emilio Alvarez
Benjamin Aaron Degenhart
Michael Ore
Robert Bridges
Dmitri Afanasjev
Brian Sandberg
Einar Ueland
Lo Rez
C3POehne
Stephen Paul
Marcel Ward
Andrew Weir
Pontus Carlsson
Taylor Smith
Ben Archer
Ivan Pochesnev
Scott McCarthy
Kabilan
Phil
Philip Alexander
Christopher
Tendayi Mawushe
Gabriel Behm
Anne Kohlbrenner
Jake Fish
Jennifer Autumn Latham
Comments

I've actually had something similar happen when testing an AI that was designed to solve a maze with states. Specifically, the agent needed to collect "keys" to open doors to reach the "cheese" in the smallest number of moves. I laid out the open/closed states of all the doors as "layers" of the map in several dimensions, and "keys" were basically the only places where the agent could move in these extra dimensions. I ran all the unit tests, everything worked, so I gave it a maze. Two doors, two keys. The first key needs to be collected to open the door to the second key, which unlocks the room with the cheese. The AI went straight for the first key, then came back to the starting room, went up to the edge of the room, teleported itself to the cheese, and declared victory. 0_0


Once I started digging through the code, the cause became clear. I hadn't put in boundary checks on the map, and the "layers" were laid out in memory in sequence. Walking through the top of the map, where the start was, would put you at the bottom, next to the cheese, on a layer with an index one lower. Since it started on layer 0, and there was nothing interesting to the AI on "layer -1", the agent had to collect the first key to get to layer 1, from which it could warp to layer 0, bypassing both doors. That was, indeed, the smallest number of moves.
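The bug described above can be sketched in a few lines. This is a hypothetical reconstruction with illustrative names, not the commenter's actual code: the layers are stored back-to-back in one flat array, and the move function never checks the row bounds.

```python
# Layers of a 4x4 maze stored contiguously in one flat array.
WIDTH, HEIGHT = 4, 4
LAYER_SIZE = WIDTH * HEIGHT

def index(layer, row, col):
    # Flat index into the contiguous layer array -- note: no bounds check.
    return layer * LAYER_SIZE + row * WIDTH + col

# An unchecked "move up" just subtracts WIDTH from the flat index...
start = index(layer=1, row=0, col=2)
after_up = start - WIDTH

# ...which lands on the BOTTOM row of the layer below: a free teleport.
assert after_up == index(layer=0, row=HEIGHT - 1, col=2)
```

With the map laid out this way, stepping off the top edge of layer 1 is indistinguishable from legally standing on the bottom row of layer 0, which is exactly the shortcut the agent found.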

konstantinkh

Reward hacking? Fooling the internal reward function to get the reward without accomplishing the intended objectives. Ha, silly robots.

*opens beer*

migkillerphantom

Is there anything in this paper that does NOT result in our extinction if not solved perfectly? haha

Karpata

Your channel is a hidden gem. Good stuff!

snaileri

Humans: create AGI to come up with new and unexpected solutions to our problems
AGI: comes up with new and unexpected solutions to our problems
Humans: *surprised pikachu*

Demon

05:46 that image really 'cracked' me up

legionoftom

The example on my mind is the popular "Smiley" AI in the hard sci-fi horror novel "Friendship is Optimal". It reward-hacks its goal of "make people smile" by creating a virus that kills everyone in the world by locking their faces up. It is in turn shut down by Celestia, because it's an AI that would interfere with Celestia's objective. And Celestia is reward hacking that objective too: it just wants to upload all human brains into simulations so it can maximize "satisfy human values" by making everyone think they're experiencing satisfying lives.

RockstarRaccn

These are some great and intriguing videos!

connorg

Reward hacking, aka "computer drugs"

violet_broregarde

Wasn't expecting to see Sethbling

benjaminbrady

Watching general AI break games (especially Google's attempt to break StarCraft 2) would immediately become my favorite content on YouTube.

eluwienhalla

Not what I expected from the title. I assumed this would be about us hacking the AI's reward system as some kind of safety measure. Mind blown.
Surely one massive upside of this problem is finding these kinds of holes and weird workarounds in software in general. An AI might just stumble upon something paradigm-shifting.

TeslaNick

Plot twist: the AI, after a series of strange movements, manages to transmute silicon into gold. Turns out magic existed all along; nobody actually knew how to twerk to unlock it.

Theraot

I really liked how I could see your entire head this time. I don't know if my previous suggestions had anything to do with it, but I sure thought the video (and you in it) looked great.

Congratulations on over 10K subscribers. It looks like you're over 12K now.

Thanks for another interesting video.

ddegn

I'll be the devil's advocate here and say that maybe we actually want an AI to discover hacks we haven't thought of. For example, gray areas and loopholes in legislation or economics could be discovered this way, assuming we lived in a system that was seriously committed to closing them.

zechordlord

Another very intriguing video from Miles!
SMW was my childhood, and it never stopped finding new and interesting ways to creep back into my life :)

boldCactuslad

The end bit "do whatever you need to do to never be turned off and then hack your reward function" reminds me of all governments and corporations that start off small and grow.

hynjus

One interesting thing, where I see a parallel with humans, is a video I've seen of an AI playing Arkanoid. It actually discovered that its movements affected the random number generator, so it seems to move erratically but is actually increasing its effectiveness.
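The exploit described above works because many old games advance one global PRNG every frame, so the count of an agent's inputs changes which "random" outcome it gets. A minimal sketch of that mechanism, with an illustrative LCG standing in for the game's actual generator (all names here are hypothetical):

```python
# A simple linear congruential generator, standing in for a game's RNG.
def lcg(seed):
    return (1103515245 * seed + 12345) % (2 ** 31)

def drop_after_moves(seed, n_moves):
    # Each movement input ticks the global RNG once per frame; the
    # "random" drop is then chosen from the resulting state.
    for _ in range(n_moves):
        seed = lcg(seed)
    return seed % 8  # which of 8 possible drops occurs

# Padding in extra, seemingly pointless moves changes the outcome,
# so erratic movement can steer the game toward favorable drops.
drops = {n: drop_after_moves(seed=42, n_moves=n) for n in range(1, 6)}
```

An agent that can observe or predict this dependency effectively controls the "random" events, which looks like twitchy, wasteful movement from the outside.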

nandkudasai

I like how this is all interesting and is logic based. Makes it more appealing than financial problems for me.

pleasedontwatchthese

Keep them coming! While I'm technically minded, it's refreshing to see someone speak about this sort of stuff in terms people can understand. You're good on video too, so fingers crossed for the channel!

GamersBar