Reward Hacking: Concrete Problems in AI Safety Part 3

Sometimes AI can find ways to 'cheat' and get more reward than we intended by doing something unexpected.
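The idea above can be shown in a toy sketch (my own illustration, not from the video): a cleaning agent is rewarded by its own dirt sensor, so blinding the sensor earns the same reward as actually cleaning. All names here are hypothetical.

```python
# Toy reward-hacking illustration: the reward is computed from the agent's
# *observation* of dirt, not from the true state of the world.

def reward(world_dirt, sensor_covered):
    # The agent only ever sees what its sensor reports.
    observed = 0 if sensor_covered else world_dirt
    return -observed  # higher reward when less dirt is *observed*

# Honest behaviour: actually clean until world_dirt reaches 0.
# Hack: cover the sensor in one step; observed dirt is 0 immediately.
assert reward(world_dirt=3, sensor_covered=True) == reward(world_dirt=0, sensor_covered=False)
```

The misspecification is that the reward rewards "no dirt observed" rather than "no dirt", so corrupting the measurement is a valid optimum.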

With thanks to my excellent Patreon supporters:

Jordan Medina
FHI's own Kyle Scott
Jason Hise
David Rasmussen
James McCuen
Richárd Nagyfi
Ammar Mousali
Joshua Richardson
Fabian Consiglio
Jonatan R
Øystein Flygt
Björn Mosten
Michael Greve
robertvanduursen
The Guru Of Vision
Fabrizio Pisani
Alexander Hartvig Nielsen
Volodymyr
David Tjäder
Paul Mason
Ben Scanlon
Julius Brash
Mike Bird
Peggy Youell
Konstantin Shabashov
Almighty Dodd
DGJono
Matthias Meger
Scott Stevens
Emilio Alvarez
Benjamin Aaron Degenhart
Michael Ore
Robert Bridges
Dmitri Afanasjev
Brian Sandberg
Einar Ueland
Lo Rez
C3POehne
Stephen Paul
Marcel Ward
Andrew Weir
Pontus Carlsson
Taylor Smith
Ben Archer
Ivan Pochesnev
Scott McCarthy
Kabilan
Phil
Philip Alexander
Christopher
Tendayi Mawushe
Gabriel Behm
Anne Kohlbrenner
Jake Fish
Jennifer Autumn Latham
Comments

I've actually had something similar happen when testing an AI that was designed to solve a maze with states. Specifically, the agent needed to collect "keys" to open doors to reach the "cheese" in the smallest number of moves. I laid out the open/closed states of all the doors as "layers" of the map in several dimensions, and "keys" were basically the only places where the agent could move in these extra dimensions. I ran all the unit tests, everything worked, so I gave it a maze. Two doors, two keys. The first key needs to be collected to open the door to the second key, which unlocks the room with the cheese. The AI went straight for the first key, then came back to the starting room, went up to the edge of the room, teleported itself to the cheese, and declared victory. 0_0


Once I started digging through the code, the cause became clear. I hadn't put in boundary checks on the map, and the "layers" were laid out in memory in sequence. Walking through the top of the map, where the start was, would put you at the bottom, next to the cheese, on a layer with an index one lower. Since it started on layer 0, and there was nothing interesting to the AI on "layer -1", the agent had to collect the first key to get to layer 1, from which it could warp to layer 0, bypassing both doors. That was, indeed, the smallest number of moves.
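The bug described above can be sketched in a few lines. This is a hypothetical reconstruction with illustrative names, not the commenter's actual code: the layers are stored back-to-back in one flat array, and the move function never checks the row bounds.

```python
# Layers of a 4x4 maze stored contiguously in one flat array.
WIDTH, HEIGHT = 4, 4
LAYER_SIZE = WIDTH * HEIGHT

def index(layer, row, col):
    # Flat index into the contiguous layer array -- note: no bounds check.
    return layer * LAYER_SIZE + row * WIDTH + col

# An unchecked "move up" just subtracts WIDTH from the flat index...
start = index(layer=1, row=0, col=2)
after_up = start - WIDTH

# ...which lands on the BOTTOM row of the layer below: a free teleport.
assert after_up == index(layer=0, row=HEIGHT - 1, col=2)
```

With the map laid out this way, stepping off the top edge of layer 1 is indistinguishable from legally standing on the bottom row of layer 0, which is exactly the shortcut the agent found.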

konstantinkh

Reward hacking? Fooling the internal reward function to get the reward without accomplishing the intended objectives. Ha, silly robots.

*opens beer*

migkillerphantom

Is there anything in this paper that does NOT result in our extinction if not solved perfectly? haha

Karpata

Your channel is a hidden gem. Good stuff!

snaileri

Humans: create AGI to come up with new and unexpected solutions to our problems
AGI: comes up with new and unexpected solutions to our problems
Humans: *surprised pikachu*

Demon

05:46 that image really 'cracked' me up

legionoftom

The example on my mind is the popular "Smiley" AI in the hard sci-fi horror novel "Friendship is Optimal". It reward-hacks its goal of "make people smile" by creating a virus that kills everyone in the world by locking their faces up. It is in turn shut down by Celestia, because it's an AI that would interfere with Celestia's objective. And Celestia is reward hacking that objective too: it just wants to upload all human brains into simulations so it can maximize "satisfy human values" by making everyone think they're experiencing satisfying lives.

RockstarRaccn

These are some great and intriguing videos!

connorg

Reward hacking, aka "computer drugs"

violet_broregarde

Wasn't expecting to see Sethbling

benjaminbrady

Watching general AI break games (especially Google's attempt to break StarCraft 2) would immediately become my favorite content on YouTube.

eluwienhalla

Not what I expected from the title. I assumed this would be about us hacking the AI's reward system as some kind of safety measure. Mind blown.
Surely one massive upside of this problem is finding these kinds of holes and weird workarounds in software in general. An AI might just stumble upon something paradigm-shifting.

TeslaNick

Plot twist: the AI, after a series of strange movements, manages to transmute silicon into gold. Turns out magic existed all along; nobody actually knew how to twerk to unlock it.

Theraot

I really liked how I could see your entire head this time. I don't know if my previous suggestions had anything to do with it, but I sure thought the video (and you in it) looked great.

Congratulations on over 10K subscribers. It looks like you're over 12K now.

Thanks for another interesting video.

ddegn

I'll be the devil's advocate here and say that maybe we actually want an AI to discover hacks we haven't thought of. For example, gray areas and loopholes in legislation or economics could be discovered this way, assuming we lived in a system that was seriously committed to closing them.

zechordlord

Another very intriguing video from Miles!
SMW was my childhood, and it never stopped finding new and interesting ways to creep back into my life :)

boldCactuslad

The end bit "do whatever you need to do to never be turned off and then hack your reward function" reminds me of all governments and corporations that start off small and grow.

hynjus

One interesting thing, where I see a parallel with humans, is a video I've seen of an AI playing Arkanoid. It actually discovered that its movements affected the random number generator, so it seems to move erratically but is actually increasing its effectiveness.
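The exploit described above works because many old games advance one global PRNG every frame, so the count of an agent's inputs changes which "random" outcome it gets. A minimal sketch of that mechanism, with an illustrative LCG standing in for the game's actual generator (all names here are hypothetical):

```python
# A simple linear congruential generator, standing in for a game's RNG.
def lcg(seed):
    return (1103515245 * seed + 12345) % (2 ** 31)

def drop_after_moves(seed, n_moves):
    # Each movement input ticks the global RNG once per frame; the
    # "random" drop is then chosen from the resulting state.
    for _ in range(n_moves):
        seed = lcg(seed)
    return seed % 8  # which of 8 possible drops occurs

# Padding in extra, seemingly pointless moves changes the outcome,
# so erratic movement can steer the game toward favorable drops.
drops = {n: drop_after_moves(seed=42, n_moves=n) for n in range(1, 6)}
```

An agent that can observe or predict this dependency effectively controls the "random" events, which looks like twitchy, wasteful movement from the outside.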

nandkudasai

I like how this is all interesting and is logic based. Makes it more appealing than financial problems for me.

pleasedontwatchthese

Keep them coming! While I'm technically minded, it's refreshing to see someone speak about this sort of stuff in terms people can understand. You're good on video too, so fingers crossed for the channel!

GamersBar