Reading AI's Mind - Mechanistic Interpretability Explained [Anthropic Research]


Solving AI Doomerism: Anthropic's Research On AI Mechanistic Interpretability. This is a big first step toward understanding what the underlying nodes within an AI model are actually "thinking".

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
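For readers who want to go deeper: the paper's core technique is to train a sparse autoencoder on a model's internal activations, learning an overcomplete "dictionary" of feature directions whose sparse coefficients tend to be monosemantic. A minimal PyTorch sketch of that idea (the dimensions, learning rate, and L1 coefficient are illustrative assumptions, not the paper's settings):

```python
# Minimal sketch: a sparse autoencoder over recorded MLP activations.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)  # activations -> feature coefficients
        self.decoder = nn.Linear(d_dict, d_model)  # weight columns act as dictionary directions
        self.relu = nn.ReLU()

    def forward(self, acts: torch.Tensor):
        features = self.relu(self.encoder(acts))   # sparse, non-negative feature activations
        recon = self.decoder(features)             # reconstruct the original activations
        return recon, features

sae = SparseAutoencoder(d_model=512, d_dict=4096)  # overcomplete dictionary (assumed sizes)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 1e-3                                    # sparsity pressure (assumed value)

acts = torch.randn(64, 512)                        # stand-in for real recorded activations
opt.zero_grad()
recon, features = sae(acts)
loss = ((recon - acts) ** 2).mean() + l1_coeff * features.abs().mean()
loss.backward()
opt.step()
```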

This video is supported by the kind Patrons & YouTube Members:
🙏Andrew Lescelius, alex j, Chris LeDoux, Alex Maurice, Miguilim, Deagan, FiFaŁ, Daddy Wen, Tony Jimenez, Panther Modern, Jake Disco, Demilson Quintao, Shuhong Chen, Hongbo Men, happi nyuu nyaa, Carol Lo, Mose Sakashita, Miguel, Bandera, Gennaro Schiano, gunwoo, Ravid Freedman, Mert Seftali, Mrityunjay, Richárd Nagyfi, Timo Steiner, Henrik G Sundt, projectAnthony, Brigham Hall, Kyle Hudson, Kalila, Jef Come, Jvari Williams, Tien Tien, BIll Mangrum, owned, Janne Kytölä, SO

[Music] massobeats - warmth
[Video Editor] @askejm
Comments

Not everyone appreciates it, but thanks for not dumbing it down too much!

dgram

I feel like I've been so focused on the output that I haven't really stopped and thought about how the hidden layers in these models are even working. This was a great vid and makes me want to read into the mechanics of it all.

AkaThePistachio

A neural network that can help you understand neural networks that I can't understand at all.

elepot

We could train NNs to decode NNs, but then we'd have to worry about mesa-alignment. The dictionary approach seems like a good starting point indeed, but I wonder if there isn't some kind of annealing process we could impose during training so that the feature distribution isn't so arbitrary, but instead minimizes an energy function associated with the superposition states -- i.e., if one node can filter a feature, it should have less energy than two nodes filtering the same feature, and likewise a node filtering multiple features should have higher energy. I have no idea how this could be done, however, lol.

anywallsocket
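The energy-function idea in the comment above is speculative, but here is one way it could be sketched: score a (features x neurons) attribution matrix with the participation ratio, which is roughly 1 when a vector is concentrated on a single entry and roughly n when spread over n entries. Penalizing it during training would push toward one neuron per feature and one feature per neuron. A toy illustration of that idea, not a published method:

```python
# Hypothetical "superposition energy" penalty over an attribution matrix.
import torch

def participation_ratio(a: torch.Tensor, dim: int, eps: float = 1e-8) -> torch.Tensor:
    l1 = a.abs().sum(dim=dim)
    l2 = a.norm(dim=dim) + eps
    return (l1 / l2) ** 2  # ~1 if concentrated on one entry, ~n if spread over n

def superposition_energy(attrib: torch.Tensor) -> torch.Tensor:
    per_feature = participation_ratio(attrib, dim=1)  # neurons used per feature
    per_neuron = participation_ratio(attrib, dim=0)   # features carried per neuron
    return per_feature.mean() + per_neuron.mean()

attrib = torch.randn(32, 128, requires_grad=True)  # stand-in feature-to-neuron matrix
energy = superposition_energy(attrib)
energy.backward()  # differentiable, so it could be added to a training loss
```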

Thank you. I believe your videos have achieved an excellent blend between easy-to-consume and high-level learning. Although I did not understand everything, you have helped enhance my vocabulary and understanding. Excellent teaching skills.

devaj

The fact that we can make safe AI doesn't mean we won't make an unsafe one. One of the dangers of AI is an AI weapons race.

DrSid

Thank you, now I have to watch this video on repeat for 2 hours to understand what the fuck is going on.

Revan

I watched the whole video like I understood what bycloud was saying.

I still appreciate the video <3

AshT

Interpretability does not equal AI safety. It is only a solution to the alignment problem; it doesn't do anything to prevent people from doing malicious things.

anhta

I would love to see this used on larger language models. The idea that you can steer the network is powerful. Could you imagine taking a really small toy model for programming languages and using the compiler and type hints to steer it, allowing you to have a really tiny model that can perform as well as some of its larger relatives?

dihydromonoxide
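On the steering point above: once a dictionary feature's direction is known, one common trick is to add that direction to a layer's activations at inference time to bias the model toward the feature. A minimal sketch using a PyTorch forward hook; the stand-in layer, the random direction, and the scale are all assumptions for illustration:

```python
# Activation steering sketch: nudge a layer's output along a feature direction.
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(512, 512)                 # stand-in for a transformer sublayer
feature_direction = torch.randn(512)
feature_direction /= feature_direction.norm()
scale = 4.0                                 # steering strength (tune empirically)

def steering_hook(module, inputs, output):
    # Returning a value from a forward hook replaces the layer's output.
    return output + scale * feature_direction

handle = layer.register_forward_hook(steering_hook)
steered = layer(torch.randn(1, 512))        # activations now biased toward the feature
handle.remove()                             # restore normal behavior
```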

Reminder that every step toward "safety" and "steerability" is also another step toward humans - that is, specific groups of humans with very particular ideologies and political goals - having total control over the kinds of outputs models will put out.
Now, I'm happy about mechanistic alignment specifically, because it should theoretically allow companies to simply flat-out remove dangerous information the model doesn't need to have. Realistically, though, I don't see the big players taking a less heavy-handed approach to alignment any time soon. They *want* models to be not just law-abiding but 'moral' according to their view of what it means to be moral.

YUTPIA

It feels like people rely too much on simple code that makes complex models, when complex code that makes a simple (comprehensible) model would be preferable.

BradleyZS

Bro, continue to make the videos above 20 min; they are the few videos that I watch every week.

renanleao

Pfft, I was proud I could understand 70% of it.

canygard

You use the term superposition; do you mean it in the same way that quantum entanglement gives extra degrees of freedom when in superposition, or is this just a parallel analogy?

That is, are you suggesting the nets exhibit quantum-like superpositions or quantum-exact superpositions? It seems so close it's semantics, but basically you're probing the neural network just as one would sample a quantum-entangled particle with only enough energy to keep it from collapsing - or rather, in your example, are you collapsing them into a single state?

wanfuse
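To the quantum question above: superposition here is ordinary linear algebra, not quantum mechanics. A layer can pack more features than it has dimensions by assigning them nearly orthogonal directions, at the price of small interference between features; nothing "collapses" when you probe it. A toy demo with illustrative sizes:

```python
# Superposition as linear algebra: 512 features packed into 64 dimensions.
import numpy as np

rng = np.random.default_rng(0)
d, n_features = 64, 512
directions = rng.standard_normal((n_features, d))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)  # unit feature directions

x = 3.0 * directions[7]                     # activate feature 7 only
readout = directions @ x                    # dot-product "probe" for every feature

print(readout[7])                           # ~3.0: the true feature reads out strongly
print(np.abs(np.delete(readout, 7)).max())  # small but nonzero: interference "noise"
```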

Knowing that there are many others like me relaxes me.

enesmahmutkulak

WELL DONE SIR WELL DONE! AUTOMATONS! ANALOG COMPUTING! AND THE LIBRARY! YES! THAT IS IT, THERE IT IS, THE 4TH WALL BREAK

kitchenpotsnpans

What was the last big thing that was going to change the world - the Internet? And what a cesspool of criminal activity that has now turned into.

cpuuk

"If we can fully understand and interpret AI networks we can pretty much fully guarantee AI safety"

I mean, no, but it is a really useful step towards AI safety.

hillosand

Whether the authors realize it or not, this approach to XAI is structuralism in disguise. When you take into account the volume of data created by LLMs, it's also subject to feedback loops. It also does not work well for some languages, like those with pro-drop grammar, or when analyzing data in contexts where meaning is underdetermined. If the approach is used to encourage/regulate public consensus on meaning by gradually binding it to mechanistic interpretability, then in effect it would facilitate cultural engineering … though maybe unintentionally.

In language, ambiguity from polysemy/superposition is a feature, not a bug. Someone who strongly objects to that is probably a lawyer. Languages/dialects have evolved to facilitate ambiguity, since it serves a purpose in communication. How this method works in translingual LLMs would be interesting.

It's not clear whether this method scales. It's probably more efficient than Shapley values, though that's not saying much.

It is interesting though, so maybe I’ll read the paper.

DavidConnerCodeaholic
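On the Shapley aside in the last comment: exact Shapley attribution averages each feature's marginal contribution over every coalition of the other features, which is exponential in the feature count, so almost anything scales better. A self-contained toy with an arbitrary value function (everything here is illustrative):

```python
# Exact Shapley values: O(2^(n-1)) coalitions per player, hence the scaling pain.
from itertools import combinations
from math import factorial

def shapley(n: int, value) -> list[float]:
    players = range(n)
    phi = [0.0] * n
    for i in players:
        others = [p for p in players if p != i]
        for k in range(len(others) + 1):
            for coalition in combinations(others, k):
                # Standard Shapley weight: |S|! (n - |S| - 1)! / n!
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[i] += weight * (value(set(coalition) | {i}) - value(set(coalition)))
    return phi

# Toy value function with diminishing returns; symmetric players get equal credit.
print(shapley(4, lambda s: len(s) ** 0.5))
```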