Reading AI's Mind - Mechanistic Interpretability Explained [Anthropic Research]


Solving AI Doomerism: Anthropic's Research On AI Mechanistic Interpretability. This is a big first step toward understanding what the underlying nodes within an AI model are actually "thinking".

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
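For readers who want to go deeper: the paper's core technique is to train a sparse autoencoder on a model's internal activations, learning an overcomplete "dictionary" of feature directions whose sparse coefficients tend to be monosemantic. A minimal PyTorch sketch of that idea (the dimensions, learning rate, and L1 coefficient are illustrative assumptions, not the paper's settings):

```python
# Minimal sketch: a sparse autoencoder over recorded MLP activations.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)  # activations -> feature coefficients
        self.decoder = nn.Linear(d_dict, d_model)  # weight columns act as dictionary directions
        self.relu = nn.ReLU()

    def forward(self, acts: torch.Tensor):
        features = self.relu(self.encoder(acts))   # sparse, non-negative feature activations
        recon = self.decoder(features)             # reconstruct the original activations
        return recon, features

sae = SparseAutoencoder(d_model=512, d_dict=4096)  # overcomplete dictionary (assumed sizes)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 1e-3                                    # sparsity pressure (assumed value)

acts = torch.randn(64, 512)                        # stand-in for real recorded activations
opt.zero_grad()
recon, features = sae(acts)
loss = ((recon - acts) ** 2).mean() + l1_coeff * features.abs().mean()
loss.backward()
opt.step()
```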

This video is supported by the kind Patrons & YouTube Members:
🙏Andrew Lescelius, alex j, Chris LeDoux, Alex Maurice, Miguilim, Deagan, FiFaŁ, Daddy Wen, Tony Jimenez, Panther Modern, Jake Disco, Demilson Quintao, Shuhong Chen, Hongbo Men, happi nyuu nyaa, Carol Lo, Mose Sakashita, Miguel, Bandera, Gennaro Schiano, gunwoo, Ravid Freedman, Mert Seftali, Mrityunjay, Richárd Nagyfi, Timo Steiner, Henrik G Sundt, projectAnthony, Brigham Hall, Kyle Hudson, Kalila, Jef Come, Jvari Williams, Tien Tien, BIll Mangrum, owned, Janne Kytölä, SO

[Music] massobeats - warmth
[Video Editor] @askejm
Comments

Not everyone appreciates it, but thanks for not dumbing it down too much!

dgram

I feel like I've been so focused on the output that I haven't really stopped and thought about how the hidden layers in these models are even working. This was a great vid and makes me want to read into the mechanics of it all.

AkaThePistachio

A neural network that can help you understand neural networks that I can't understand at all.

elepot

We could train NNs to decode NNs, but then we'd have to worry about mesa-alignment. The dictionary approach seems like a good starting point indeed, but I wonder if there isn't some kind of annealing process we could impose during training so that the feature distribution isn't so arbitrary, but instead minimizes an energy function associated with the superposition states -- i.e., if one node can filter a feature, it should have less energy than two nodes filtering the same feature, and likewise a node filtering multiple features should have higher energy. I have no idea how this could be done, however, lol.

anywallsocket
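The energy-function idea in the comment above is speculative, but here is one way it could be sketched: score a (features x neurons) attribution matrix with the participation ratio, which is roughly 1 when a vector is concentrated on a single entry and roughly n when spread over n entries. Penalizing it during training would push toward one neuron per feature and one feature per neuron. A toy illustration of that idea, not a published method:

```python
# Hypothetical "superposition energy" penalty over an attribution matrix.
import torch

def participation_ratio(a: torch.Tensor, dim: int, eps: float = 1e-8) -> torch.Tensor:
    l1 = a.abs().sum(dim=dim)
    l2 = a.norm(dim=dim) + eps
    return (l1 / l2) ** 2  # ~1 if concentrated on one entry, ~n if spread over n

def superposition_energy(attrib: torch.Tensor) -> torch.Tensor:
    per_feature = participation_ratio(attrib, dim=1)  # neurons used per feature
    per_neuron = participation_ratio(attrib, dim=0)   # features carried per neuron
    return per_feature.mean() + per_neuron.mean()

attrib = torch.randn(32, 128, requires_grad=True)  # stand-in feature-to-neuron matrix
energy = superposition_energy(attrib)
energy.backward()  # differentiable, so it could be added to a training loss
```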

Thank you. I believe your videos have achieved an excellent blend between easy-to-consume and high-level learning. Although I did not understand everything, you have helped enhance my vocabulary and understanding. Excellent teaching skills.

devaj

The fact that we can make safe AI doesn't mean we won't make an unsafe one. One of the dangers of AI is an AI weapons race.

DrSid

Thank you, now I have to watch this video on repeat for 2 hours to understand what the fuck is going on.

Revan

I watched the whole video like I understood what bycloud was saying.

I still appreciate the video <3

AshT

Interpretability does not equal AI safety. It is only a solution to the alignment problem; it doesn't do anything to prevent people from doing malicious things.

anhta

I would love to see this used on larger language models. The idea that you can steer the network is powerful. Could you imagine taking a really small toy model for programming languages and using the compiler and type hints to steer it, allowing you to have a really tiny model that can perform as well as some of its larger relatives?

dihydromonoxide
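On the steering point above: once a dictionary feature's direction is known, one common trick is to add that direction to a layer's activations at inference time to bias the model toward the feature. A minimal sketch using a PyTorch forward hook; the stand-in layer, the random direction, and the scale are all assumptions for illustration:

```python
# Activation steering sketch: nudge a layer's output along a feature direction.
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(512, 512)                 # stand-in for a transformer sublayer
feature_direction = torch.randn(512)
feature_direction /= feature_direction.norm()
scale = 4.0                                 # steering strength (tune empirically)

def steering_hook(module, inputs, output):
    # Returning a value from a forward hook replaces the layer's output.
    return output + scale * feature_direction

handle = layer.register_forward_hook(steering_hook)
steered = layer(torch.randn(1, 512))        # activations now biased toward the feature
handle.remove()                             # restore normal behavior
```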

Reminder that every step toward "safety" and "steerability" is also another step toward humans - that is, specific groups of humans with very particular ideologies and political goals - having total control over the kinds of outputs models will put out.
Now, I'm happy about mechanistic alignment specifically, because it should theoretically allow companies to simply flat-out remove dangerous information the model doesn't need to have. Realistically, though, I don't see the big players taking a less heavy-handed approach to alignment any time soon. They *want* models to be not just law-abiding but 'moral' according to their view of what it means to be moral.

YUTPIA

It feels like people rely too much on simple code that makes complex models, when complex code that makes a simple (comprehensible) model would be preferable.

BradleyZS

Bro, continue to make the videos above 20 min; they are the few videos that I watch every week.

renanleao

Pfft, I was proud I could understand 70% of it.

canygard

You use the term superposition; do you mean it in the same way that quantum entanglement gives extra degrees of freedom when in superposition, or is this just a parallel analogy?

That is, are you suggesting the nets exhibit quantum-like superpositions or quantum-exact superpositions? It seems so close it's semantics, but basically you're probing the neural network just as one would sample a quantum-entangled particle with only enough energy to keep it from collapsing - or rather, in your example, are you collapsing them into a single state?

wanfuse
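To the quantum question above: superposition here is ordinary linear algebra, not quantum mechanics. A layer can pack more features than it has dimensions by assigning them nearly orthogonal directions, at the price of small interference between features; nothing "collapses" when you probe it. A toy demo with illustrative sizes:

```python
# Superposition as linear algebra: 512 features packed into 64 dimensions.
import numpy as np

rng = np.random.default_rng(0)
d, n_features = 64, 512
directions = rng.standard_normal((n_features, d))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)  # unit feature directions

x = 3.0 * directions[7]                     # activate feature 7 only
readout = directions @ x                    # dot-product "probe" for every feature

print(readout[7])                           # ~3.0: the true feature reads out strongly
print(np.abs(np.delete(readout, 7)).max())  # small but nonzero: interference "noise"
```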

Knowing that there are many others like me relaxes me.

enesmahmutkulak

WELL DONE SIR WELL DONE! AUTOMATONS! ANALOG COMPUTING! AND THE LIBRARY! YES! THAT IS IT, THERE IT IS, THE 4TH WALL BREAK

kitchenpotsnpans

What was the last big thing that was going to change the world - the Internet? And what a cesspool of criminal activity that has now turned into.

cpuuk

"If we can fully understand and interpret AI networks we can pretty much fully guarantee AI safety"

I mean, no, but it is a really useful step towards AI safety.

hillosand

Whether the authors realize it or not, this approach to XAI is structuralism in disguise. When you take into account the volume of data created by LLMs, it's also subject to feedback loops. It also does not work well for some languages, like those with pro-drop grammar, or when analyzing data in contexts where meaning is underdetermined. If the approach is used to encourage/regulate public consensus on meaning by gradually binding it to mechanistic interpretability, then in effect it would facilitate cultural engineering … though maybe unintentionally.

In language, ambiguity from polysemy/superposition is a feature, not a bug. Someone who strongly objects to that is probably a lawyer. Languages/dialects have evolved to facilitate ambiguity, since it serves a purpose in communication. How this method works in translingual LLMs would be interesting.

It's not clear whether this method scales. It's probably more efficient than Shapley values, though that's not saying much.

It is interesting though, so maybe I’ll read the paper.

DavidConnerCodeaholic
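On the Shapley aside in the last comment: exact Shapley attribution averages each feature's marginal contribution over every coalition of the other features, which is exponential in the feature count, so almost anything scales better. A self-contained toy with an arbitrary value function (everything here is illustrative):

```python
# Exact Shapley values: O(2^(n-1)) coalitions per player, hence the scaling pain.
from itertools import combinations
from math import factorial

def shapley(n: int, value) -> list[float]:
    players = range(n)
    phi = [0.0] * n
    for i in players:
        others = [p for p in players if p != i]
        for k in range(len(others) + 1):
            for coalition in combinations(others, k):
                # Standard Shapley weight: |S|! (n - |S| - 1)! / n!
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[i] += weight * (value(set(coalition) | {i}) - value(set(coalition)))
    return phi

# Toy value function with diminishing returns; symmetric players get equal credit.
print(shapley(4, lambda s: len(s) ** 0.5))
```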