Anthropic's New Mech-Interp Paper, A Deep Dive

Support my learning journey either by clicking the Join button above, becoming a Patreon member, or a one-time Venmo!

Discuss this stuff with other Tunadorks on Discord

All my other links

Tunadorable

Comments

It makes sense that an autoencoder would perform a kind of PCA. I just never considered that before. Good job!

dr.mikeybee
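A minimal sketch of the autoencoder/PCA connection mentioned in the comment above, assuming the simplest possible case: a linear autoencoder with one hidden unit and tied encoder/decoder weights, trained by plain gradient descent on centered toy data. The data, function names, learning rate, and step count are all hypothetical choices for illustration and do not come from the video or the paper. The sparse autoencoders in the paper are nonlinear and overcomplete, so this only illustrates the linear special case.

```python
import math

def make_data():
    """Toy 2-D points spread mostly along the (1, 1) diagonal (hypothetical data)."""
    points = []
    for i, t in enumerate(range(-10, 11)):
        s = 1 if i % 2 == 0 else -1          # small off-diagonal wiggle
        points.append((t / 10 + 0.05 * s, t / 10 - 0.05 * s))
    # Center the data so the covariance matrix is meaningful.
    mx = sum(p[0] for p in points) / len(points)
    my = sum(p[1] for p in points) / len(points)
    return [(x - mx, y - my) for x, y in points]

def first_pc(points, iters=200):
    """First principal direction via power iteration on the covariance matrix."""
    n = len(points)
    cxx = sum(x * x for x, _ in points) / n
    cxy = sum(x * y for x, y in points) / n
    cyy = sum(y * y for _, y in points) / n
    v = (1.0, 0.0)
    for _ in range(iters):
        v = (cxx * v[0] + cxy * v[1], cxy * v[0] + cyy * v[1])
        norm = math.hypot(v[0], v[1])
        v = (v[0] / norm, v[1] / norm)
    return v

def train_tied_autoencoder(points, lr=0.1, steps=2000):
    """Minimize mean ||x - w (w . x)||^2 over a single weight vector w."""
    w = [1.0, 0.0]
    n = len(points)
    for _ in range(steps):
        g = [0.0, 0.0]
        for x in points:
            a = w[0] * x[0] + w[1] * x[1]              # 1-D code (encoder)
            r = (x[0] - w[0] * a, x[1] - w[1] * a)     # reconstruction residual
            wr = w[0] * r[0] + w[1] * r[1]
            # Gradient of ||r||^2 with respect to w (tied weights).
            g[0] += -2.0 * (a * r[0] + wr * x[0])
            g[1] += -2.0 * (a * r[1] + wr * x[1])
        w = [w[0] - lr * g[0] / n, w[1] - lr * g[1] / n]
    return w

def abs_cosine(u, v):
    """Absolute cosine similarity, ignoring the sign ambiguity of a direction."""
    dot = u[0] * v[0] + u[1] * v[1]
    return abs(dot) / (math.hypot(u[0], u[1]) * math.hypot(v[0], v[1]))

data = make_data()
pc = first_pc(data)
w = train_tied_autoencoder(data)
print("PCA direction:", pc)
print("autoencoder weight:", w)
print("alignment:", abs_cosine(pc, w))
```

In this toy setting the learned weight vector lines up with the first principal direction, which is the sense in which a linear autoencoder "performs a kind of PCA".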

That error feature is fascinating. I've been thinking that reasoning is sharded in functional areas. This finding suggests that the notion of error has been abstracted and parsimony is being optimized.

dr.mikeybee

That technique seems so powerful. Thanks for the overview.

RickeyBowers

If I understand what is being said, feedforward, fully connected neural nets have n-1 diagonal paths, where n is the layer dimension, and one orthogonal path for every node. There is no up and down, only forward.

dr.mikeybee

I laughed a lot at the idea of an LLM that is hopelessly obsessed with the Golden Gate Bridge and can't think about anything else.

andybrice

I think they mean that salience can be superpositional. In other words, a single weight doesn't have a single purpose. It has different salience depending on the other weights along activation paths.

dr.mikeybee

Excellent video! Superb job summarizing the blog!

Anonymous-lwzy

Robert_AIZI just posted "Comments on Anthropic's Scaling Monosemanticity". A key point he makes is that these features only represent what autointerp names them when they're at particularly high magnitude. When, e.g., the Golden Gate feature is at lower magnitude, we can't necessarily assume it's strictly a Golden Gate Bridge feature; polysemanticity would be expected to be higher the lower a feature's magnitude is.

laurenpinschannels
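One way to picture the point in the comment above is to bin a feature's activations by magnitude and check, per bin, how often the input actually matches the feature's label. The sketch below assumes a hypothetical list of `(activation, matches_label)` records and hypothetical bin edges; the numbers are hand-made for illustration and are not data from the paper or from Robert_AIZI's post.

```python
def precision_by_activation(records, edges):
    """Fraction of label-matching examples in each [edges[i], edges[i+1]) bin.

    records: iterable of (activation, matches_label) pairs.
    edges:   ascending bin edges.
    Returns None for a bin that contains no records.
    """
    out = []
    for lo, hi in zip(edges, edges[1:]):
        hits = [matches for act, matches in records if lo <= act < hi]
        out.append(sum(hits) / len(hits) if hits else None)
    return out

# Hypothetical, hand-made records: NOT measurements from any real model.
records = [
    (0.1, False), (0.2, False), (0.3, True), (0.4, False),   # weak activations
    (1.2, True), (1.5, False), (1.8, True), (1.9, True),     # medium activations
    (3.1, True), (3.5, True), (4.2, True), (4.8, True),      # strong activations
]
edges = [0.0, 1.0, 3.0, 5.0]

print(precision_by_activation(records, edges))  # [0.25, 0.75, 1.0]
```

If real activations showed a pattern like this toy one, the label would be trustworthy only in the top bin, which is the caveat the comment raises.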

@40:37
"Concepts related to entrapment, containment, or being trapped or confined within something like a **bottle** or frame"

This makes the analogy of AI as a genie hit different for me

preston_is_on_youtube

It's fascinating that semantic space has a shape. Cultural differences aside, the semantic space's shape for various languages should be the same. I've never heard anyone say this before. You understand spaces very well.

dr.mikeybee

Thanks for sharing this! Actually gives me hope for the future that we may be able to get a handle on this out of control AI development situation!

themeeseman

I think the reason they use the middle layer is because it can be furthest from the token embeddings? The last layer and the first layer have to be more directly connected to the embeddings of the tokens, right?

drdca

The "weird names" they chose are not weird; they are the names we have long used for prototype functions in Python. This is the domain-knowledge problem that AI isn't going to solve for people without domain knowledge.

Joviex

Taking all these papers and asking ChatGPT to explain them to me, a non-expert.

Yarrottogon-Project

Regarding the LLM's hateful/racist rants and guilt, if a similar process occurs within people, then we know which are the most racist:
The ones most suffering from white guilt 😂

TomM-po
