Anthropic's New Mech-Interp Paper, A Deep Dive


Support my learning journey by clicking the Join button above, becoming a Patreon member, or sending a one-time Venmo donation!

Discuss this stuff with other Tunadorks on Discord

All my other links
Comments

It makes sense that an autoencoder would perform a kind of PCA. I just never considered that before. Good job!

dr.mikeybee
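The comment above is right that there is a classical connection here: a *linear* autoencoder trained with squared reconstruction loss converges to the same subspace PCA finds. A minimal sketch of that claim (not from the video; all data and variable names here are illustrative):

```python
# Sketch: a linear autoencoder recovers the PCA subspace.
import numpy as np

rng = np.random.default_rng(0)

# Data living (mostly) on a 2-D subspace of R^5, plus a little noise.
X = rng.standard_normal((500, 2)) @ rng.standard_normal((2, 5))
X += 0.01 * rng.standard_normal(X.shape)
X -= X.mean(axis=0)  # center the data, as PCA does

# PCA baseline: project onto the top-2 right singular vectors.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
P_pca = Vt[:2].T @ Vt[:2]
pca_err = np.linalg.norm(X - X @ P_pca) ** 2 / len(X)

# Linear autoencoder: encoder W (5x2), decoder D (2x5), squared loss,
# trained with plain gradient descent.
W = 0.1 * rng.standard_normal((5, 2))
D = 0.1 * rng.standard_normal((2, 5))
lr = 0.01
for _ in range(5000):
    R = X @ W @ D - X                  # reconstruction residual
    gW = 2 * X.T @ R @ D.T / len(X)    # dLoss/dW
    gD = 2 * W.T @ X.T @ R / len(X)    # dLoss/dD
    W -= lr * gW
    D -= lr * gD

ae_err = np.linalg.norm(X - X @ W @ D) ** 2 / len(X)
print(pca_err, ae_err)  # the two reconstruction errors nearly match
```

The learned product `W @ D` converges toward the same rank-2 projection PCA produces; the nonlinearity and sparsity penalty in Anthropic's sparse autoencoders are exactly what pushes them beyond this PCA-like behavior.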

That error feature is fascinating. I've been thinking that reasoning is sharded in functional areas. This finding suggests that the notion of error has been abstracted and parsimony is being optimized.

dr.mikeybee

That technique seems so powerful. Thanks for the overview.

RickeyBowers

If I understand what is being said, feed-forward, fully connected neural nets have n-1 diagonal paths, where n is the layer dimension, and one orthogonal path for every node. There is no up and down -- only forward.

dr.mikeybee

I laughed a lot at the idea of an LLM that is hopelessly obsessed with the Golden Gate Bridge and can't think about anything else.

andybrice

I think they mean that salience can be superpositional. In other words, a single weight doesn't have a single purpose; it has different salience depending on other weights along activation paths.

dr.mikeybee

Excellent video! Superb job summarizing the blog!

Anonymous-lwzy

Robert_AIZI just posted "Comments on Anthropic's Scaling Monosemanticity". A key point he makes is that these features only represent what autointerp names them when they fire at particularly high magnitude; when, e.g., the Golden Gate feature is at lower magnitude, we can't necessarily assume it's strictly a Golden Gate Bridge feature: polysemanticity would be expected to increase as a feature's magnitude decreases.

laurenpinschannels

@40:37
"Concepts related to entrapment, containment, or being trapped or confined within something like a **bottle** or frame"

This makes the analogy of AI as a genie hit different for me

preston_is_on_youtube

It's fascinating that semantic space has a shape. Cultural differences aside, the semantic space's shape for various languages should be the same. I've never heard anyone say this before. You understand spaces very well.

dr.mikeybee

Thanks for sharing this! It actually gives me hope for the future, that we may be able to get a handle on this out-of-control AI development situation!

themeeseman

I think the reason they use the middle layer is that it can be furthest from the token embeddings? The first and last layers have to be more directly connected to the embeddings of the tokens, right?

drdca

The "weird names" they chose are not weird; they are the names we have used for prototype functions in Python all along. This is the domain-knowledge problem that AI isn't going to solve for people without domain knowledge.

Joviex

I'm taking all these papers and asking ChatGPT to explain them to me, a non-expert.

Yarrottogon-Project

Regarding the LLM's hateful/racist rants and guilt: if a similar process occurs within people, then we know who the most racist ones are:
the ones suffering most from white guilt 😂

TomM-po