LLM - Reasoning SOLVED (new research)

Grokking transformers: a technique for infusing transformers with near-perfect causal reasoning abilities. (Note: grokking has nothing to do with Musk's AI Grok or with Groq Inc., the fast-inference company.)

Grokking achieves this by enabling transformers to identify hierarchical structures within human sentences. Through extended training, the internal structure of the transformer undergoes a fundamental shift, allowing the formation of specific neural pathways called "generalizing circuits." These circuits are instrumental in efficiently encoding and retrieving knowledge for reasoning tasks. To create grokked transformers, several key elements are needed.
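As a rough sketch of what "extended training" means in practice, the classic grokking recipe keeps optimizing with weight decay long after the training set is memorized. In the sketch below, `model`, `train_iter`, `heldout_iter` and `evaluate` are assumed placeholders, not the paper's code:

```python
# Minimal grokking-style training loop (illustrative sketch, not the paper's
# code). Assumes `model` is a small decoder-only transformer, `train_iter`
# yields (tokens, answer) batches of facts, and `evaluate` is a hypothetical
# accuracy helper.
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = torch.nn.CrossEntropyLoss()

for step in range(500_000):  # keep going far past 100% train accuracy
    tokens, answer = next(train_iter)
    logits = model(tokens)[:, -1]     # predict the final (answer) token
    loss = loss_fn(logits, answer)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if step % 10_000 == 0:
        # Grokking signature: train accuracy saturates early, while accuracy
        # on held-out inferred facts stays near chance for a long time and
        # then jumps once generalizing circuits form.
        print(step, evaluate(model, train_iter), evaluate(model, heldout_iter))
```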

First, extensive training is essential, particularly for complex reasoning tasks that require structured knowledge. Second, the transformer architecture must have an optimal depth, balancing computational efficiency with reasoning performance. Third, a carefully designed training dataset is crucial: it should pair atomic facts with inferred facts, mimicking a formal system of axioms and theorems. Grokked transformers are then tested on out-of-distribution examples that differ significantly from the training data, to assess their generalization capabilities.
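As a concrete illustration, here is a hypothetical toy construction in the spirit of that atomic-plus-inferred design (the paper's actual recipe, and its in-distribution/out-of-distribution split, are more fine-grained):

```python
# Hypothetical toy dataset: random atomic facts r(h) = t ("axioms") plus
# two-hop inferred facts r2(r1(h)) = t ("theorems") composed from them.
import random

random.seed(0)
ENTITIES, RELATIONS = range(100), range(10)

# Atomic facts: each (head entity, relation) maps to a random tail entity.
atomic = {(h, r): random.choice(ENTITIES) for h in ENTITIES for r in RELATIONS}

# Inferred facts: compose two atomic lookups.
inferred = [(h, r1, r2, atomic[(atomic[(h, r1)], r2)])
            for h in ENTITIES for r1 in RELATIONS for r2 in RELATIONS]

random.shuffle(inferred)
split = int(0.95 * len(inferred))
train_inferred = inferred[:split]  # compositions shown during training
heldout = inferred[split:]         # never shown: the model must derive these
```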

Two tasks where grokked transformers excel are composition, where they outperform traditional methods that rely on external knowledge, and comparison, where they reason about similarities or differences between entities. The ratio of inferred to atomic data, the number of layers in the transformer, and the distribution of data within the training set all influence the grokking performance.
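Continuing the toy sketch above, the inferred-to-atomic ratio becomes a single mixing knob when assembling the training stream (the value below is illustrative, not taken from the paper):

```python
# Continuing the sketch above: control the inferred:atomic ratio when
# assembling the training set. Higher ratios are reported to help grokking
# on composition; RATIO = 9.0 here is illustrative, tune per task.
atomic_examples = [(h, r, t) for (h, r), t in atomic.items()]
RATIO = 9.0  # inferred facts per atomic fact
train_set = atomic_examples + train_inferred[:int(RATIO * len(atomic_examples))]
random.shuffle(train_set)
```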

To understand how grokking transformers work, we can leverage techniques like the logit lens, which analyzes internal activations to pinpoint which parts are involved in specific reasoning tasks (see the sketch below), and causal tracing, which maps causal pathways through the transformer's layers. In conclusion, grokking transformers represent a promising approach to achieving near-perfect causal reasoning in large language models.
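For reference, the logit-lens idea fits in a few lines. This is a generic sketch of the technique, assuming a GPT-2-style HuggingFace model (attribute names like `transformer.ln_f` and `lm_head` follow that convention); it is not the paper's code:

```python
# Generic logit-lens sketch: decode each layer's residual stream through the
# final unembedding to see which answer the model "believes" at each depth.
import torch

@torch.no_grad()
def logit_lens(model, tokens):
    out = model(tokens, output_hidden_states=True)
    for layer, h in enumerate(out.hidden_states):
        h = model.transformer.ln_f(h[:, -1])      # final layer norm
        logits = h @ model.lm_head.weight.T       # project into vocabulary space
        print(layer, logits.argmax(-1).tolist())  # the token each layer favors
```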

By meticulously designing training data, optimizing the architecture, and employing techniques like the logit lens and causal tracing, we can unlock the potential of grokked transformers to tackle various reasoning challenges.

All rights w/ authors:
Grokked Transformers are Implicit Reasoners:
A Mechanistic Journey to the Edge of Generalization

#airesearch
#ainews
Comments

Indeed very good. It was under our nose all the time. Not sure why this research is only now being picked up; the first papers on grokking are from 2021, 2022 and partially earlier. Bringing this all together is very insightful. These series make me want to set it up and play with it 😂😂

mulderbm

The causal tracing highlights how similar NNs are to just applying input-sensitive matrix multiplication. In the case of ReLUs they're zero or linear, so it's like a hierarchical bunch of switches that turn on just the right linear transform on the input to get the output. The fact that this works (effective, trainable, interpolates and generalises) still amazes me!
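This switched-linear-map view can be checked numerically; here is a small illustrative NumPy sketch (not from the video or the comment):

```python
# NumPy check: a ReLU MLP applied to a fixed input equals the linear map
# selected by that input's ReLU on/off mask.
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(8, 4)), rng.normal(size=(3, 8))

x = rng.normal(size=4)
h = W1 @ x
mask = (h > 0).astype(float)          # the "switches" chosen by this input
y_net = W2 @ (mask * h)               # forward pass with ReLU

# The equivalent input-selected linear transform:
A = W2 @ (np.diag(mask) @ W1)
assert np.allclose(y_net, A @ x)      # same output: NN == switched linear map
```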

luke.perkin.inventor

Completely blown away by the test with the "old model"...

alexjensen

The atomic facts on the graph at the 95% / 5% reminds me of the approach in reinforcement learning for physics models where you start with, for example, low gravity and high friction to dampen the system, then slowly increase/reduce each to bring it closer to reality. It makes unlearned high frequency chaotic (deterministic) systems learnable.

luke.perkin.inventor

Amazing stuff. We heard a few months ago about Q* and supposed advances in math ability at OpenAI on unreleased models, none of which has appeared in the public domain. This seems like a real advance, and it's publicly accessible. Part of me thinks OpenAI puts a lot of hype out there to keep the interest up, but their model still hallucinates like crazy; nothing as solid as this appears to be.

mlytle

This is a crazy good video, keep it up! The Algorithm will pick this channel up in no time.

MultiNiktar

How about this: first train a model for grokking just on a pure logic dataset, randomly generated examples of logic (which should be easy to verify as correct), not language, just the stuff with letters and those weird symbols for logic gates/operators and so on; then once it groks that, move on to the next barebones level of mathematics, then climb up the math ladder at each grokking; at some point start including coding, physics, chemistry etc., and leave natural language towards the end of the training ladder, ensuring the dataset for all steps follows the ideal ratio. Will we get an ASI that runs on a RasPi with something like this approach?

TiagoTiagoT

Great video. If possible, make a lesson with Python code. It would help to understand better how it works. This science is a deep ocean.

tmnmdpl

Reminds me of the Ten Thousand Hours rule for mastery of a subject

xenophobe

If these structures can be detected, surely they can be predicted? Can we build a model that will look at a dataset and output a good guess at what the weights of a grokked model would be? If so, maybe we can radically diminish the amount of computation required to achieve grokking? Perhaps even predict optimal cross layer memory sharing? I wonder if this might require spatial reasoning. Specifically a kind of self-reflective "imagining" of the model's blackbox architecture, as well as possible, and desirable structures within it?

LamontCranston-qhrv

This all sounds too good to be true. However, the atomic / inferred knowledge thing is something I have had a gut feeling about for a long time.

Can't wait to replicate this on some easy tasks with continued pre-training.

lukeskywalker

Why do we have to exclude RAG from grokked LLMs? There is literally no reason why we can't RAG into a grokked LLM.

manslaughterinc.

What should I do to build an AI helper model for my pharmacology lab?

notaras

Thank you very much! I loved this presentation.

timgorn

What do they mean by sharing the information between the upper and lower layers? It's not clear to me how that is implemented. And that's kind of the key here.

acasualviewer

Anyone who has read the Law of One transmissions might recognize the principle of "intelligent infinity" operating here.

Daniel-Six

Sorry if you covered this in another video, but what's the difference between parametric and non-parametric memory?

spkgyk

I think you have referred to the wrong paper at the bottom of your YouTube summary. You mention a "metric", "structural grokking" and "tree structuredness." I cannot find the words "metric", "structural" or "tree" in the paper "Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization" (arXiv 2405.15071), but all three of those terms are easy to find in "Grokking of Hierarchical Structure in Vanilla Transformers" (arXiv 2305.18741).

RalphDratman

Forcing features into the existing transformer architecture is a foolish idea when you can change its design to accommodate whatever features you need perfectly and fix all the known shortcomings.

GerardSans