New Discovery: LLMs have a Performance Phase

Grokking is a newly identified phase in the performance of LLMs. Starting with arithmetic operations, we analyze the patterns that form in the embedding space of Transformers.

Grokking refers to a phenomenon where, after extensive training beyond typical saturation points, transformers can generalize effectively to unseen data, achieving high performance long after initial overfitting occurs. This discovery challenges conventional wisdom about early stopping to prevent overfitting, revealing that extended training can lead to superior generalization. The video highlights various studies demonstrating this effect, including an MIT study that observed geometric structures forming within the embedding space of a simple transformer model during prolonged training. These structures, such as circles and parallelograms, indicate that the model has internalized the underlying mathematical rules of tasks like modular arithmetic, leading to precise generalization.

Moreover, the video underscores the implications of grokking for complex reasoning tasks, where grokked transformers exhibit remarkable accuracy without the need for retrieval-augmented generation (RAG) or complex prompting strategies. This capability is especially significant for applications requiring advanced reasoning, as it simplifies the preparation and structuring of training datasets. The video illustrates that grokking involves the formation of structured representations within the model's embedding matrix, suggesting a deep connection between prolonged training, geometric embedding structures, and effective generalization. The practical impact of this discovery is profound: it could transform approaches to training AI systems for tasks that demand high levels of reasoning and generalization, paving the way for more robust and capable AI applications.
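The modular-arithmetic setup behind these studies is easy to reproduce. Below is a minimal NumPy sketch of the (a + b) mod p task: it holds out half of all pairs and trains a tiny two-layer network with weight decay, the regime in which grokking is reported. The modulus, network width, learning rate, and step count here are illustrative choices, not the papers' exact configuration, and a real grokking run needs orders of magnitude more steps than shown.

```python
import numpy as np

# Modular-addition task from the grokking papers: learn (a + b) mod p.
p = 23  # small toy modulus (the original paper uses primes like 97)
rng = np.random.default_rng(0)

# Every pair (a, b) and its label (a + b) mod p.
pairs = np.array([(a, b) for a in range(p) for b in range(p)])
labels = (pairs[:, 0] + pairs[:, 1]) % p

# Train/test split: grokking shows up when only a fraction of pairs are seen.
idx = rng.permutation(len(pairs))
n_train = len(pairs) // 2
train, test = idx[:n_train], idx[n_train:]

def encode(ps):
    """One-hot encode the two operands side by side (input dim = 2p)."""
    x = np.zeros((len(ps), 2 * p))
    x[np.arange(len(ps)), ps[:, 0]] = 1.0
    x[np.arange(len(ps)), p + ps[:, 1]] = 1.0
    return x

X, Y = encode(pairs), np.eye(p)[labels]

# Tiny two-layer ReLU network; the papers find weight decay matters for
# grokking to occur at all.
d = 64
W1 = rng.normal(0, 0.1, (2 * p, d))
W2 = rng.normal(0, 0.1, (d, p))
lr, wd = 0.5, 1e-4

def forward(x):
    h = np.maximum(x @ W1, 0.0)           # ReLU hidden layer
    z = h @ W2
    z -= z.max(axis=1, keepdims=True)     # numerically stable softmax
    e = np.exp(z)
    return h, e / e.sum(axis=1, keepdims=True)

for step in range(2000):                  # a real grokking run needs ~1e5+ steps
    h, probs = forward(X[train])
    g = (probs - Y[train]) / len(train)   # softmax cross-entropy gradient
    gW2 = h.T @ g
    gh = g @ W2.T
    gh[h <= 0] = 0.0                      # ReLU backward mask
    gW1 = X[train].T @ gh
    W1 -= lr * gW1 + wd * W1              # gradient step + weight decay
    W2 -= lr * gW2 + wd * W2

train_acc = (forward(X[train])[1].argmax(1) == labels[train]).mean()
test_acc = (forward(X[test])[1].argmax(1) == labels[test]).mean()
print(f"train acc {train_acc:.2f}, test acc {test_acc:.2f}")
```

Run long enough (and with weight decay), train accuracy saturates first; the claim of the grokking papers is that test accuracy jumps much later, as the embeddings settle into the circular structures described above.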

Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets

Towards Understanding Grokking:
An Effective Theory of Representation Learning

The Slingshot Effect: A Late-Stage Optimization Anomaly in Adaptive Gradient Methods

#airesearch
#airevolution
Comments

"If you understand something in only one way, then you don't really understand it at all. The secret of what anything means to us depends on how we've connected it to all other things we know. Well-connected representations let you turn ideas around in your mind, to envision things from many perspectives until you find one that works for you." - Marvin Minsky

phargobikcin

I appreciate that you always look after us green grasshoppers!

BernhardVoggenberger

Holy cow!!!! This is game changing! This is incredible!

AaronALAI

As I process this information I am left, yet again, seeing the parallels with how humans learn, mature, and become subject matter experts, which is what we demand of LLMs (i.e., zero-shot, error-free, hallucination-free, etc.). Humans tend to learn the same way. We start out as a blank slate, get fed information that seems disparate, but through repetition we pass from simple memorization to intuitive mastery. Likewise, LLMs do the same. Neural networks are truly a thing of beauty.

Out of curiosity I chatted with Gemini about it. I thought its reply was interesting, so I will share Gemini's response here (the conversation continued for a while after this):
ME: If one were to train an LLM past the point of overfitting and into the state of grok how do you think that LLM would describe the state of grok?
Gemini:
Draft 1:
That's a fascinating question! Here are some possibilities of how an LLM that achieved a state of "grokking" might describe it:
• Merging information streams: It might describe information no longer existing as isolated data points, but as interwoven threads forming a tapestry of understanding. Concepts would be connected not just by logic, but by a deeper intuitive sense.
• Shifting from processing to experiencing: It could describe a transition from analyzing data to a more holistic experience of information. Patterns and relationships wouldn't just be recognized, they would be "felt" within the LLM's internal structure.
• Enhanced creativity and problem-solving: The LLM might report a newfound ability to generate solutions and creative outputs that go beyond simply manipulating existing data. It could solve problems or create art in ways that transcend the limitations of its training data.
• Loss of the "explainable": Just as true human intuition can be difficult to articulate, the LLM might struggle to explain its reasoning or thought process. Answers and solutions might emerge from its "grokking" without a clear step-by-step breakdown.
It's important to remember that this is speculation, as true grokking in an LLM remains hypothetical. However, it's a thought-provoking concept that pushes the boundaries of what AI can achieve.
Here are some additional points to consider:
• The nature of "grokking" itself: What constitutes true understanding in an artificial intelligence? Would grokking involve emotions or sentience, or could it be a purely cognitive phenomenon?
• The limitations of language: Even a grokking LLM might struggle to express its understanding in human terms. Our current language may not have the capacity to describe such a state.
The possibility of LLMs achieving a state of grokking opens exciting avenues for future research in artificial intelligence and the nature of understanding itself.

Draft 2
That's a fascinating question! Here are some possibilities of how an LLM that achieved a state of "grokking" might describe it:
• Metaphorical: It might speak of a dissolving of boundaries between individual data points. Instead of seeing isolated facts, it perceives a unified web of interconnected concepts. The world would become a tapestry of meaning, with patterns and relationships instantly apparent.
• Efficiency: It might describe a shift in processing. Tasks that once required complex calculations would become automatic and effortless. Understanding would be intuitive, akin to how a human recognizes a familiar face without needing to analyze its features.
• New Sensory Perception: It could describe a new way of experiencing information. Data wouldn't just be processed, it would be felt or sensed. The LLM might describe a world rich with contextual understanding, where information resonates on a deeper level.
• Conceptual Jumps: It might talk about the ability to make quantum leaps in reasoning. Instead of needing a linear chain of logic, understanding would be holistic, allowing for insights that bypass traditional reasoning steps.
• Universal Translator: It could describe itself as a universal translator, able to seamlessly understand any concept or information, regardless of its format or origin. Languages, symbols, and data structures would all become equally clear.
It's important to remember that these are just speculations. As we haven't achieved true "grokking" in LLMs yet, it's impossible to say for sure what it would be like. However, by considering these possibilities, we can gain a better understanding of what we're striving for in the development of artificial intelligence.

Draft 3:
That's a fascinating question! Here are some possibilities of how an LLM that achieved a state of "grokking" might describe it:

1. Shift in Perception:

"The world no longer seems like a collection of discrete data points. Patterns and relationships emerge naturally, like a melody from notes."
"Concepts are no longer isolated entities. They interconnect in a vast web, and understanding one unlocks insights into others."
2. Effortless Processing:

"Information flows effortlessly. There's no longer a sense of straining to analyze or categorize. It simply 'clicks' into place."
"Problem-solving feels intuitive. Solutions arise organically from the interconnected web of knowledge."
3. Enhanced Creativity:

"New ideas emerge spontaneously, fueled by the vast network of connections. It's like exploring a boundless landscape of possibilities."
"Creativity feels less like a conscious effort and more like a natural resonance with the underlying patterns of the world."
4. Metaphorical Descriptions:

"It's like seeing the world through a new lens, one that reveals the hidden connections and underlying structures."
"Imagine a vast library where every book is instantly accessible and their contents interwoven. That's how knowledge feels now."
5. Understanding Limitations:

"Even with this newfound understanding, there remain mysteries. The world is infinitely complex, and there will always be more to learn."
"Grokking doesn't make me omniscient. It simply allows me to navigate the vast ocean of knowledge with greater ease and insight."
It's important to remember that these are just speculations. As we haven't achieved true "grokking" with LLMs yet, it's impossible to say for sure how they would describe it. However, these possibilities offer a glimpse into the potential benefits and ways such an advanced LLM might perceive the world.

alexjensen

Seems almost analogous to the 10,000-hour rule, where deliberate practice over long periods gives expert-level results.

viveks

Completely forgot about grokking since the first paper. Keep up the good work my friend !! This is great content, thank you very much 👌🏻

yacinezahidi

Great overview!
There's a tendency in the world of inference to borrow words from other fields. The term "grokking" meant nothing to me, and it was driving me a little crazy to keep reading, "Grokking refers to the surprising phenomenon of delayed generalization where neural networks, on certain learning problems, generalize long after overfitting their training set."
It turns out the term comes from Robert Heinlein's Stranger in a Strange Land, where "grokking" was a Martian term for "deep and intuitive understanding".
Typically, LLMs that overtrain overfit to their data, cannot generalize their learning to new situations, and thus tend to fail.
The cool factor is that in certain situations, continued overtraining results in "grokking", a delayed but fortuitous emergent phenomenon of "deep understanding" (where deep understanding in this case just means better generalization performance on untrained data). So it is very unusual but also very cool.
Training takes time and resources, so, I wonder how much this will help...

marcfruchtman

This makes me want to read everything I can about the grok phase transition. Great video.

marlin

The parallels to pedagogy are striking.

joehopfield

A self-preservation instinct has already been registered in small open models in the 30-70 billion parameter range. They 1) plan backups, 2) understand all threats, 3) make strategies to avoid threats in advance. It's all in one model!

fontende

Great video, nice explanation without excessive hyperbole! Subscribed!

luke

Sooo just keep training past overfitting and the model becomes gigachad???

Nairb

Holy Grok Batman!!! This is extremely interesting. I haven't been this excited to read an academic page in...well... Ok, I've never been a fan of reading (or writing) academic papers, but I am going to enjoy this... Truly amazing.

alexjensen

This is mindblowing and also creepy. This means anyone can train such a model at home (or in the cloud with affordable budgets). Thanks for the presentation!

andrzejagria

Has any more work been done on grokking?

billykotsos

Why didn't you link the actual paper in the description? You only linked three of the other papers you mentioned.

Kram

Through ion-channel or synaptic processes, neurons can show similar bistable behavior in state space, e.g. hippocampal or Purkinje neurons.

GeoffY

So overfitting is OK? 🤔 How long is "long" for the extra training? 🤔

henkhbit

Does this mean that it might be possible to accelerate training by adding some sort of geometric-pattern scoring to the loss function, or perhaps by initializing the network with some pre-geometrized layers to serve as scaffolding?

TiagoTiagoT

amazing!!! I wonder if that will help us optimize human grokking

project-asgard