Why 'Grokking' AI Would Be A Key To AGI


Check out my newsletter:

Are We Done With MMLU?

Alice in Wonderland: Simple Tasks Showing Complete Reasoning Breakdown in State-Of-the-Art Large Language Models

Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization

Grokfast: Accelerated Grokking by Amplifying Slow Gradients

This video is supported by the kind Patrons & YouTube Members:
🙏Andrew Lescelius, alex j, Chris LeDoux, Alex Maurice, Miguilim, Deagan, FiFaŁ, Robert Zawiasa, Daddy Wen, Tony Jimenez, Panther Modern, Jake Disco, Demilson Quintao, Shuhong Chen, Hongbo Men, happi nyuu nyaa, Carol Lo, Mose Sakashita, Miguel, Bandera, Gennaro Schiano, gunwoo, Ravid Freedman, Mert Seftali, Mrityunjay, Richárd Nagyfi, Timo Steiner, Henrik G Sundt, projectAnthony, Brigham Hall, Kyle Hudson, Kalila, Jef Come, Jvari Williams, Tien Tien, BIll Mangrum, owned, Janne Kytölä, SO, Hector, Drexon, Claxvii 177th, Inferencer, Michael Brenner, Akkusativ, Oleg Wock, FantomBloth

[Music] massobeats - noon
[Video Editor] Silas
Comments

So essentially Grokking is just making the AI bash its head into a concept over and over and over again until it finally understands it. Guys, I think I've been grokking for years.

FireFox

New pitch to investors: "Just 10x more GPUs bro, 10x GPUs to push past this stagnant validation and we will reach grokking, I promise"

dhillaz

Putting out 3 videos in 4 days that are so complex and labor intensive is crazy. Love the vids man, keep up the grind

manuelburghartz

So basically, it's like how if we read something 1 or 2 times, we might remember the answer, but if we read it or run it through our head 20x, we are more likely to fully understand the logic behind what we're reading. Given that higher parameter LLMs are more expensive to do this with, I wonder if small models will actually be the most capable in the long run.

blisphul

I love how I have to see your videos 3 times to understand them, kinda like when I was first starting out with calculus!

cdkw

This Fourier-transform filter thing is just nuts. When I see stuff like this, or AlphaProof, or PRMs, I can't imagine we wouldn't reach huge intelligence leaps beyond AGI in the next 5 years. I mean, all it takes is a simple mathematical trick; that's the level of infancy AI is currently in. Look at other fields of science, like materials science: even in the 60s, just to figure out materials for LEDs, they went through an order of magnitude more struggle than for the simple AI breakthroughs of this year's papers. Or look at physics, space, semiconductors. And AI, at the software level, is so much easier to experiment with than those things.

JazevoAudiosurf

Grokking is an instance of the minimum description length principle.
If you have a problem, you can just memorize a point-wise input-to-output mapping.
This has zero generalization.
But from there, you can keep pruning your mapping, making it simpler, a.k.a. more compressed.
The program that generalizes the best (while performing well on a training set) is the shortest.
→ Generalization is memorization + regularization.
But this is of course still limited to in-distribution regularization.
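The memorize-then-compress contrast above can be sketched in a few lines (a toy illustration I'm adding, not from the video; modular addition stands in for the task):

```python
# A point-wise lookup table memorizes training pairs but fails off the training
# set, while a shorter "program" (the rule itself) generalizes.

def make_lookup(pairs):
    """Memorization: store every (input -> output) pair seen in training."""
    table = dict(pairs)
    return lambda a, b: table.get((a, b))  # None outside the training set

def rule(a, b):
    """The compressed program: modular addition, far shorter than the table."""
    return (a + b) % 7

# Partial coverage of the input space, like a finite training set.
train = [((a, b), (a + b) % 7) for a in range(7) for b in range(4)]
memo = make_lookup(train)

print(memo(3, 2))   # seen in training -> 5
print(memo(3, 6))   # unseen -> None (zero generalization)
print(rule(3, 6))   # the short program generalizes -> 2
```

The lookup table's "description length" grows with the training set, while the rule's stays constant, which is the MDL intuition the comment appeals to.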

-mwolf

One of the best channels out there that has real value to everyone. You NEED a raise!

PYETech

I find that Alice-in-Wonderland-type responses can be significantly improved by system-prompting the model to form data structures from the known data and then infer from those structures. Something like this (a minimal version):

```
You are tasked with solving complex relationship questions by first mapping all known facts into a JSON structure and then using this structure to infer answers. When given a question, follow these steps:
1. Extract all given facts.
2. Create a JSON structure to represent these facts.
3. Use the JSON structure to navigate and infer answers.
4. Provide clear and logically consistent responses based on the JSON file.
```

I used this technique very successfully when working with gossip analysis and determining the source of gossip but quickly realized its benefits in other logical fields.
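As a toy illustration of steps 2–3 of that prompt (my own hypothetical example; the schema and names are made up), the "infer from the structure" step becomes ordinary traversal code once the facts are in JSON form:

```python
import json

# Step 2: facts the model would emit as JSON after extraction.
facts_json = json.dumps({
    "alice": {"brothers": ["bob", "carl"], "sisters": ["dana"]},
})

# Step 3: navigate the structure instead of reasoning in free-form text.
def sisters_of_brother(structure, person):
    """From a brother's point of view, 'person' herself is also a sister."""
    return sorted(structure[person]["sisters"] + [person])

structure = json.loads(facts_json)
print(sisters_of_brother(structure, "alice"))  # ['alice', 'dana']
```

Making the relationship graph explicit is what prevents the "Alice's brother has 1 sister" slip, since the traversal forces the model (or the code) to count Alice herself.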

Gee

Grokking is akin to the evolution of language even after a population has become fully literate. Every once in a while, someone figures out a new connection between seemingly unrelated concepts, or uses one of them in a new context by mistake or because they forgot the intended word, etc. This continuous increase in information entropy even after exhausting the parameter space reminds me a lot of what some scientists say about information in the Degenerate Era's black holes.

rubncarmona

Grokking our way to AGI...
11:33 — the Grokfast paper has mind-boggling potential, so who's using it? When will we see its results?

ytubeanon

1. **Current State of LM Benchmarks**
*Timestamp: 0:00:00*

2. **Benchmark Performance Issues**
*Timestamp: 0:00:03*

3. **Implications of Reordering Questions**
*Timestamp: 0:00:20*

4. **Alice in Wonderland Paper Findings**
*Timestamp: 0:01:57*

5. **The Concept of Grokking**
*Timestamp: 0:04:27*

6. **Grokking vs. Double Descent**
*Timestamp: 0:06:06*

7. **Grokking in Transformers and New Research**
*Timestamp: 0:08:12*

8. **Potential Solutions for Improved Reasoning**
*Timestamp: 0:09:02*

9. **Grokfast Implementation**
*Timestamp: 0:11:28*

### Ads
1. **Hub SWAT AI Resources**
*Timestamp: 0:01:13*

### Funny Jokes
1. **"Absolute dog water"**
*Timestamp: 0:00:03*

2. **"Kind of crazy from a more cynical and critical perspective"**
*Timestamp: 0:00:33*

3. **"Can you imagine an AI being able to do this? Only humans would be able to come up with something this random and absurdly funny"**
*Timestamp: 0:03:13*

4. **"If an AI can truly do this it actually might be so over for us, so for the sake of burning down rainforests"**
*Timestamp: 0:03:23*

5. **"Elon's Grok LM is probably named after the book and not related to the ML concept that we are talking about today"**
*Timestamp: 0:05:43*

6. **"Mr. Zuck saying that Llama 3 70B never stopped learning even after they trained it three or four times past the Chinchilla optimum is not copium"**
*Timestamp: 0:10:03*

dzbuzzfeed

I've been saying this for a year. Other researchers keep foolishly positioning grokking as a weird training artifact without practical value, when there is literally research to the contrary, yet they still see no value lol. Almost like common sense to me. Imagine going through school with no context, no homework, no tutoring and still producing current SOTA LM benchmark scores. The fact that LMs can do this with severely oblique data makes the answer clear: hybrid data. Increase data density with inferred facts. Remember, reasoning is basically syntactic transformation. Reformulating samples using formal semantics for native symbolic reasoning is the answer. Clear as day. Also fixing PE (positional encoding) to solve the reversal curse. All you need.

As someone who has trained a smaller model at 12k tokens per parameter without any real saturation: models, first off, should be way smaller. Then focus on hybrid data. AGI will be compact, in my personal opinion. For instance, I believe a 10B model could exceed GPT-4 using the formula I described above, since imo it should be trained on 100T tokens lol. Models are vastly overparameterized and it's so idiotic to me. Brilliant engineers, but their first principles are wrong.

Grokfast is super important, but you have to modify the code to work with larger models. FYI, deeper layers want to grok more than the toy models seen in research.
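For readers who haven't seen the Grokfast paper: its core trick is to keep an exponential moving average of each parameter's gradient and add the amplified slow component back before the optimizer step. Below is a minimal framework-free sketch on a toy 1-D quadratic (the loop, loss, and constants are illustrative assumptions on my part; `alpha` and `lamb` follow the paper's notation):

```python
# Grokfast-EMA filter: grad_filtered = grad + lamb * ema(grad).
# The EMA isolates the slow (low-frequency) component of the gradient
# history, which the paper amplifies to reach grokking sooner.

def grokfast_ema_step(grad, ema, alpha=0.98, lamb=2.0):
    ema = alpha * ema + (1 - alpha) * grad   # slow component of the gradient
    return grad + lamb * ema, ema            # amplified update direction

# Toy usage: minimize f(w) = (w - 3)^2 with plain SGD plus the filter.
w, ema, lr = 0.0, 0.0, 0.05
for _ in range(200):
    grad = 2 * (w - 3)
    update, ema = grokfast_ema_step(grad, ema)
    w -= lr * update
print(round(w, 3))  # converges near 3.0
```

In a real training loop the same two lines run per parameter tensor, which is where the memory overhead (one EMA buffer per parameter) bites on larger models, consistent with the comment about needing modifications.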

alexanderbrown-dgsy

No shit, I've literally written code for this idea! I can't believe someone was working on this like I was; I didn't even know it was called grokking.

kazzear_

Important to remember that Chinchilla compute-optimal is not inference-optimal.

iantimmis

Imo there is a need for research on how to decouple the model from the information, at least to some degree.
A big problem with current LLMs is that they are *too good* at learning.
Yes, they learn too well; they don't need to think, they just learn the thing.
Reasoning is a way to shift the burden from memory to computation. If you have tons of storage space you're going to use it; if you have very little storage space you're going to be forced to compress as much as possible.

If you think about it, many parameters are easier to overfit with than few.

Koroistro

One of the best videos on AI I've seen in a while! No hype, all facts, well explained while not shying away from complex topics. Beautiful explanation of Grokfast. You just earned yourself a sub!

Josiah

I am a big fan of the Llama-3-70B model, but the fact that it achieves 0.049 on simple AIW questions tells me these results come mostly from memorization of MMLU rather than generalization. Why doesn't it fail as badly on AIW+ questions? Simply because it has seen much more data; remember that we are talking about a staggering 15T tokens of training data here.
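For context on why the AIW tasks are so damning, the whole question family reduces to a one-line ground truth that a template can compute (a hypothetical generator; the paper's exact phrasing may differ):

```python
def aiw_question(brothers, sisters):
    """Generate an AIW-style question and its ground-truth answer.
    Alice's brother has all of Alice's sisters plus Alice herself as sisters."""
    q = (f"Alice has {brothers} brothers and {sisters} sisters. "
         "How many sisters does Alice's brother have?")
    return q, sisters + 1

q, answer = aiw_question(3, 2)
print(answer)  # 3
```

A model that had generalized the family-relation structure would get `sisters + 1` for every instantiation, not just the phrasings it happened to memorize.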

perelmanych

I think a 1.8T-parameter model grokked into an "understanding" of all logical states will become AGI.
That kind of power would turbo-accelerate AI tech, since it could begin doing research itself.

RedOneM

They need benchmark testing that has variations in question inputs and randomization of answer choices.
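A minimal sketch of what such randomization could look like (function names and structure are my own illustrative assumptions): shuffle the answer options per trial and re-map the gold label, so a model that memorized fixed letter positions gets exposed.

```python
import random

def randomized_trial(question, options, correct_idx, rng):
    """Shuffle answer choices and return the new index of the correct answer."""
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    return question, shuffled, order.index(correct_idx)

rng = random.Random(0)  # seeded for reproducible benchmark runs
q, opts, gold = randomized_trial(
    "2 + 2 = ?", ["3", "4", "5", "6"], correct_idx=1, rng=rng)
print(opts[gold])  # "4"
```

Scoring the same items across many shuffles (and paraphrased question inputs) separates genuine reasoning from position or surface-form memorization, which is exactly the failure mode the reordering experiments in the video point at.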

briangman