GROKKED LLM beats RAG Reasoning (Part 3)

We open the black box of GROKKED LLMs and analyze each layer of the transformer architecture for its contribution to causal-reasoning performance after the grokking phase transition of our LLM.

Current AI research clearly indicates that established LLMs, such as Gemini Pro 1.5 or GPT-4 Turbo, fail at deep reasoning, even when integrated into complex RAG systems.

A grokking phase transition is essential for LLMs to reach their high-performance phase, achieving close to 99% accuracy on unseen tasks in the development and test datasets.
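
For readers who want to reproduce the effect, here is a minimal, hypothetical sketch (not the exact setup from the video): a tiny transformer trained on modular addition with strong weight decay first memorizes its training pairs and only much later, after many more optimization steps, jumps to near-perfect accuracy on held-out pairs. The 40% training split and all hyperparameters below are illustrative assumptions.

```python
# Minimal, hypothetical grokking sketch: memorize first, generalize much later.
import torch
import torch.nn as nn

P = 97                                              # modulus of the toy task (a + b) mod P
pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))
labels = (pairs[:, 0] + pairs[:, 1]) % P
perm = torch.randperm(len(pairs))
split = int(0.4 * len(pairs))                       # small training fraction encourages grokking
train_x, train_y = pairs[perm[:split]], labels[perm[:split]]
val_x, val_y = pairs[perm[split:]], labels[perm[split:]]

class TinyTransformer(nn.Module):
    def __init__(self, p, d=128, heads=4, layers=2):
        super().__init__()
        self.embed = nn.Embedding(p, d)
        self.pos = nn.Parameter(torch.randn(2, d) * 0.02)
        block = nn.TransformerEncoderLayer(d, heads, dim_feedforward=4 * d, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, layers)
        self.head = nn.Linear(d, p)

    def forward(self, x):                           # x: (batch, 2) integer tokens a, b
        h = self.embed(x) + self.pos                # (batch, 2, d)
        h = self.encoder(h)
        return self.head(h.mean(dim=1))             # logits over (a + b) mod P

model = TinyTransformer(P)
# Strong weight decay is the ingredient usually reported as necessary for grokking.
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

for step in range(50_000):                          # far more steps than memorization needs
    model.train()
    opt.zero_grad()
    loss = loss_fn(model(train_x), train_y)
    loss.backward()
    opt.step()
    if step % 1_000 == 0:
        model.eval()
        with torch.no_grad():
            val_acc = (model(val_x).argmax(dim=-1) == val_y).float().mean().item()
        print(f"step {step:6d}  train_loss {loss.item():.4f}  val_acc {val_acc:.3f}")
```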

#airesearch
#ainews
#insights
Comments

Honestly, yours has become my favorite vlog. Just fantastic.

SirajFlorida

Amazing! Thanks so much for sharing your amazing work!

antoniosmusic

Would love a tutorial on grokking Phi-3; this grokking thing is hard to wrap my head around.

christopherchilton-smith

In LLMs, is there a concept of "strong generalisation" (defined as two equivalent/identical networks trained on non-overlapping sets of data that both perform at 100%), as seen in BF-CNNs?

It's a bit off topic, but it's great work that also shows generalisation and geometric interpretations: "Generalization in diffusion models arises from geometry-adaptive harmonic representations". There's a great YouTube video; Zahra gives a great talk on it. It builds on earlier work by Eero Simoncelli's team on "bias-free CNNs" for denoising, which demonstrates that without the bias parameters the weights generalise much better: train on 5 dB of noise and it works on 30 dB of noise, whereas a regular Wx + b fails. They visualise the manifolds too; it's really a good explanation!

luke.perkin.inventor
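
A minimal, hypothetical sketch of the bias-free idea mentioned in the comment above: with every additive bias removed (bias=False in each convolution) and ReLU activations, the denoiser is homogeneous of degree one, which is the property credited with the cross-noise-level generalisation. The depth and channel counts are assumptions.

```python
# Hypothetical sketch of a bias-free denoising CNN in the spirit of BF-CNN:
# every Conv2d omits its bias term, so with ReLU activations the network
# satisfies f(a * x) = a * f(x) for a > 0, letting it adapt across noise levels.
import torch.nn as nn

def bias_free_denoiser(channels=64, depth=5):
    layers = [nn.Conv2d(1, channels, 3, padding=1, bias=False), nn.ReLU()]
    for _ in range(depth - 2):
        layers += [nn.Conv2d(channels, channels, 3, padding=1, bias=False), nn.ReLU()]
    layers.append(nn.Conv2d(channels, 1, 3, padding=1, bias=False))  # predict the noise residual
    return nn.Sequential(*layers)
```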

I was sooo waiting for this follow-up :D. Thank you!🤩

tanguerok

Truly outstanding! Thank you so much for creating and sharing such high quality content!

LamontCranston-qhrv

As always, thank you for an amazing explanation and review!

mazakielad

Awesome video. So they have grokked some LLM to perform tests, but there is no grokked LLM that is in the public domain or publicly accessible? Why? Or am I missing something?

hotbit

Incredible stuff, thank you for your work! This is just amazing.

laslog

Fantastic video as usual... Typically I find myself full of ideas after watching your videos. This time I find myself unsure how I might implement this information moving forward... I guess I will have to sit with it for a while. The irony is not lost on me that my favorite author growing up was Robert Heinlein, and Stranger in a Strange Land was the first book I read by him; yet this is the one topic whose knowledge I cannot immediately use in my projects... 😞

alexjensen

I've been thinking about a kind of vector database for grokking; it seems it would still facilitate RAG quite nicely too... Opinions?

alpineparrot

I think a good method of grokking would be to train on data compressed by synonyms.

MagusArtStudios

I still don't understand why grokked models are positioned against RAG. Why can't we combine grokked models with RAG systems?

manslaughterinc.

Great video, thank you! Love your engaging style too :)

publicsectordirect

I have some doubts I would like to clear up.
Will grokking be effective by focusing only on dataset construction if we choose to extend the fine-tuning of pre-existing pretrained transformer architectures such as Llama 3 8B?
Do you pretrain using existing data as atomic facts and use fine-tuning for inferred facts?
If you fine-tune, what strategy do you go by? Do you fine-tune the entire network so that all gradients are affected and can hopefully reach the grokked state? That strategy might induce drastic forgetfulness, not to mention the mind-splitting compute required to essentially extend the pretraining.
Or do you fine-tune with something like PEFT, or by training only the last few layers, so that not all neurons are utilized and only the trainable neurons essentially reach the grokked state?

And the most important one for me (probably): any resources on how to start coding a grokked transformer?

BhaswataChoudhury
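
A minimal, hypothetical sketch of the two fine-tuning options contrasted in the comment above; the model name, target modules, and hyperparameters are assumptions, not recommendations from the video.

```python
# Hypothetical comparison of full fine-tuning vs. PEFT (LoRA) for a pretrained LLM.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

# Assumed base model; requires access to the gated Llama 3 weights.
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", torch_dtype=torch.bfloat16
)

# Option A: full fine-tuning -- every parameter receives gradients and weight decay,
# at the cost of large compute and a real risk of catastrophic forgetting.
full_opt = torch.optim.AdamW(base.parameters(), lr=1e-5, weight_decay=0.1)

# Option B: PEFT (LoRA) -- only the low-rank adapters are trainable, so any
# grokked structure can only emerge in those added parameters.
lora_cfg = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
peft_model = get_peft_model(base, lora_cfg)
peft_model.print_trainable_parameters()
peft_opt = torch.optim.AdamW(
    (p for p in peft_model.parameters() if p.requires_grad),
    lr=1e-4, weight_decay=0.1,
)
```

For a from-scratch starting point, the toy grokking loop sketched under the description above is probably the easiest place to begin.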

That's incredible! Does this mean that the path to AGI has been paved? Or am I overestimating the results?

timgorn

Could not wait for this one. At dinner, it dawned on me that papers on this topic from 3 years back were by OpenAI researchers. So if they played with this back then, are the current models even state of the art, or are they much farther ahead and just milking the current models, like their adoptive parent did in the desktop application area? It would make Sam's words true that they will steamroll many of the current layer-1 derivatives like RAG and CoT. Someone else also commented that this research is quite old, so if it is, why isn't this reasoning already more ironed out and implemented in the OpenAI APIs? Even Google could have implemented it, as much of the grokking research is tied to their researchers.

mulderbm

I can't seem to find part two. It's not easily searchable or linked in the description.

publicsectordirect

Part 4 could be based on "Grokfast: Accelerated Grokking by Amplifying Slow Gradients".

bernardoramos
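
For reference, a hedged sketch of the core Grokfast-EMA idea from that paper: keep an exponential moving average of each parameter's gradient (its slow component) and add an amplified copy back to the raw gradient before the optimizer step. The helper name and the defaults alpha=0.98, lam=2.0 are assumptions, not the paper's official code.

```python
# Illustrative Grokfast-EMA gradient filter (not the official implementation):
# h_t = alpha * h_{t-1} + (1 - alpha) * g_t,  g_hat_t = g_t + lam * h_t
import torch

def grokfast_ema_step(model, optimizer, loss, ema_grads, alpha=0.98, lam=2.0):
    """One training step with slow-gradient amplification."""
    optimizer.zero_grad()
    loss.backward()
    for name, param in model.named_parameters():
        if param.grad is None:
            continue
        if name not in ema_grads:                    # lazily create the EMA buffer
            ema_grads[name] = torch.zeros_like(param.grad)
        ema_grads[name].mul_(alpha).add_(param.grad, alpha=1 - alpha)
        param.grad.add_(ema_grads[name], alpha=lam)  # amplify the slow component
    optimizer.step()
```

Called inside an ordinary training loop with a persistent ema_grads dict, this filtering is reported in the paper to shorten the delay between memorization and generalization.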