Physics of Language Models: Part 3.1 + 3.2, Knowledge Storage, Extraction and Manipulation

Timecodes
0:00 - Prelude
6:59 - Toy Example and Motivation
12:07 - Definitions
16:07 - Result 1: Mixed Training
21:38 - Result 2: Pretrain and Finetune
23:37 - Result 3: Knowledge Augmentation
28:21 - Result 4: P-Probing
33:29 - Result 5: Q-Probing
36:25 - Result 6: Celebrity Can Help Minority
41:00 - Result 7: Bidirectional Model + MLM
46:02 - Start of Knowledge Manipulation
46:57 - Result 8: Knowledge Partial/Dual Retrieval
51:47 - Result 9: Knowledge Classification and Comparison
1:04:44 - Result 10: Knowledge Inverse Search (Reversal Curse)
1:15:37 - Conclusion

This is an expanded version of the talk I gave about the following two papers:

Part 3.1, Knowledge Storage and Extraction (Results 1-7)

Part 3.2, Knowledge Manipulation (Results 8-10)
Comments

Wow, this is amazing! I can't wait for the part 3.3 video!

SrIgort

I really enjoy your talk. Super interesting, coherent and thought-provoking. Thanks! 😀

haolunwu

That is so fundamental and meaningful. Thank you for your work!

NeilHu-zuny

This is fire 🔥
Big thanks for making such an extensive explanation; this is really awesome work!

Dizzy

I came across this after seeing the tutorial at ICML 2024. Thank you for the great lecture!

이지평-bo

I wonder if in-dist QA and in-dist BIO accuracy would actually grow at the same rate (rather than QA growing faster) if we restricted the graphs at 20:48 to people not in the out-dist split. Maybe it's easier to learn information about these people simply because they appear more often in the training data, rather than because the info is presented in BIO or QA form. But maybe I misunderstood something about how the datasets are designed and split. Interesting work!

ellel

Maybe a stupid question, but is there a Part 2 paper in this series?

yohanhamilton

I read another paper, Let's Think Dot by Dot, which sort of complements Result 9. It would be even more interesting to see if "...", i.e., filler tokens, could help improve the experiments in Result 9.

fengliang

What if I put the augmented data into the SFT stage rather than the pretraining stage? What would happen?

余犇-wx

So what was the paper in FOCS before LoRA?

EugeneKrevenets

Another incidental but perplexing question: what is the difference between knowledge manipulation and reasoning? That is, what are the criteria that distinguish Parts 2.1-2.2 from Part 3.2?

ironmanch

Is it possible that inverse knowledge search would be simpler for a model like BERT? Is it worth comparing?

ironmanch

Could you share the slide deck? I'd like to talk my team through these results at a reading group.

AB-cyfd