Physics of Language Models: Part 3.1 + 3.2, Knowledge Storage, Extraction and Manipulation

Timecodes
0:00 - Prelude
6:59 - Toy Example and Motivation
12:07 - Definitions
16:07 - Result 1: Mixed Training
21:38 - Result 2: Pretrain and Finetune
23:37 - Result 3: Knowledge Augmentation
28:21 - Result 4: P-Probing
33:29 - Result 5: Q-Probing
36:25 - Result 6: Celebrity Can Help Minority
41:00 - Result 7: Bidirectional Model + MLM
46:02 - Start of Knowledge Manipulation
46:57 - Result 8: Knowledge Partial/Dual Retrieval
51:47 - Result 9: Knowledge Classification and Comparison
1:04:44 - Result 10: Knowledge Inverse Search (Reversal Curse)
1:15:37 - Conclusion

This is an expanded version of the talk I gave about the following two papers:

Part 3.1, Knowledge Storage and Extraction (Results 1-7)

Part 3.2, Knowledge Manipulation (Results 8-10)
Comments

Wow, this is amazing! I can't wait for the part 3.3 video!

SrIgort

I really enjoy your talk. Super interesting, coherent and thought-provoking. Thanks! 😀

haolunwu

That is so fundamental and meaningful. Thank you for your work!

NeilHu-zuny

This is fire 🔥
Big thanks for making such an extensive explanation; this is really awesome work!

Dizzy

I came across this after seeing the tutorial at ICML 2024. Thank you for the great lecture!

이지평-bo

I wonder if in-dist QA and in-dist BIO accuracy would actually grow at the same rate (rather than QA growing faster) if we restricted the graphs at 20:48 to people not in the out-dist split. Maybe it's easier to learn information about these people simply because they appear more often in the training data, rather than because the info is presented in BIO or QA form. But maybe I misunderstood something about how the datasets are designed and split. Interesting work!

ellel

Maybe a stupid question, but is there a Part 2 paper in this series?

yohanhamilton

I read another paper, Let's Think Dot by Dot, which sort of complements Result 9. It would be even more interesting to see if "...", i.e., filler tokens, could help improve the experiments in Result 9.

fengliang

What if I put the augmented data into the SFT stage rather than the pretraining stage? What would happen?

余犇-wx

So what was the paper in FOCS before LoRA?

EugeneKrevenets

Another incidental but perplexing question: what is the difference between knowledge manipulation and reasoning? That is, what are the criteria that distinguish Parts 2.1-2.2 from Part 3.2?

ironmanch

Is it possible that inverse knowledge search would be simpler for a model like BERT? Is it worth comparing?

ironmanch

Could you share the slide deck? I'd like to talk my team through these results at a reading group.

AB-cyfd