GPT-3: Language Models are Few-Shot Learners (Paper Explained)

#gpt3 #openai #gpt-3

How far can you go with ONLY language modeling? Can a large enough language model perform NLP tasks out of the box? OpenAI takes on these and other questions by training a transformer that is an order of magnitude larger than anything that has ever been built before, and the results are astounding.

OUTLINE:
0:00 - Intro & Overview
1:20 - Language Models
2:45 - Language Modeling Datasets
3:20 - Model Size
5:35 - Transformer Models
7:25 - Fine Tuning
10:15 - In-Context Learning
17:15 - Start of Experimental Results
19:10 - Question Answering
23:10 - What I think is happening
28:50 - Translation
31:30 - Winograd Schemes
33:00 - Commonsense Reasoning
37:00 - Reading Comprehension
37:30 - SuperGLUE
40:40 - NLI
41:40 - Arithmetic Expressions
48:30 - Word Unscrambling
50:30 - SAT Analogies
52:10 - News Article Generation
58:10 - Made-up Words
1:01:10 - Training Set Contamination
1:03:10 - Task Examples

Abstract:
Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.
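
To make the "few-shot demonstrations specified purely via text" idea in the abstract concrete, here is a minimal sketch of how such a prompt might be assembled. The function name `build_few_shot_prompt`, the `=>` separator, and the example word pairs are illustrative assumptions, not the paper's code; no model is called, the snippet only builds the text that a language model would be asked to continue.

```python
def build_few_shot_prompt(task_description, demonstrations, query):
    """Join an instruction, K worked examples, and a new query into one text prompt.

    Hypothetical helper for illustration: the "=>" separator is an assumed format.
    """
    lines = [task_description, ""]
    for source, target in demonstrations:
        # Each demonstration is a complete input/output pair shown in context.
        lines.append(f"{source} => {target}")
    # The last line is left open; the model is expected to continue it.
    lines.append(f"{query} =>")
    return "\n".join(lines)


# Two demonstrations, i.e. a "two-shot" prompt; no gradient updates are involved.
demos = [("sea otter", "loutre de mer"), ("cheese", "fromage")]
prompt = build_few_shot_prompt("Translate English to French:", demos, "peppermint")
print(prompt)
```

Passing an empty list of demonstrations would give the zero-shot variant, and a single pair the one-shot variant; in every case the task is specified only through the text of the prompt.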

Authors: Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei

Links:
Comments
Author

Watching videos about large language models really makes me ask myself: what is "human" reasoning, really? And how do humans learn stuff?
A great point on arithmetic operations!

that_guy
Author

Imagine telling Alan Turing we created a 5.7 trillion bit program to answer "what is one plus one?" lol

larrybird
Author

This is such a fun format for educational videos! And with a huge backlog of videos that look worth checking out, there's so much to learn. Excited for this channel!

DeveloperDesmond
Author

Explaining papers? That's awesome. Subscribed instantly. Thanks for your effort and please continue to do so.

KivySchool
Author

Thank you so much for this. Your explanations are very clear and I appreciate you sharing your views on the paper. Keep up the good work!

lorenzoampil
Author

Wow, an OUTLINE, I didn't know that was possible on YouTube :o
thx

Tondadrd
Author

Great video. Your explanation made clear to me the distinction between memorizing and reasoning, just like the two ways students study for tests. If the test consists mostly of problems encountered before, the students who memorize will likely perform better than the ones who reason. Just as you pointed out, when one has memorized the internet, there won't be a lot of things one hasn't seen.

lgoose
Author

Really appreciate your insight that I otherwise wouldn't have got from just the paper.

tianyulu
Author

I absolutely love your videos. Thank you so much for explaining everything so clearly.

marziehzargari
Author

Thank you for the explanation. I really enjoyed learning about it and can't wait to, someday, be able to work with such models.

siddharthbhargava
Author

Awesome summary Yannic - very informative. Thank you!

eddiesagra
Author

Yannic, great presentation as always! But I think the power of transformer models is to "discover" structural similarities (frequently repeating structures). Many of these "rules" are not learned for exact input sequences but for sequences or co-occurrences of sets or classes of input symbols. This is IMO different from exact "regex-like" recall, which would not tolerate different query representations. I think the embeddings at all layer outputs are some form of thought or summary vectors that capture the gist of the context up to the current token. Attention can be seen as a key-value store, but I prefer to think of it as a soft memory-read and transform operation. The computational capabilities of transformer models are inherently limited by the number of feed-forward and attention steps, but it has been shown with smaller models that this is enough for simple arithmetic operations that generalize beyond the numbers presented during training, etc. While it is still not AGI, I must personally say that I am again and again impressed by the "world model" / knowledge base that is generated via a "stupid" next or masked token prediction objective... ;-)

bluelng
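
The "soft memory read" view of attention in the comment above can be sketched in a few lines. This is a minimal, single-query illustration of scaled dot-product attention under simplifying assumptions (one head, no learned projections, plain Python lists); the names `softmax` and `soft_lookup` and the toy keys and values are made up for the example and are not taken from the paper or the video.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_lookup(query, keys, values):
    """Read from a key-value memory softly: every value contributes,
    weighted by how well its key matches the query."""
    d = len(query)
    # Dot-product similarity between the query and each key, scaled by sqrt(d).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    # Weighted average of the stored values instead of a single exact hit.
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights


keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0], [20.0], [30.0]]
output, weights = soft_lookup([5.0, 0.0], keys, values)
print(weights, output)
```

Unlike an exact dictionary lookup, a query that matches no key perfectly still returns a blend of the stored values, which is one way to read the commenter's point that this kind of recall tolerates different query representations.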
Author

Perfect. This is what I was looking for: a short, self-explanatory video, and I found it. Thank you.

TusharKale
Author

Yannic, thanks for this detailed breakdown of the paper - appreciate the way you have de-hyped it.

ThomasDawsonco
Author

Your channel is a treasure, thanks for doing this (making videos in general I mean)

carlos
Author

That's a great job! Thank you for all the insights!

PrzemekChojeckiAI
Author

I am studying linguistics at uni and I'm writing my dissertation on whether humans can distinguish human from GPT-3-generated language. I am extending the findings of this paper by investigating the use of GPT-3 in social media, news and email contexts, using a large Turing-style survey in which people are required to pick the AI response over the human one. I will apply the findings to potential phishing, fake news and ethical implications. I study linguistics, not computer science, so I found this video extremely useful! Thank you for a great explanation.

catharinecox
Author

Many thanks. I started reading this and quickly ran out of steam. You boiled this down nicely and I really appreciate your point that given the gigantic training set, they are likely "memorizing" relations in an unintended but superficially useful way. I hope that the community digs into this more deeply and can possibly turn this into a purposeful strategy... Sometimes brute force is effective, if not efficient.

JohnKruse
Author

Great video, Yannic! Seriously, this was fast, yet you haven't compromised on quality one bit. :)
I also feel it has more or less just memorized things.

bhavulgauri