GPT-3: Language Models are Few-Shot Learners (Paper Explained)

#gpt3 #openai #gpt-3

How far can you go with ONLY language modeling? Can a large enough language model perform NLP tasks out of the box? OpenAI takes on these and other questions by training a transformer that is an order of magnitude larger than anything built before, and the results are astounding.

OUTLINE:
0:00 - Intro & Overview
1:20 - Language Models
2:45 - Language Modeling Datasets
3:20 - Model Size
5:35 - Transformer Models
7:25 - Fine Tuning
10:15 - In-Context Learning
17:15 - Start of Experimental Results
19:10 - Question Answering
23:10 - What I think is happening
28:50 - Translation
31:30 - Winograd Schemes
33:00 - Commonsense Reasoning
37:00 - Reading Comprehension
37:30 - SuperGLUE
40:40 - NLI
41:40 - Arithmetic Expressions
48:30 - Word Unscrambling
50:30 - SAT Analogies
52:10 - News Article Generation
58:10 - Made-up Words
1:01:10 - Training Set Contamination
1:03:10 - Task Examples

Abstract:
Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.
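
To make the few-shot setting from the abstract concrete, here is a minimal sketch of how a task and its demonstrations can be packed into a single text prompt, with no gradient updates involved. The function name `build_few_shot_prompt`, the `Q:`/`A:` labels, and the exact layout are illustrative assumptions, not the paper's actual prompt format; the model call itself is left out.

```python
from typing import Sequence, Tuple


def build_few_shot_prompt(
    task_description: str,
    demonstrations: Sequence[Tuple[str, str]],
    query: str,
    input_label: str = "Q",   # hypothetical label, chosen for illustration
    output_label: str = "A",  # hypothetical label, chosen for illustration
) -> str:
    """Assemble a plain-text prompt: instruction, K solved examples, one open query."""
    lines = [task_description, ""]
    for source, target in demonstrations:
        lines.append(f"{input_label}: {source}")
        lines.append(f"{output_label}: {target}")
        lines.append("")
    # The final answer slot is left empty; the language model is expected
    # to continue the text from here.
    lines.append(f"{input_label}: {query}")
    lines.append(f"{output_label}:")
    return "\n".join(lines)


# Few-shot: two demonstrations precede the query.
prompt = build_few_shot_prompt(
    "Translate English to French.",
    [("sea otter", "loutre de mer"), ("cheese", "fromage")],
    "peppermint",
)

# Zero-shot: the same task with only the instruction and the query.
zero_shot = build_few_shot_prompt("Translate English to French.", [], "peppermint")

print(prompt)
```

The only thing that changes between zero-shot, one-shot, and few-shot in this sketch is the number of demonstrations placed in the context; the model weights stay fixed throughout.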

Authors: Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei

Comments

Watching videos about large language models really makes me ask myself: "What is really "human" reasoning?" And how do humans learn stuff?
A great point on arithmetic operations!

that_guy

Imagine telling Alan Turing we created a 5.7 trillion bit program to answer "what is one plus one?" lol

larrybird

This is such a fun format for educational videos! And with a huge backlog of videos that look worth checking out, there's so much to learn. Excited for this channel!

DeveloperDesmond

Explaining papers? That's awesome. Subscribed instantly. Thanks for your effort and please continue to do so.

KivySchool

Thank you so much for this. Your explanations are very clear and I appreciate you sharing your views on the paper. Keep up the good work!

lorenzoampil

Wow, an OUTLINE, I didn't know that was possible on YouTube :o
thx

Tondadrd

Great video. Your explanation made the distinction between memorizing and reasoning clear to me, just like the two ways students study for tests. If the test consists mostly of problems encountered before, the students who memorize will likely perform better than the ones who reason. Just as you pointed out, when one has memorized the internet, there won't be a lot of things one hasn't seen.

lgoose

Really appreciate your insight that I otherwise wouldn't have got from just the paper.

tianyulu

I absolutely love your videos. Thank you so much for explaining everything so clearly.

marziehzargari

Thank you for the explanation. I really enjoyed learning about it and can't wait to, someday, be able to work with such models.

siddharthbhargava

Awesome summary Yannic - very informative. Thank you!

eddiesagra

Yannic, great presentation as always! But I think the power of transformer models is to "discover" structural similarities (frequent repeating structures). Many of these "rules" are not learned for exact input sequences but for sequences or co-occurrences of sets or classes of input symbols. This is IMO different from exact "regex-like" recall, which would not tolerate different query representations. I think the embeddings on all layer outputs are some form of thought or summary vectors that capture the gist of the context up to the current token. Attention can be seen as a key-value store, but I prefer to think of it as a soft memory-read and transform operation. The computational capabilities of transformer models are inherently limited by the number of feed-forward and attention steps, but it has been shown with smaller models that this is enough for simple arithmetic operations that generalize beyond the numbers presented during training. While it is still not AGI, I must personally say that I am again and again impressed by the "world-model" / knowledge base that is generated via a "stupid" next or masked token prediction objective... ;-)

bluelng

Perfect. This is what I was looking for: a short, self-explanatory video. Thank you.

TusharKale

Yannic, thanks for this detailed breakdown of the paper - appreciate the way you have de-hyped it.

ThomasDawsonco

Your channel is a treasure, thanks for doing this (making videos in general I mean)

carlos

That's a great job! Thank you for all the insights!

PrzemekChojeckiAI

I am studying linguistics at uni and I'm writing my dissertation on whether humans can distinguish human from GPT-3-generated language. I am extending the findings of this paper by investigating the use of GPT-3 in social media, news and email contexts, using a large Turing-style survey whereby people are required to pick the AI response over the human one. I will apply the findings to potential phishing, fake news and ethical implications. I study linguistics, not computer science, so I found this video extremely useful! Thank you for a great explanation.

catharinecox

Many thanks. I started reading this and quickly ran out of steam. You boiled this down nicely and I really appreciate your point that given the gigantic training set, they are likely "memorizing" relations in an unintended but superficially useful way. I hope that the community digs into this more deeply and can possibly turn this into a purposeful strategy... Sometimes brute force is effective, if not efficient.

JohnKruse

Great video, Yannic! Seriously, this was fast, and yet you have not compromised one bit on quality. :)
Even I feel it has just memorized things more or less.

bhavulgauri
