OpenAI GPT-3: Language Models are Few-Shot Learners

**ERRATA**: OpenAI/GPT-3 DOES NOT USE Microsoft's ZeRO/DeepSpeed for training

In this episode of Machine Learning Street Talk, Tim Scarfe, Yannic Kilcher and Connor Shorten discuss their takeaways from OpenAI’s GPT-3 language model. OpenAI trained a 175 BILLION parameter autoregressive language model. The paper demonstrates how self-supervised language modelling at this scale can perform many downstream tasks without fine-tuning.

00:00:00 Intro
00:00:54 ZeRO1+2 (model + Data parallelism) [GPT-3 DOES *NOT* USE THIS] (Connor)
00:03:17 Recent history of NLP (Tim)
00:06:04 Yannic "Light-speed" Kilcher's brief overview of GPT-3
00:14:25 Reviewing Yannic's YT comments on his GPT-3 video (Tim)
00:20:26 Main show intro
00:23:03 Is GPT-3 reasoning?
00:28:15 Architecture discussion and autoregressive (GPT*) vs denoising autoencoder (BERT)
00:36:18 Utility of GPT-3 in industry
00:43:03 Can GPT-3 do math? (reasoning/system 1/system 2)
00:51:03 Generalisation
00:56:48 Esoterics of language models
00:58:46 Architectural trade-offs
01:07:37 Memorization machines and interpretability
01:17:16 Nearest neighbour probes / watermarks
01:20:03 YouTube comments on GPT-3 video
01:21:50 GPT-3 news article generation issue
01:27:36 Sampling data for language models / bias / fairness / politics
01:51:12 Outro

These paradigms of task adaptation are divided into zero-, one-, and few-shot learning. Zero-shot learning is the extreme case where we expect a language model to perform a task such as sentiment classification or extractive question answering without any additional supervision. One- and few-shot learning provide some examples to the model. However, GPT-3's definition of this diverges a bit from the conventional literature: GPT-3 receives its one- or few-shot examples as "in-context learning". Instead of fine-tuning the model on a few examples, the model has to use the input to infer the downstream task. For example, the GPT-3 transformer has an input sequence of 2048 tokens, so demonstrations of a task such as Yelp sentiment reviews would have to fit into this input sequence along with the new review.
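To make the idea concrete, here is a minimal sketch of what a few-shot, in-context prompt could look like. The instruction text, the example reviews, the build_few_shot_prompt helper, and the crude whitespace word count standing in for a real BPE token count are all our own illustrative assumptions, not material from the paper.

```python
# Minimal sketch of few-shot "in-context learning": the task is conveyed
# entirely through the prompt, with no gradient updates to the model.
# The demonstrations and the crude length check are illustrative assumptions.

CONTEXT_TOKENS = 2048  # GPT-3's input sequence length (from the paper)

def build_few_shot_prompt(demonstrations, new_review):
    """Concatenate labelled examples and the unlabelled query into one prompt."""
    lines = ["Classify the sentiment of each review as Positive or Negative.", ""]
    for review, label in demonstrations:
        lines.append(f"Review: {review}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    lines.append(f"Review: {new_review}")
    lines.append("Sentiment:")  # the model is expected to continue from here
    return "\n".join(lines)

demos = [
    ("The pasta was cold and the service was slow.", "Negative"),
    ("Friendly staff and the best burger I've had in years.", "Positive"),
]
prompt = build_few_shot_prompt(demos, "Great coffee, but the wait was far too long.")

# Everything -- the demonstrations plus the new review -- must fit in the
# 2048-token window; a real check would use the model's BPE tokenizer,
# not a whitespace word count.
assert len(prompt.split()) < CONTEXT_TOKENS
print(prompt)
```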

**ERRATA-continued** It has come to our attention that there was a serious factual error in our video -- GPT-3 DOES NOT USE Microsoft's ZeRO/ZeRO2 or DeepSpeed for training and there is no reference to this in either their blog post or paper. We are really sorry about this mistake and will be more careful to fact-check in future.

Thanks for watching! Please Subscribe!

Paper Links:

#machinelearning #naturallanguageprocessing #deeplearning #gpt3
Comments

ERRATA: Sorry for the mixup, GPT-3 does not actually use ZeRO or DeepSpeed!

MachineLearningStreetTalk

Thank you for a great video - love your editing and clarity of explanation.

PrzemekChojeckiAI

Informative discussion guys - thank you! Really liked the discussion on the age of the data making up the corpus - hadn't thought about this before :)

eddiesagra

The review of GPT-3, along with a push in subscriptions owing to recent popular paper reviews such as ResNet and Word2Vec (plus years of hard work), has made @Yannic an overnight star :)

LNJP

I understood like 5% of what you said but my brain is slowly converging to understand it better and better :D Thanks for your video! Will watch sequences of it as my new Netflix & Think practice.

JousefM

Re: utility of GPT-3 in industry, on the topic of knowledge mining: regardless of the model used for inference, I'm not sure there is a good way to do data sensitivity classification yet. Without a good data protection mechanism, perhaps few would start pouring all their documents into any system based on these models. Further, just as with adversarial attacks on face recognition algorithms, perhaps we could also see attempts at fooling GPT-3 using specially crafted phrases?

iuhh

OK, if this guy removes his "Top Gun" glasses, maybe we can get more of what he really wants to say.

ctpact

Was not expecting the Arnold clip. I lol'd.

snippletrap

No guest!!! Yet it's interesting.
How true is it that training GPT-3 cost them

vinayreddy

Interesting debates. It would be useful if you linked the other articles shown during the episode.

fabmilo

I agree that the model recites from memory. I tested text generation to write a "story"; each yarn spun by these models can actually be traced to an existing book.
That said, the model is really powerful when used to automate comprehension, classification, and extraction tasks. The value of language models in these tasks is the essence of "no code", especially GPT: you can teach the model a task like you would teach a nine-year-old, using just English syntax.

gibreel

You mentioned question answering, as opposed to the typical question asking. From an education perspective this is an important shift from consumerism to production. Something I'm interested in is the capacity of these models to be tuned on downstream tasks that can ask meaningful questions about arbitrary input text, to enhance a human learner's comprehension and help them recite salient facts or even concepts useful to their field of study. Imagine using BERT tuned on question/answer pairs to support a learner's journey to internalizing essential facts and knowledge that elevate them to a level of reasoning about the acquired knowledge. Could this be a natural collaboration rather than some dichotomous competition?

duxoroxor

Great video. Now I don't see GPT-3 as useful for knowledge mining. I feel my hands are tied if I want to fine-tune the model for my NLP task. I would prefer BERT in that matter.

crimythebold

This is great. I was wondering, what is the software used at 3:35? Neat visualization.

imranq

Can we use it? from GPT-3 import tokenizer ...?

bryancc
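On the import question above: there is no local package to import for GPT-3 itself; the weights are not released, and access is through OpenAI's hosted API. What can be loaded locally is the byte-level BPE tokenizer family GPT-3 builds on. A small sketch, assuming the Hugging Face transformers library (our choice, not something referenced in the episode):

```python
# There is no installable "GPT-3" package; the model sits behind OpenAI's API.
# The byte-level BPE tokenizer it builds on is in the same family as GPT-2's
# and can be loaded locally. Using Hugging Face `transformers` here is our
# own assumption, not something mentioned in the video or paper.
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
print(tokenizer.tokenize("Language models are few-shot learners"))
print(tokenizer.encode("Language models are few-shot learners"))
```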

I've heard that the cerebellum learns "small programs" so as to execute them fast and sort of automatically. The "reasoning" part of our brain creates/distills those programs and passes them to the cerebellum, so it seems we need to invent the reasoning system. What's interesting is that a human can live a "normal" life without the cerebellum; they can't execute these automatic tasks fast (which of course is terrible), but they are functional.

ikoukas

Indeed, we shouldn't think the computational capacity of this era is special. Something beyond raw computational capacity may be what makes the magic!

anonymous

Can GPT-3's embeddings be used for topic modeling?

monart
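On the embeddings question above: GPT-3's activations aren't publicly accessible, so any local experiment needs a stand-in. A hedged sketch using GPT-2 hidden states from Hugging Face transformers, mean-pooled and clustered with k-means as a crude topic model; the model choice, pooling, and clustering method are all illustrative assumptions:

```python
# Hedged sketch: GPT-3's activations aren't available, so GPT-2 stands in.
# Mean-pooled final-layer hidden states are clustered with k-means as a
# crude "topic model". All choices here are illustrative assumptions.
import torch
from sklearn.cluster import KMeans
from transformers import GPT2Model, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2").eval()

docs = [
    "The striker scored twice in the second half.",
    "The midfielder was booked for a late tackle.",
    "The central bank raised interest rates again.",
    "Inflation figures came in above expectations.",
]

embeddings = []
with torch.no_grad():
    for doc in docs:
        inputs = tokenizer(doc, return_tensors="pt")
        hidden = model(**inputs).last_hidden_state        # (1, seq_len, 768)
        embeddings.append(hidden.mean(dim=1).squeeze(0).numpy())  # mean pool

labels = KMeans(n_clusters=2, n_init=10).fit_predict(embeddings)
for doc, label in zip(docs, labels):
    print(label, doc)
```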

The unscrambling task confused me a bit. I mean, if you scramble a word, how can you be sure such a "scrambled" word would be in the vocabulary, in order to assign a token (number) to it so the model could approach the task?
Maybe I am confused, but as far as I understand, each word has a token representation in the language model, and such tokenization comes from the training set, doesn't it?
Thanks!
Amazing videos and discussions!

dariodemattiesreyes
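On the unscrambling question above: the vocabulary is built from subword pieces (down to single bytes), so a scrambled string that never appeared in training can still be tokenized; it just splits into more, smaller pieces. A small sketch using GPT-2's byte-level BPE tokenizer from Hugging Face transformers as a stand-in for GPT-3's (the library choice is our own assumption); "lyinevitab" is the cycled form of "inevitably" from the paper's word-unscrambling tasks:

```python
# Subword (byte-level BPE) tokenization can encode any string, including a
# scrambled word never seen in training -- it just breaks into more pieces.
# GPT-2's tokenizer is used here as a stand-in for GPT-3's.
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

for word in ["inevitably", "lyinevitab"]:  # original vs. cycled letters
    pieces = tokenizer.tokenize(" " + word)  # leading space marks a word start for GPT-2
    print(f"{word!r} -> {pieces}")
```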