Implementing GPT-2 From Scratch (Transformer Walkthrough Part 2/2)

See part 1 here: What is a transformer?

If you enjoyed this, I expect you'd enjoy learning more about what's actually going on inside these models and how to reverse engineer them! Check out:

Further resources:
Check out these other intros to transformers for another perspective:

Timestamps:
00:00 Intro
04:01 Recap
05:03 Setup
06:04 LayerNorm
23:35 Embedding
30:07 Attention
51:22 MLP
54:00 Transformer Block
56:40 Unembedding
58:50 Full Transformer
1:01:47 Trying it out
1:11:05 Training
Comments

I actually forgot how nice it was to watch someone write good code. Thank you!

AlexOlar

This was actually one of the best tutorials I've seen. Great video!

jordan

I wish I had found your channel when I started my Master's. Right now I'm exploring and getting started with interpretability research, and I hope to get up and running soon. Thanks a lot for teaching this stuff.

imvijay

Super informative walkthrough. Appreciate the effort to make things simpler.

MegaManoja

Thanks for making these, man. Your knowledge shines bright, and sharing it like this is very appreciated.

PoppySeed

Your resources are awesome! Thanks for all your efforts to share your knowledge.

Jvo_Rien

Thanks Neel, found this to be really helpful in finally understanding transformers. Love your communication style!

mihirrege

To get rid of the excessive debug printing, replace all `if cfg.debug: print(...)` with `if self.cfg.debug: print(...)`. The current code always references the first defined instance of Config, called `cfg`.
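A minimal sketch of that fix, using an illustrative Config and LayerNorm module rather than the exact code from the video: the key point is that the constructor stores the config and `forward` checks `self.cfg.debug`, not the module-level `cfg`.

```python
from dataclasses import dataclass
import torch
import torch.nn as nn

@dataclass
class Config:
    debug: bool = True
    d_model: int = 768
    layer_norm_eps: float = 1e-5

class LayerNorm(nn.Module):
    def __init__(self, cfg: Config):
        super().__init__()
        self.cfg = cfg  # store this instance's config
        self.w = nn.Parameter(torch.ones(cfg.d_model))
        self.b = nn.Parameter(torch.zeros(cfg.d_model))

    def forward(self, residual: torch.Tensor) -> torch.Tensor:
        # Reference self.cfg, not a global cfg, so debug printing follows
        # the config this instance was actually constructed with.
        if self.cfg.debug:
            print("residual shape:", residual.shape)
        residual = residual - residual.mean(dim=-1, keepdim=True)
        scale = (residual.var(dim=-1, keepdim=True, unbiased=False) + self.cfg.layer_norm_eps).sqrt()
        return residual / scale * self.w + self.b
```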

tomasturlik

Great tutorial - very informative, and following along with the demo notebook was great too! Watched all the content.

ldub

Hi Neel. This is the best tutorial on this topic I've seen; I understood everything except for one small part. I particularly like the heavy use of einops, which makes everything very clear, at least for me. One question regarding the loss function: I've seen in other implementations (e.g. nanoGPT by Karpathy) the following loss: `F.cross_entropy(logits.view(-1, logits.size(-1)), tokens.view(-1))`, which is slightly different from what you have (if I'm not mistaken, yours can be written as `logits = logits[:, :-1]`, `tokens = tokens[:, 1:]`, and only then feeding them into `F.cross_entropy`). So the two implementations of the loss are just shifted by one. Can you elaborate on why that would be? Are both correct, or only one?
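For reference, a minimal sketch of the shifted loss being described, assuming `logits` of shape `(batch, seq, d_vocab)` and integer `tokens` of shape `(batch, seq)` (the function name is illustrative):

```python
import torch
import torch.nn.functional as F

def lm_cross_entropy(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Next-token loss: the logits at position i are scored against the
    token at position i + 1, so both tensors are trimmed by one."""
    logits = logits[:, :-1]   # the last position has no next token to predict
    targets = tokens[:, 1:]   # the first token is never a prediction target
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # (batch * (seq - 1), d_vocab)
        targets.reshape(-1),                  # (batch * (seq - 1),)
    )
```

Whether this explicit shift is needed depends on whether the targets handed to the loss are already offset by one relative to the model's input sequence.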

MikeOxmol_

Thanks for making this video available. It is most helpful.
Do you also write code for training?
The experiments I want to do involve modified training.

RalphDratman

Hi Neel. Thank you for this great tutorial. Do you have recommendations for similar technical tutorials? I'd like to see what the day-to-day research and coding of top industry researchers looks like. I'm interested in learning about topics like instruction tuning and RLHF.

kejianshi

Thank you for the very informative video! Do you know why we have bias vectors for each of the query, key, and value projections in the attention module? I thought that because the pre-attention LayerNorm also has a learnable bias, it would be redundant to have bias vectors here as well when doing the linear transformations.

Is it similar to the explanation you gave for why you implement the bias in the unembedding layer as well?

NiranjanSenthilkumar

Indeed, I did find this super useful and actually bothered to watch the whole thing :p

davidcato

Aside from cargo-culting GPT, is there a reason for setting n_heads and d_head such that n_heads * d_head = d_model? Initially I thought we were going to concatenate the outputs of the heads, which would require that equation to hold, but if we are summing the outputs of the heads, it looks like we are free to choose n_heads and d_head independently?
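A minimal sketch of the summing point, assuming an einops-style output projection (shapes and variable names here are illustrative, not the exact code from the video): each head's `d_head`-dimensional output is mapped up to `d_model` by its own slice of `W_O` and the heads are summed, so nothing in this step forces `n_heads * d_head == d_model`.

```python
import torch
from einops import einsum

batch, seq = 2, 5
d_model, n_heads, d_head = 768, 12, 32   # deliberately n_heads * d_head != d_model

z = torch.randn(batch, seq, n_heads, d_head)   # per-head attention outputs
W_O = torch.randn(n_heads, d_head, d_model)    # per-head output projection

# Each head is projected to d_model and the heads are summed over,
# so d_head can be chosen independently of d_model / n_heads.
attn_out = einsum(
    z, W_O,
    "batch seq n_heads d_head, n_heads d_head d_model -> batch seq d_model",
)
print(attn_out.shape)  # torch.Size([2, 5, 768])
```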

venusatuluri

What are the specs of the rig needed for this?

matthewpublikum

I think there's a typo in the intro code:

> log_probs = logits.log_softmax(dim=-1)
> probs = logits.log_softmax(dim=-1)

Shouldn't the second line be this instead?

> probs = logits.softmax(dim=-1)

EvanDaniel