Implementing GPT-2 From Scratch (Transformer Walkthrough Part 2/2)

See part 1 here: What is a transformer?

If you enjoyed this, I expect you'd enjoy learning more about what's actually going on inside these models and how to reverse engineer them! Check out:

Further resources:
Check out these other intros to transformers for another perspective:

Timestamps:
00:00 Intro
04:01 Recap
05:03 Setup
06:04 LayerNorm
23:35 Embedding
30:07 Attention
51:22 MLP
54:00 Transformer Block
56:40 Unembedding
58:50 Full Transformer
1:01:47 Trying it out
1:11:05 Training
Comments

I actually forgot how nice it was to watch someone write good code. Thank you!

AlexOlar

This was actually one of the best tutorials I've seen. Great video!

jordan

I wish I had found your channel when I started my Master's. Right now I'm exploring and getting started with interpretability research, and I hope to get up and running soon. Thanks a lot for teaching this stuff.

imvijay

Super informative walkthrough. Appreciate the effort to make things simpler.

MegaManoja

Thanks for making these, man. Your knowledge shines bright, and sharing it like this is very appreciated.

PoppySeed

Your resources are awesome! Thanks for all your efforts to share your knowledge.

Jvo_Rien

Thanks Neel, found this to be really helpful in finally understanding transformers. Love your communication style!

mihirrege

To get rid of the excessive debug printing, replace all `if cfg.debug: print(...)` with `if self.cfg.debug: print(...)`. The current code always references the first defined instance of Config, called `cfg`.
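A minimal sketch of that fix, using an illustrative Config and LayerNorm module rather than the exact code from the video: the key point is that the constructor stores the config and `forward` checks `self.cfg.debug`, not the module-level `cfg`.

```python
from dataclasses import dataclass
import torch
import torch.nn as nn

@dataclass
class Config:
    debug: bool = True
    d_model: int = 768
    layer_norm_eps: float = 1e-5

class LayerNorm(nn.Module):
    def __init__(self, cfg: Config):
        super().__init__()
        self.cfg = cfg  # store this instance's config
        self.w = nn.Parameter(torch.ones(cfg.d_model))
        self.b = nn.Parameter(torch.zeros(cfg.d_model))

    def forward(self, residual: torch.Tensor) -> torch.Tensor:
        # Reference self.cfg, not a global cfg, so debug printing follows
        # the config this instance was actually constructed with.
        if self.cfg.debug:
            print("residual shape:", residual.shape)
        residual = residual - residual.mean(dim=-1, keepdim=True)
        scale = (residual.var(dim=-1, keepdim=True, unbiased=False) + self.cfg.layer_norm_eps).sqrt()
        return residual / scale * self.w + self.b
```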

tomasturlik

Great tutorial - very informative, and following along with the demo notebook was great too! Watched all the content.

ldub

Hi Neel. This is the best tutorial on this topic I've seen; I understood everything except for one small part. I particularly like the heavy use of einops, which makes everything very clear, at least for me. One question regarding the loss function: I've seen in other implementations (e.g. nanoGPT by Karpathy) the following loss: `F.cross_entropy(logits.view(-1, logits.size(-1)), tokens.view(-1))`, which is slightly different from what you have (if I'm not mistaken, yours can be written as `logits = logits[:, :-1]`, `tokens = tokens[:, 1:]`, and only then feeding them into `F.cross_entropy`). So the two implementations of the loss are just shifted by one. Can you elaborate on why that would be? Are both correct, or only one?
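For reference, a minimal sketch of the shifted loss being described, assuming `logits` of shape `(batch, seq, d_vocab)` and integer `tokens` of shape `(batch, seq)` (the function name is illustrative):

```python
import torch
import torch.nn.functional as F

def lm_cross_entropy(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Next-token loss: the logits at position i are scored against the
    token at position i + 1, so both tensors are trimmed by one."""
    logits = logits[:, :-1]   # the last position has no next token to predict
    targets = tokens[:, 1:]   # the first token is never a prediction target
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # (batch * (seq - 1), d_vocab)
        targets.reshape(-1),                  # (batch * (seq - 1),)
    )
```

Whether this explicit shift is needed depends on whether the targets handed to the loss are already offset by one relative to the model's input sequence.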

MikeOxmol_

Thanks for making this video available. It is most helpful.
Do you also write code for training?
The experiments I want to do involve modified training.

RalphDratman

Hi Neel. Thank you for this great tutorial. Do you have recommendations for similar technical tutorials? I'd like to see what the day-to-day research and coding of top industry researchers looks like. I'm interested in learning about topics like instruction tuning and RLHF.

kejianshi

Thank you for the very informative video! Do you know why we have bias vectors for each of the query, key, and value projections in the attention module? I thought that because the pre-attention LayerNorm also has a learnable bias, it would be redundant to have bias vectors here as well when doing the linear transformations.

Is it similar to the explanation you gave for why you implement the bias in the unembedding layer as well?

NiranjanSenthilkumar

Indeed, I did find this super useful and actually bothered to watch the whole thing :p

davidcato

Aside from cargo-culting GPT, is there a reason for setting n_heads and d_head such that n_heads * d_head = d_model? Initially I thought we were going to concatenate the outputs of the heads, which would require that equation to hold, but if we are summing the outputs of the heads, it looks like we are free to choose n_heads and d_head independently?
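A minimal sketch of the summing point, assuming an einops-style output projection (shapes and variable names here are illustrative, not the exact code from the video): each head's `d_head`-dimensional output is mapped up to `d_model` by its own slice of `W_O` and the heads are summed, so nothing in this step forces `n_heads * d_head == d_model`.

```python
import torch
from einops import einsum

batch, seq = 2, 5
d_model, n_heads, d_head = 768, 12, 32   # deliberately n_heads * d_head != d_model

z = torch.randn(batch, seq, n_heads, d_head)   # per-head attention outputs
W_O = torch.randn(n_heads, d_head, d_model)    # per-head output projection

# Each head is projected to d_model and the heads are summed over,
# so d_head can be chosen independently of d_model / n_heads.
attn_out = einsum(
    z, W_O,
    "batch seq n_heads d_head, n_heads d_head d_model -> batch seq d_model",
)
print(attn_out.shape)  # torch.Size([2, 5, 768])
```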

venusatuluri

What are the specs of the rig needed for this?

matthewpublikum

I think there's a typo in the intro code:

> log_probs = logits.log_softmax(dim=-1)
> probs = logits.log_softmax(dim=-1)

Shouldn't the second line be this instead?

> probs = logits.softmax(dim=-1)

EvanDaniel