Let's build GPT: from scratch, in code, spelled out.

Показать описание

We build a Generatively Pretrained Transformer (GPT), following the paper "Attention is All You Need" and OpenAI's GPT-2 / GPT-3. We talk about connections to ChatGPT, which has taken the world by storm. We watch GitHub Copilot, itself a GPT, help us write a GPT (meta :D!) . I recommend people watch the earlier makemore videos to get comfortable with the autoregressive language modeling framework and basics of tensors and PyTorch nn, which we take for granted in this video.

Links:

Supplementary links:

Suggested exercises:
- EX1: The n-dimensional tensor mastery challenge: Combine the `Head` and `MultiHeadAttention` into one class that processes all the heads in parallel, treating the heads as another batch dimension (answer is in nanoGPT).
- EX3: Find a dataset that is very large, so large that you can't see a gap between train and val loss. Pretrain the transformer on this data, then initialize with that model and finetune it on tiny shakespeare with a smaller number of steps and lower learning rate. Can you obtain a lower validation loss by the use of pretraining?
- EX4: Read some transformer papers and implement one additional feature or change that people seem to use. Does it improve the performance of your GPT?

Chapters:
00:00:00 intro: ChatGPT, Transformers, nanoGPT, Shakespeare
baseline language modeling, code setup
00:07:52 reading and exploring the data
00:09:28 tokenization, train/val split
00:14:27 data loader: batches of chunks of data
00:22:11 simplest baseline: bigram language model, loss, generation
00:34:53 training the bigram model
00:38:00 port our code to a script
Building the "self-attention"
00:42:13 version 1: averaging past context with for loops, the weakest form of aggregation
00:47:11 the trick in self-attention: matrix multiply as weighted aggregation
00:51:54 version 2: using matrix multiply
00:54:42 version 3: adding softmax
00:58:26 minor code cleanup
01:00:18 positional encoding
01:02:00 THE CRUX OF THE VIDEO: version 4: self-attention
01:11:38 note 1: attention as communication
01:12:46 note 2: attention has no notion of space, operates over sets
01:13:40 note 3: there is no communication across batch dimension
01:14:14 note 4: encoder blocks vs. decoder blocks
01:15:39 note 5: attention vs. self-attention vs. cross-attention
01:16:56 note 6: "scaled" self-attention. why divide by sqrt(head_size)
Building the Transformer
01:19:11 inserting a single self-attention block to our network
01:21:59 multi-headed self-attention
01:24:25 feedforward layers of transformer block
01:26:48 residual connections
01:32:51 layernorm (and its relationship to our previous batchnorm)
01:37:49 scaling up the model! creating a few variables. adding dropout
Notes on Transformer
01:42:39 encoder vs. decoder vs. both (?) Transformers
01:46:22 super quick walkthrough of nanoGPT, batched multi-headed self-attention
01:48:53 back to ChatGPT, GPT-3, pretraining vs. finetuning, RLHF
01:54:32 conclusions

Corrections:
00:57:00 Oops "tokens from the _future_ cannot communicate", not "past". Sorry! :)
01:20:05 Oops I should be using the head_size for the normalization, not C

Рекомендации по теме

Комментарии

Imagine being between your job at Tesla and your job at OpenAI, being a tad bored and, just for fun, dropping on YouTube the best introduction to deep-learning and NLP from scratch so far, for free. Amazing people do amazing things even for a hobby.

fgfanta

Living in a world where a world-class top guy posts a 2-hour video for free on how to make such cutting-edge stuff. I barely started this tutorial but at first I just wanted to say thank you mate!

LFrank

Wow! I knew nothing and now I am enlightened! I actually understand how this AI/ML model works now. As a near 70 year old that just started playing with Python, I am a living example of how effective this lecture is. My humble thanks to Andrej Karpathy for allowing to see into and understand this emerging new world.

jamesfraser

I am a college professor and learning GPT from Andrej. Every time I watch this video, I not only I learn the contents, also how to deliver any topic effectively. I would vote him as the "Best AI teacher in YouTube”. Salute to Andrej for his outstanding lectures.

BAIR

Thank you for taking the time to create these lectures. I am sure it takes a lot of time and effort to record and cut these. Your effort to level up the the community is greatly appreciated. Thanks Andrej.

softwaredevelopmentwiththo

It is difficult to comprehend how lucky we are to have you teaching us. Thank you, Andrej.

antopolskiy

I knew only python, math and definitions of NN, GA, ML and DNN. In 2 hours, this lecture has not only given me the understanding of GPT model, but also taught me how to read AI papers and turn them into code, how to use pytoch, and tons of AI definitions. This is the best lecture and practical application on AI. Because it not only gives you an idea of DNN, but also give you code directly from research papers and a final product. Looking forward to more lectures like these. Thanks Andrej Karpathy.

fslurrehman

Most clear and intuitive and well explained transformer video I've ever seen. Watched it as if it were a tv show and that's how down-to-earth this video is. Shoutout to the man of legend.

aojiao

What a feeling ! Just finished sitting on this for the weekend, building along and finally understanding Transformers. More than anything, a sense of fulfilment. Thanks Andrej.

meghanaiitb

Andrej, I cannot comprehend how much effort you have put in making these videos. Humanity is thankful to you for making these publically available and educating us with your wisdom. One thing is to know the stuff and apply it in corp setting and another thing is to use that instead to educate millions for free. This is one of the best kind of charity a CS major can do. Kudos to you and thank you so much for doing this.

JainPuneet

I cannot thank you enough for this material. I've been a spoken language technologist for 20 years and this plus your micro-grad and make more videos has given me a graduate level update in less than 10 hours. Astonishingly well-prepared and presented material. Thank you.

coopokb

Broke my back just to finish this video in single sitting. Its a lot to take at once, i think I'll have to implement it bit by bit in a span of day to actually assimilate everything.
I am very happy from the lecture/tutorial, waiting for more. Time and effort in making this video possible is highly admirable and respectable.

Thank you Andrej.

Grey_

I was always scared of Transformer's diagram. Honestly, I never understood how such schema could make sense until this day when Andrej enlightened us with his super teaching power. Thank you so much! Andrej, please save the day again by doing one more class about Stable Diffusion!! Please, you are the best!

rafaelsouza

This is AMAZING! You're an absolute legend for sharing your knowledge so freely like this Andrej! I'm finally getting some time to get into transformer architectures this is a brilliant deep dive, going to spend the weekend walking through it!! Thank you🙏🏽

NicholasRenotte

So happy to see Andrej back teaching more. His articles before Tesla were so illuminating and distilled complicated concepts into things we could all learn from. A true art. Amazing to see videos too.

thegrumpydeveloper

This lecture answers ALL my questions from the 2017 Attention Is All You Need paper. I am alway curious about the code behind Transformer. This lecture quenched my curiosity with a colab to tinker with. Thank you so much for your effort and time in creating the lecture to spread the knowledge!

ShihgianLee

Wow! Having the ex-lead of ML at Tesla make tutorials on ML is amazing. Thank you for producing these resources!

gokublack

I'm enjoying this whole series so much Andrej. They make me understand neural networks much better then anything so far in my Bachelor. As an older student that has a large incentive to be time efficient, this has been a gold send. Thank you so much!! :D

Marius

Thank you for taking the time and effort to share this, Andrej! This is of great help to lift the veil of abstractions that made it all seem inaccessible and opening up that world to ML/AI uninitiated like me. I don’t understand all of it yet but I’m now oriented and you’ve given me a lot of threads I can pull on.

nazgulizm

Just gone through all of his videos - MLP, Gradients and of course the backprop :), and finally finishing with the transformer model (decoder part). As we all know Andrej is the hero of deep learning and we are very much blessed to get this much of rich contents for free in YouTube, also from a teacher like him. Fascinating staff from a fascinating contributor in the field of AI 🙏

rangilanaoermajhi

Let's build GPT: from scratch, in code, spelled out.

Let's build GPT: from scratch, in code, spelled out.

Let's reproduce GPT-2 (124M)

Let's build the GPT Tokenizer

But what is a GPT? Visual intro to transformers | Chapter 5, Deep Learning

GPT-4o is here! Let’s build 4 things with it! | API

Let's Build: Automate ANYTHING With AI Agent Teams (Step-by-Step)

Machine Learning From Zero to GPT in 40 Minute

Building a GPT from scratch using PyTorch - dummyGPT

Build a Chat GPT Prompt Library Extension with Bubble in 10 Minutes

🐙 Lunch & Learn: Let's Build An AI Assistant With GPT-4o (w/ Joe & Winston of @Posit)...

Agent GPT lets you create a full presentation from scratch

Let's build GPT with memory: learn to code a custom LLM (Coding a Paper - Ep. 1)

Create a Python GPT Chatbot - In Under 4 Minutes

Build AI Agents with GPT-4o from scratch 🔥

Fresh Chicken Nuggets

Large Language Models from scratch

Building a neural network FROM SCRATCH (no Tensorflow/Pytorch, just numpy & math)

GPT Pilot Lets build something special part 1

Let`s Build AI: MEGA AUTOCOMPLETER (LLM Powered) | AI Engineer Project

How to Create Custom GPT | OpenAI Tutorial

Pytorch Transformers from Scratch (Attention is all you need)

Build AI Apps with ChatGPT, DALL-E, and GPT-4 – Full Course for Beginners

Implementing GPT-2 From Scratch (Transformer Walkthrough Part 2/2)

Can AI code Flappy Bird? Watch ChatGPT try