Let's build the GPT Tokenizer

The Tokenizer is a necessary and pervasive component of Large Language Models (LLMs), where it translates between strings and tokens (text chunks). Tokenizers are a completely separate stage of the LLM pipeline: they have their own training sets and training algorithms (Byte Pair Encoding), and after training they implement two fundamental functions: encode() from strings to tokens, and decode() back from tokens to strings. In this lecture we build from scratch the Tokenizer used in the GPT series from OpenAI. In the process, we will see that a lot of weird behaviors and problems of LLMs actually trace back to tokenization. We'll go through a number of these issues, discuss why tokenization is at fault, and explain why, ideally, someone out there finds a way to delete this stage entirely.
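
To make the encode()/decode() pair concrete, here is a minimal byte-level Byte Pair Encoding sketch in Python. It is an illustration written for this description, not the lecture's exact code: the helper names `get_stats`, `merge`, and `train` are assumptions, and it leaves out the regex pre-splitting and special tokens that real GPT tokenizers use.

```python
from collections import Counter

def get_stats(ids):
    """Count how often each consecutive pair of ids occurs."""
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train(text, num_merges):
    """Learn up to `num_merges` merges on top of the 256 raw byte values."""
    ids = list(text.encode("utf-8"))
    merges = {}  # (id, id) -> new id, kept in the order the merges were learned
    for step in range(num_merges):
        stats = get_stats(ids)
        if not stats:  # fewer than two ids left, nothing more to merge
            break
        pair = max(stats, key=stats.get)  # most common consecutive pair
        new_id = 256 + step
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return merges

def encode(text, merges):
    """String -> token ids: apply learned merges, earliest-learned first."""
    ids = list(text.encode("utf-8"))
    while len(ids) >= 2:
        stats = get_stats(ids)
        pair = min(stats, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:  # no learned merge applies any more
            break
        ids = merge(ids, pair, merges[pair])
    return ids

def decode(ids, merges):
    """Token ids -> string: expand each id back into its bytes."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges.items():  # dict order equals training order
        vocab[new_id] = vocab[a] + vocab[b]
    data = b"".join(vocab[i] for i in ids)
    return data.decode("utf-8", errors="replace")

merges = train("aaabdaaabac", num_merges=3)
tokens = encode("aaabdaaabac", merges)
print(len("aaabdaaabac".encode("utf-8")), "bytes ->", len(tokens), "tokens")  # 11 bytes -> 5 tokens
print(decode(tokens, merges))  # aaabdaaabac
```

The ratio of bytes to tokens printed above is the compression ratio; decoding uses `errors="replace"` because an arbitrary token sequence is not guaranteed to be valid UTF-8.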

Chapters:
00:00:00 intro: Tokenization, GPT-2 paper, tokenization-related issues
00:05:50 tokenization by example in a Web UI (tiktokenizer)
00:14:56 strings in Python, Unicode code points
00:18:15 Unicode byte encodings, ASCII, UTF-8, UTF-16, UTF-32
00:22:47 daydreaming: deleting tokenization
00:23:50 Byte Pair Encoding (BPE) algorithm walkthrough
00:27:02 starting the implementation
00:28:35 counting consecutive pairs, finding most common pair
00:30:36 merging the most common pair
00:34:58 training the tokenizer: adding the while loop, compression ratio
00:39:20 tokenizer/LLM diagram: it is a completely separate stage
00:42:47 decoding tokens to strings
00:48:21 encoding strings to tokens
00:57:36 regex patterns to force splits across categories
01:11:38 tiktoken library intro, differences between GPT-2/GPT-4 regex
01:18:26 special tokens, tiktoken handling of, GPT-2/GPT-4 differences
01:25:28 minbpe exercise time! write your own GPT-4 tokenizer
01:28:42 sentencepiece library intro, used to train Llama 2 vocabulary
01:48:11 training new tokens, example of prompt compression
01:49:58 multimodal [image, video, audio] tokenization with vector quantization
01:51:41 revisiting and explaining the quirks of LLM tokenization
02:10:20 final recommendations
02:12:50 ??? :)
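
As a small companion to the chapters on Python strings and Unicode byte encodings, the following sketch shows how a single string looks as code points and as UTF-8, UTF-16, and UTF-32 bytes. The sample string is an arbitrary choice for illustration and is not taken from the lecture.

```python
text = "hi 👋"

# A Python string is a sequence of Unicode code points.
code_points = [ord(ch) for ch in text]

# The same text under three byte encodings (little-endian, no byte order mark).
utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")
utf32 = text.encode("utf-32-le")

print(code_points)                        # [104, 105, 32, 128075]
print(list(utf8))                         # [104, 105, 32, 240, 159, 145, 139]
print(len(utf8), len(utf16), len(utf32))  # 7 10 16
```

UTF-8 keeps ASCII characters at one byte each and spends four bytes on the emoji, while UTF-32 spends four bytes on every code point. A byte-level tokenizer of the kind built in the lecture starts from the UTF-8 byte stream.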

Comments:

I'm amazed at the times we live in. One of the top AI experts in the world is sharing free tutorials that teach technologies in great detail, including examples and code. Thank you very much.

gustavojuantorena

You could have made a million bucks or spent time on a beach with a cocktail. Instead you chose to do this for the benefit of unknown strangers. You are an amazingly talented and generous man. Thank you so much!

SydneyPanda

I cannot express enough how grateful I am for your incredible tutorials. As a 44-year-old South Sudanese individual with three children and a full-time job, recently relocated from the UK to the US, my days are filled to the brim with responsibilities. However, I have made it a priority to utilize every ounce of my free time to follow your series of lectures. I must say, I have never been as excited about learning a new topic as I am now. Your clear explanations and engaging content have ignited a passion for learning within me that I never knew existed. Thank you from the bottom of my heart for all that you do.

kuoldeng

The fact that Andrej prefers being here with us over developing self-driving or developing AGI is surely one of the great things about living in the present times.

jstello

This guy left Tesla, OpenAI, and God knows how many other companies so he can teach us how state-of-the-art AI works, from scratch, clearly explained. Such a pure gem. Thank you very very very much.

MShahbazKharal

Whenever Karpathy leaves a job the ML world jumps forward a bit with his educational content

julianhunt

This is honestly better than OpenAI releasing a new model

DavidOndrej

When Andrej uploads, I immediately drop everything and watch the complete video. Thank you so much for your valuable content!

niceguysayshi

Andrej, content like this is advancing the industry more than most companies are. Thank you for educating us. You are pushing everybody forwards!

yoJuicy

The intrinsic value in Andrej making videos for the world is huge. Please, Andrej, keep doing this; no other job can output the alpha you're delivering here, pushing the boundaries of the best educational content on the planet. You were born for this.

entonbot

Thank you, Andrej! In an environment where 99% of conversations about LLMs are marketing, technical videos are especially valuable and interesting!

Pythoncode-daily

Been waiting on the lecture to drop ever since we saw the GitHub repo go public.
Andrej, on behalf of the academic and professional communities, thank you for putting the effort into creating this open knowledge.
On a personal note, I've been following your work since the early OpenAI days, and your passion during keynotes/presentations has supported my choice of pursuing a PhD in the field.
Thank you Andrej.

TheMoefd

Feynman-class teaching abilities. Amazing. Thank you very much as always Andrej

Alilinpow

The best thing I have learnt from Andrej's videos is 'How to learn?' Break things down to axiomatic levels and then connect those pieces together! It's truly enlightening!

piper_of_the_dawn

Thank you, Andrej Karpathy. Thanks for teaching Python libraries and GPT generator code. Great work.

LLMTokenizationID

Andrej, you're really good at teaching. We're lucky that you're spending your valuable time preparing this free content.

emranmohammadabuanas

OpenAI was such a closed cult, Andrej gave himself to the world.
🌹❤️🌹
Andrej deserves his flowers.

justjeremiah

Sorry Andrej, but you've missed the mark: you tell us that you don't like this topic, but you've still managed to make it fascinating!

alainherreman

I haven't watched the video yet but I just wanted to say I really appreciate you taking the time to teach this. I loved all of your previous videos. You are an excellent teacher and your style makes it super easy for me to learn.

sebah

Thank you very much for this. I'm from a traditional programming background, and your videos on AI are really excellent. I have followed the previous sessions, was able to implement things correctly, and compared some of my results with yours. That is awesome! Good luck.

arildboes