Let's build the GPT Tokenizer

The Tokenizer is a necessary and pervasive component of Large Language Models (LLMs), where it translates between strings and tokens (text chunks). Tokenizers are a completely separate stage of the LLM pipeline: they have their own training sets and training algorithms (Byte Pair Encoding), and after training they implement two fundamental functions: encode() from strings to tokens, and decode() back from tokens to strings. In this lecture we build from scratch the Tokenizer used in the GPT series from OpenAI. In the process, we will see that a lot of weird behaviors and problems of LLMs actually trace back to tokenization. We'll go through a number of these issues, discuss why tokenization is at fault, and why, ideally, someone out there will find a way to delete this stage entirely.
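To make the description above concrete, here is a minimal sketch of byte-level Byte Pair Encoding with the two functions it mentions, encode() and decode(). It is an illustration under stated assumptions, not the code from the lecture: the helper names (`get_pair_counts`, `merge`, `train`) and the choice to pass the learned merges around as a plain dict are hypothetical, and it skips the regex pre-splitting and special tokens that real GPT tokenizers add.

```python
from collections import Counter


def get_pair_counts(ids):
    """Count how often each adjacent pair of token ids occurs."""
    return Counter(zip(ids, ids[1:]))


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train(text, num_merges):
    """Learn up to `num_merges` merges on top of the 256 raw byte values."""
    ids = list(text.encode("utf-8"))
    merges = {}  # (id, id) -> new id, stored in the order learned
    for step in range(num_merges):
        counts = get_pair_counts(ids)
        if not counts:
            break  # nothing left to merge
        pair = max(counts, key=counts.get)  # most frequent adjacent pair
        new_id = 256 + step
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return merges


def encode(text, merges):
    """String -> token ids, applying merges in the order they were learned."""
    ids = list(text.encode("utf-8"))
    while len(ids) >= 2:
        counts = get_pair_counts(ids)
        # Among the pairs present, pick the one that was learned earliest.
        pair = min(counts, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break  # no remaining pair is mergeable
        ids = merge(ids, pair, merges[pair])
    return ids


def decode(ids, merges):
    """Token ids -> string, by expanding every id back to its bytes."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges.items():  # dict order == learned order
        vocab[new_id] = vocab[a] + vocab[b]
    data = b"".join(vocab[i] for i in ids)
    # errors="replace" because an arbitrary id sequence need not be valid UTF-8
    return data.decode("utf-8", errors="replace")


text = "aaabdaaabac"
merges = train(text, num_merges=3)
tokens = encode(text, merges)
print(len(text.encode("utf-8")), "bytes ->", len(tokens), "tokens")
print(decode(tokens, merges) == text)
```

Because the base vocabulary is the 256 byte values, text that was never seen during training still encodes and decodes losslessly; it simply compresses less.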

Chapters:
00:00:00 intro: Tokenization, GPT-2 paper, tokenization-related issues
00:05:50 tokenization by example in a Web UI (tiktokenizer)
00:14:56 strings in Python, Unicode code points
00:18:15 Unicode byte encodings, ASCII, UTF-8, UTF-16, UTF-32
00:22:47 daydreaming: deleting tokenization
00:23:50 Byte Pair Encoding (BPE) algorithm walkthrough
00:27:02 starting the implementation
00:28:35 counting consecutive pairs, finding most common pair
00:30:36 merging the most common pair
00:34:58 training the tokenizer: adding the while loop, compression ratio
00:39:20 tokenizer/LLM diagram: it is a completely separate stage
00:42:47 decoding tokens to strings
00:48:21 encoding strings to tokens
00:57:36 regex patterns to force splits across categories
01:11:38 tiktoken library intro, differences between GPT-2/GPT-4 regex
01:18:26 special tokens, tiktoken handling of, GPT-2/GPT-4 differences
01:25:28 minbpe exercise time! write your own GPT-4 tokenizer
01:28:42 sentencepiece library intro, used to train Llama 2 vocabulary
01:48:11 training new tokens, example of prompt compression
01:49:58 multimodal [image, video, audio] tokenization with vector quantization
01:51:41 revisiting and explaining the quirks of LLM tokenization
02:10:20 final recommendations
02:12:50 ??? :)
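
The chapters on Python strings, Unicode code points, and byte encodings can be previewed with a few lines of standard-library Python. This is only a sketch, and the sample string is an arbitrary choice rather than one taken from the lecture; it shows how the same text has different lengths depending on whether you count code points or encoded bytes.

```python
# Sample string: "h", U+00E9 (e with acute accent), "llo", a space, U+1F44B (waving hand).
text = "h\u00e9llo \U0001F44B"

# A Python str is a sequence of Unicode code points; ord() gives each one's integer value.
code_points = [ord(ch) for ch in text]

# The same text serialized to bytes under three encodings
# (the "-le" variants avoid a leading byte-order mark).
utf8 = list(text.encode("utf-8"))
utf16 = list(text.encode("utf-16-le"))
utf32 = list(text.encode("utf-32-le"))

print("code points:", len(code_points))
print("UTF-8 bytes:", len(utf8))
print("UTF-16 bytes:", len(utf16))
print("UTF-32 bytes:", len(utf32))
```

UTF-8 is variable-length (1 to 4 bytes per code point) and keeps ASCII as single bytes, which makes its byte stream a natural starting alphabet of 256 symbols for byte-level BPE.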

Exercises:

Links:

Supplementary links:

Andrej Karpathy

Related videos

Comments

I'm amazed at the times we live in. One of the top AI experts in the world is sharing free tutorials that teach these technologies in great detail, including examples and code. Thank you very much.

gustavojuantorena

You could have made a million bucks or spent time on a beach with a cocktail. Instead you chose to do this for the benefit of unknown strangers. You are an amazingly talented and generous man. Thank you so much!

SydneyPanda

I cannot express enough how grateful I am for your incredible tutorials. As a 44-year-old South Sudanese individual with three children and a full-time job, recently relocated from the UK to the US, my days are filled to the brim with responsibilities. However, I have made it a priority to utilize every ounce of my free time to follow your series of lectures. I must say, I have never been as excited about learning a new topic as I am now. Your clear explanations and engaging content have ignited a passion for learning within me that I never knew existed. Thank you from the bottom of my heart for all that you do.

kuoldeng

The fact that Andrej prefers being here with us over developing self driving or developing AGI is surely one of the great things about living in the present times.

jstello

This guy left Tesla, OpenAI, and God knows how many other companies so he can teach us how state-of-the-art AI works, from scratch, clearly explained. Such a pure gem. Thank you very very very much.

MShahbazKharal

Whenever Karpathy leaves a job the ML world jumps forward a bit with his educational content

julianhunt

This is honestly better than OpenAI releasing a new model

DavidOndrej

When Andrej uploads, I immediately drop everything and watch the complete video. Thank you so much for your valuable content!

niceguysayshi

Andrej, content like this is advancing the industry more than most companies are. Thank you for educating us. You are pushing everybody forwards!

yoJuicy

The intrinsic value in Andrej making videos for the world is huge. Please, Andrej, keep doing this; no other job can output the alpha you're delivering here, pushing the boundaries of the best educational content on the planet. You were born for this.

entonbot

Thank you, Andrej! In an environment where 99% of conversations about LLMs are marketing, technical videos are especially valuable and interesting!

Pythoncode-daily

Been waiting on the lecture to drop ever since we saw the GitHub repo go public.
Andrej, on behalf of the academic and professional communities, thank you for putting the effort into creating this open knowledge.
On a personal note, I've been following your work since the early OpenAI days, and your passion during keynotes/presentations has supported my choice of pursuing a PhD in the field.
Thank you Andrej.

TheMoefd

Feynman-class teaching abilities. Amazing. Thank you very much as always Andrej

Alilinpow

The best thing I have learnt from Andrej's videos is 'How to learn?' Break things down to axiomatic levels and then connect those pieces together! It's truly enlightening!

piper_of_the_dawn

Thank you Andrej Karpathy. Thanks for teaching Python libraries, GPT generator codes. Great work

LLMTokenizationID

Andrej, you're really good at teaching. We're lucky that you're spending your valuable time preparing this free content.

emranmohammadabuanas

OpenAI was such a closed cult, Andrej gave himself to the world.
🌹❤️🌹
Andrej deserves his flowers.

justjeremiah

Sorry Andrej, but you've missed the mark: you tell us that you don't like this topic, but you've still managed to make it fascinating!

alainherreman

I haven't watched the video yet but I just wanted to say I really appreciate you taking the time to teach this. I loved all of your previous videos. You are an excellent teacher and your style makes it super easy for me to learn.

sebah

Thank you very much for this. I'm from a traditional programming background, and your videos on AI are really excellent. I have followed the previous sessions, been able to implement things correctly, and compared some of my results with yours. That is awesome! Good luck.

arildboes

Let's build GPT: from scratch, in code, spelled out.

Building a new tokenizer

Let's reproduce GPT-2 (124M)

Create a Large Language Model from Scratch with Python – Tutorial

Coding a Transformer from scratch on PyTorch, with full explanation, training and inference.

Training a new tokenizer

Internationalization for RAG

Building a Compiler - Building a Long Overdue Tokenizer

But what is a GPT? Visual intro to transformers | Chapter 5, Deep Learning

Train your own language model with nanoGPT | Let’s build a songwriter

GPT from Scratch w/ MLX - Day 1 - IDE Setup & Tokenizer

The GPT Tokenizer: Byte Pair Encoding

Pre School 2024 Day 8 | Let's build the GPT Tokenizer

Build a Custom Transformer Tokenizer - Transformers From Scratch #2

Getting Started With Hugging Face in 15 Minutes | Transformers, Pipeline, Tokenizer, Models

NLP Made Easy: ChatGPT Tokenizer The Building Blocks Of Natural Language Processing

Building LLMs from the Ground Up: A 3-hour Coding Workshop

Understanding ChatGPT/OpenAI Tokens

How to use ChatGPT tokenizer in Python | OpenAI tokenizer | tiktoken

How ChatGPT Works Technically | ChatGPT Architecture

Sentence Tokenization in Transformer Code from scratch!

LangChain Data Loaders, Tokenizers, Chunking, and Datasets - Data Prep 101

Transformers, explained: Understand the model behind GPT, BERT, and T5