filmov
tv
Coding LLaMA 2 from scratch in PyTorch - KV Cache, Grouped Query Attention, Rotary PE, RMSNorm
Показать описание
Full coding of LLaMA 2 from scratch, with full explanation, including Rotary Positional Embedding, RMS Normalization, Multi-Query Attention, KV Cache, Grouped Query Attention (GQA), the SwiGLU Activation function and more!
I explain the most used inference methods: Greedy, Beam Search, Temperature Scaling, Random Sampling, Top K, Top P
I also explain the math behind the Rotary Positional Embedding, with step by step proofs.
Prerequisites:
Chapters
00:00:00 - Introduction
00:01:20 - LLaMA Architecture
00:03:14 - Embeddings
00:05:22 - Coding the Transformer
00:19:55 - Rotary Positional Embedding
01:03:50 - RMS Normalization
01:11:13 - Encoder Layer
01:16:50 - Self Attention with KV Cache
01:29:12 - Grouped Query Attention
01:34:14 - Coding the Self Attention
02:01:40 - Feed Forward Layer with SwiGLU
02:08:50 - Model weights loading
02:21:26 - Inference strategies
02:25:15 - Greedy Strategy
02:27:28 - Beam Search
02:31:13 - Temperature
02:32:52 - Random Sampling
02:34:27 - Top K
02:37:03 - Top P
02:38:59 - Coding the Inference
I explain the most used inference methods: Greedy, Beam Search, Temperature Scaling, Random Sampling, Top K, Top P
I also explain the math behind the Rotary Positional Embedding, with step by step proofs.
Prerequisites:
Chapters
00:00:00 - Introduction
00:01:20 - LLaMA Architecture
00:03:14 - Embeddings
00:05:22 - Coding the Transformer
00:19:55 - Rotary Positional Embedding
01:03:50 - RMS Normalization
01:11:13 - Encoder Layer
01:16:50 - Self Attention with KV Cache
01:29:12 - Grouped Query Attention
01:34:14 - Coding the Self Attention
02:01:40 - Feed Forward Layer with SwiGLU
02:08:50 - Model weights loading
02:21:26 - Inference strategies
02:25:15 - Greedy Strategy
02:27:28 - Beam Search
02:31:13 - Temperature
02:32:52 - Random Sampling
02:34:27 - Top K
02:37:03 - Top P
02:38:59 - Coding the Inference
Комментарии