LLaMA explained: KV-Cache, Rotary Positional Embedding, RMS Norm, Grouped Query Attention, SwiGLU
Full explanation of the LLaMA 1 and LLaMA 2 models from Meta, including Rotary Positional Embeddings, RMS Normalization, Multi-Query Attention, KV-Cache, Grouped Multi-Query Attention (GQA), the SwiGLU activation function and more!
I also review the Transformer concepts needed to understand LLaMA, and everything is visually explained!
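The description names RMSNorm and SwiGLU among LLaMA's building blocks. As a companion, here is a minimal PyTorch sketch of those two components as described in the LLaMA papers. This is not the video's code: the class names, dimensions, and hidden size used here (RMSNorm, SwiGLUFeedForward, 512, 1376) are illustrative assumptions.

```python
# Minimal sketch (not the video's code): RMSNorm and a SwiGLU feed-forward
# block in the style of LLaMA. All dimension choices are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square normalization: rescale by 1/RMS(x), no mean-centering."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learnable gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

class SwiGLUFeedForward(nn.Module):
    """Gated feed-forward: silu(W1 x) * (W3 x), projected back down by W2."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate projection
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # value projection
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

# Example: normalize, then apply the feed-forward block on dummy activations.
x = torch.randn(2, 8, 512)                       # (batch, seq_len, dim)
y = SwiGLUFeedForward(512, 1376)(RMSNorm(512)(x))
print(y.shape)                                   # torch.Size([2, 8, 512])
```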
Chapters
00:00:00 - Introduction
00:02:20 - Transformer vs LLaMA
00:05:20 - LLaMA 1
00:06:22 - LLaMA 2
00:06:59 - Input Embeddings
00:08:52 - Normalization & RMSNorm
00:24:31 - Rotary Positional Embeddings
00:37:19 - Review of Self-Attention
00:40:22 - KV Cache
00:54:00 - Grouped Multi-Query Attention
01:04:07 - SwiGLU Activation function
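Two of the chapters above, KV Cache and Grouped Multi-Query Attention, are inference-time techniques. The following is a minimal PyTorch sketch of how they fit together during autoregressive decoding. This is not the video's code: the head counts, dimensions, and names (decode_step, n_kv_heads, and so on) are illustrative assumptions, and batching, RoPE, and the output projection are omitted for brevity.

```python
# Minimal sketch (not the video's code) of grouped-query attention with a
# KV-cache. Shapes and hyperparameters below are illustrative assumptions.
import torch
import torch.nn.functional as F

n_heads, n_kv_heads, head_dim = 8, 2, 64   # 8 query heads share 2 KV heads
dim = n_heads * head_dim

wq = torch.randn(dim, n_heads * head_dim) / dim**0.5
wk = torch.randn(dim, n_kv_heads * head_dim) / dim**0.5
wv = torch.randn(dim, n_kv_heads * head_dim) / dim**0.5

# KV-cache: keys/values of already-processed positions, grown one step at a time.
k_cache = torch.empty(0, n_kv_heads, head_dim)
v_cache = torch.empty(0, n_kv_heads, head_dim)

def decode_step(x: torch.Tensor) -> torch.Tensor:
    """One autoregressive step: x is the embedding of the newest token, shape (dim,)."""
    global k_cache, v_cache
    q = (x @ wq).view(n_heads, head_dim)
    k = (x @ wk).view(1, n_kv_heads, head_dim)
    v = (x @ wv).view(1, n_kv_heads, head_dim)
    # Append only the new K/V; older positions are reused from the cache.
    k_cache = torch.cat([k_cache, k], dim=0)          # (seq_len, n_kv_heads, head_dim)
    v_cache = torch.cat([v_cache, v], dim=0)
    # Grouped-query attention: repeat each KV head for its group of query heads.
    group = n_heads // n_kv_heads
    keys = k_cache.repeat_interleave(group, dim=1)    # (seq_len, n_heads, head_dim)
    vals = v_cache.repeat_interleave(group, dim=1)
    scores = torch.einsum("hd,shd->hs", q, keys) / head_dim**0.5
    attn = F.softmax(scores, dim=-1)
    return torch.einsum("hs,shd->hd", attn, vals).reshape(dim)

# Decode three tokens; each step attends to all cached positions so far.
for t in range(3):
    out = decode_step(torch.randn(dim))
print(out.shape, k_cache.shape)   # torch.Size([512]) torch.Size([3, 2, 64])
```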
Coding LLaMA 2 from scratch in PyTorch - KV Cache, Grouped Query Attention, Rotary PE, RMSNorm
Rotary Positional Embeddings: Combining Absolute and Relative
The KV Cache: Memory Usage in Transformers
RoPE (Rotary positional embeddings) explained: The positional workhorse of modern LLMs
Llama - EXPLAINED!
Mistral Architecture Explained From Scratch with Sliding Window Attention, KV Caching Explanation
Transformer Architecture: Fast Attention, Rotary Positional Embeddings, and Multi-Query Attention
Extending Context Window of Large Language Models via Positional Interpolation Explained
Rotary Positional Embeddings
Mistral / Mixtral Explained: Sliding Window Attention, Sparse Mixture of Experts, Rolling Buffer
Attention is all you need (Transformer) - Model explanation (including math), Inference and Training
How to code long-context LLM: LongLoRA explained on LLama 2 100K
RoFormer: Enhanced Transformer with Rotary Position Embedding Explained
Inference Yarn Llama 2 13b 128k with KV Cache to answer quiz on very long textbook
Fast LLM Serving with vLLM and PagedAttention
Key Value Cache in Large Language Models Explained
Revamped Llama.cpp with Full CUDA GPU Acceleration and KV Cache for Fast Story Generation!
How a Transformer works at inference vs training time
Efficient LLM Inference (vLLM KV Cache, Flash Decoding & Lookahead Decoding)
Llama 2: Full Breakdown
StreamingLLM Lecture
Variants of Multi-head attention: Multi-query (MQA) and Grouped-query attention (GQA)
Llama 2 Paper Explained