LLaMA explained: KV-Cache, Rotary Positional Embedding, RMS Norm, Grouped Query Attention, SwiGLU
Full explanation of the LLaMA 1 and LLaMA 2 models from Meta, including Rotary Positional Embeddings, RMS Normalization, Multi-Query Attention, KV-Cache, Grouped Multi-Query Attention (GQA), the SwiGLU activation function and more!
I also review the Transformer concepts needed to understand LLaMA, and everything is visually explained!
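The description names RMSNorm and SwiGLU among LLaMA's building blocks. As a companion, here is a minimal PyTorch sketch of those two components as described in the LLaMA papers. This is not the video's code: the class names, dimensions, and hidden size used here (RMSNorm, SwiGLUFeedForward, 512, 1376) are illustrative assumptions.

```python
# Minimal sketch (not the video's code): RMSNorm and a SwiGLU feed-forward
# block in the style of LLaMA. All dimension choices are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square normalization: rescale by 1/RMS(x), no mean-centering."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learnable gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

class SwiGLUFeedForward(nn.Module):
    """Gated feed-forward: silu(W1 x) * (W3 x), projected back down by W2."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate projection
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # value projection
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

# Example: normalize, then apply the feed-forward block on dummy activations.
x = torch.randn(2, 8, 512)                       # (batch, seq_len, dim)
y = SwiGLUFeedForward(512, 1376)(RMSNorm(512)(x))
print(y.shape)                                   # torch.Size([2, 8, 512])
```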
Chapters
00:00:00 - Introduction
00:02:20 - Transformer vs LLaMA
00:05:20 - LLaMA 1
00:06:22 - LLaMA 2
00:06:59 - Input Embeddings
00:08:52 - Normalization & RMSNorm
00:24:31 - Rotary Positional Embeddings
00:37:19 - Review of Self-Attention
00:40:22 - KV Cache
00:54:00 - Grouped Multi-Query Attention
01:04:07 - SwiGLU Activation function
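Two of the chapters above, KV Cache and Grouped Multi-Query Attention, are inference-time techniques. The following is a minimal PyTorch sketch of how they fit together during autoregressive decoding. This is not the video's code: the head counts, dimensions, and names (decode_step, n_kv_heads, and so on) are illustrative assumptions, and batching, RoPE, and the output projection are omitted for brevity.

```python
# Minimal sketch (not the video's code) of grouped-query attention with a
# KV-cache. Shapes and hyperparameters below are illustrative assumptions.
import torch
import torch.nn.functional as F

n_heads, n_kv_heads, head_dim = 8, 2, 64   # 8 query heads share 2 KV heads
dim = n_heads * head_dim

wq = torch.randn(dim, n_heads * head_dim) / dim**0.5
wk = torch.randn(dim, n_kv_heads * head_dim) / dim**0.5
wv = torch.randn(dim, n_kv_heads * head_dim) / dim**0.5

# KV-cache: keys/values of already-processed positions, grown one step at a time.
k_cache = torch.empty(0, n_kv_heads, head_dim)
v_cache = torch.empty(0, n_kv_heads, head_dim)

def decode_step(x: torch.Tensor) -> torch.Tensor:
    """One autoregressive step: x is the embedding of the newest token, shape (dim,)."""
    global k_cache, v_cache
    q = (x @ wq).view(n_heads, head_dim)
    k = (x @ wk).view(1, n_kv_heads, head_dim)
    v = (x @ wv).view(1, n_kv_heads, head_dim)
    # Append only the new K/V; older positions are reused from the cache.
    k_cache = torch.cat([k_cache, k], dim=0)          # (seq_len, n_kv_heads, head_dim)
    v_cache = torch.cat([v_cache, v], dim=0)
    # Grouped-query attention: repeat each KV head for its group of query heads.
    group = n_heads // n_kv_heads
    keys = k_cache.repeat_interleave(group, dim=1)    # (seq_len, n_heads, head_dim)
    vals = v_cache.repeat_interleave(group, dim=1)
    scores = torch.einsum("hd,shd->hs", q, keys) / head_dim**0.5
    attn = F.softmax(scores, dim=-1)
    return torch.einsum("hs,shd->hd", attn, vals).reshape(dim)

# Decode three tokens; each step attends to all cached positions so far.
for t in range(3):
    out = decode_step(torch.randn(dim))
print(out.shape, k_cache.shape)   # torch.Size([512]) torch.Size([3, 2, 64])
```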
Coding LLaMA 2 from scratch in PyTorch - KV Cache, Grouped Query Attention, Rotary PE, RMSNorm
Rotary Positional Embeddings: Combining Absolute and Relative
The KV Cache: Memory Usage in Transformers
RoPE (Rotary positional embeddings) explained: The positional workhorse of modern LLMs
Llama - EXPLAINED!
Mistral Architecture Explained From Scratch with Sliding Window Attention, KV Caching Explanation
Transformer Architecture: Fast Attention, Rotary Positional Embeddings, and Multi-Query Attention
Extending Context Window of Large Language Models via Positional Interpolation Explained
Rotary Positional Embeddings
Mistral / Mixtral Explained: Sliding Window Attention, Sparse Mixture of Experts, Rolling Buffer
Attention is all you need (Transformer) - Model explanation (including math), Inference and Training
How to code long-context LLM: LongLoRA explained on LLama 2 100K
RoFormer: Enhanced Transformer with Rotary Position Embedding Explained
Inference Yarn Llama 2 13b 128k with KV Cache to answer quiz on very long textbook
Fast LLM Serving with vLLM and PagedAttention
Key Value Cache in Large Language Models Explained
Revamped Llama.cpp with Full CUDA GPU Acceleration and KV Cache for Fast Story Generation!
How a Transformer works at inference vs training time
Efficient LLM Inference (vLLM KV Cache, Flash Decoding & Lookahead Decoding)
Llama 2: Full Breakdown
StreamingLLM Lecture
Variants of Multi-head attention: Multi-query (MQA) and Grouped-query attention (GQA)
Llama 2 Paper Explained