LLaMA explained: KV-Cache, Rotary Positional Embedding, RMS Norm, Grouped Query Attention, SwiGLU

Full explanation of the LLaMA 1 and LLaMA 2 models from Meta, including Rotary Positional Embeddings, RMS Normalization, Multi-Query Attention, the KV-Cache, Grouped Multi-Query Attention (GQA), the SwiGLU activation function and more!

I also review the Transformer concepts that are needed to understand LLaMA and everything is visually explained!

Chapters
00:00:00 - Introduction
00:02:20 - Transformer vs LLaMA
00:05:20 - LLaMA 1
00:06:22 - LLaMA 2
00:06:59 - Input Embeddings
00:08:52 - Normalization & RMSNorm
00:24:31 - Rotary Positional Embeddings
00:37:19 - Review of Self-Attention
00:40:22 - KV Cache
00:54:00 - Grouped Multi-Query Attention
01:04:07 - SwiGLU Activation function
Comments

As many of you have asked: LLaMA 2's architecture is made up of the ENCODER side of the Transformer plus a Linear Layer and a Softmax. It can also be thought of as the DECODER of the Transformer, minus the Cross-Attention. Generally speaking, people call a model like LLaMA a Decoder-only model, while a model like BERT an Encoder-only model. From now on I will also stick to this terminology for my future videos.
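
For anyone who finds it easier to read in code, here is a minimal, illustrative sketch of such a block (my own simplified module, not the actual LLaMA implementation): causal self-attention plus a feed-forward network, with no cross-attention.

```python
import torch.nn as nn

class DecoderOnlyBlock(nn.Module):
    """Sketch of a Transformer decoder block with cross-attention removed:
    causal self-attention followed by a feed-forward network."""
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.attn_norm = nn.LayerNorm(dim)   # LLaMA actually uses RMSNorm here
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ffn_norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.SiLU(), nn.Linear(4 * dim, dim))

    def forward(self, x, causal_mask):
        # pre-norm residual connections, as in LLaMA
        h = self.attn_norm(x)
        x = x + self.attn(h, h, h, attn_mask=causal_mask, need_weights=False)[0]
        return x + self.ffn(self.ffn_norm(x))
```

A full model stacks N of these blocks and, as described above, adds a final normalization, a linear layer to the vocabulary size, and a softmax.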

umarjamilai

Umar, Andrew Ng, 3Blue1Brown and Andrej are all you need.

You are one of the best educators of deep learning.

Thank you.

kqb

The best hour I've spent! I had so many questions on exactly these topics, and this video does an outstanding job of explaining enough detail in an easy way!

mandarinboy

The model came out in Feb 2023! This is a YouTube channel worth subscribing to. Thanks, man.

mojtabanourani

Amazing video! Thanks for taking the time to explain core new concepts in language models.

UnknownHuman

This video could serve as the official textbook on the LLaMA architecture. Amazing.

dgl

Thank you very much for your work. The community is blessed with such high-quality presentations on difficult topics.

Paluth

Very underrated video. Thanks for providing such a good lecture to the community.

TheMzbac

Fantastic explanation of the LLaMA model. Please keep making these kinds of videos.

Jc-jvwj

I became your fan at 55:00, when you explained how GPU capability drives the development. 🙂

muthukumarannm

This video, along with the previous ones about coding up transformers from scratch, is really outstanding. Thank you so much for taking such a tremendous amount of your free time to put all of this together!

ueendwp

Really glad I found your channel. You create some of the most in-depth and easy-to-follow explanations I've been able to find.

jordanconnolly

Such an amazing step by step breakdown of concepts involved! Thank you so much.

ravimandliya

My TLDR for the video (please point out the mistakes):

- LLaMA uses RMS normalization (RMSNorm) instead of LayerNorm because it provides the same benefits with less computation (no mean subtraction). (Sketched below.)
- LLaMA uses rotary positional embeddings (RoPE). These act as a distance-based scaling of the dot-product score coming out of the queries and keys: two tokens X and Y that are close together produce a larger score than two tokens X and Y that are far apart. This makes sense, since closer tokens should have a bigger say in the final representation of a given token than the ones far away. This is not the case for the vanilla Transformer. (Sketched below.)
- LLaMA uses Grouped-Query Attention as an alternative to vanilla multi-head attention, mostly to balance GPU FLOPs against the GPU's much slower memory access. Key slide at 1:03:00. In vanilla attention, each query head has its own key and value head. In multi-query attention (MQA), there is only one key/value head shared by all query heads. In between lies GQA, where a small group of query heads (say 2-4) shares one key/value head. (Sketched below.)
- LLaMA uses the SwiGLU activation function since it empirically works better.
- LLaMA uses 3 weight matrices instead of 2 in the feed-forward part of each block, but keeps the number of parameters roughly the same by shrinking the hidden dimension. (Sketched below.)
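
A minimal sketch of the RMSNorm idea (my own illustrative code, not the official implementation): activations are rescaled by their root mean square and a learned gain, with no mean subtraction or bias as in LayerNorm.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learned gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim); normalize over the feature dimension only
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)
```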
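
For the rotary embeddings, a sketch of the commonly used "rotate-half" formulation (the official LLaMA code uses complex multiplication instead, but the idea of rotating query/key pairs by a position-dependent angle is the same; the tensor shapes here are my assumption):

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    # x: (batch, seq_len, n_heads, head_dim), head_dim must be even
    _, seq_len, _, head_dim = x.shape
    half = head_dim // 2
    # one rotation frequency per pair of dimensions: base^(-2i/head_dim)
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
    cos = angles.cos()[None, :, None, :]   # broadcast over batch and heads
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    # rotate each (x1, x2) pair by an angle that grows with the position
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```

It is applied to both queries and keys before the dot product, so the attention score ends up depending on the relative distance between the two positions.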
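
And a sketch of the grouped-query attention step, where each key/value head is shared by a group of query heads (n_rep = 1 gives vanilla multi-head attention, n_rep = n_q_heads gives MQA); the shapes and the repeat-based implementation are my assumptions:

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v, n_rep: int):
    # q: (batch, n_q_heads, seq_len, head_dim)
    # k, v: (batch, n_kv_heads, cache_len, head_dim), with n_q_heads = n_kv_heads * n_rep
    k = k.repeat_interleave(n_rep, dim=1)   # share each K/V head across its query group
    v = v.repeat_interleave(n_rep, dim=1)
    scores = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v
```

Fewer K/V heads also means a smaller KV-cache to read from memory at every decoding step, which is the memory-bandwidth argument from the video.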
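
Finally, a sketch of the SwiGLU feed-forward block with its three weight matrices (the names w1/w2/w3 follow the common convention; hidden_dim is typically chosen around 2/3 * 4 * dim so the parameter count stays comparable to the usual two-matrix FFN):

```python
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gated branch
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # linear branch
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # projection back to dim

    def forward(self, x):
        # SwiGLU: silu(w1(x)) multiplied elementwise by w3(x), then projected by w2
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```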

siqb

Thanks for giving your free time and offering this valuable tutorial 👏👏👏
Hope you keep doing this. Thanks again!

tubercn

Thank you very much! I once read the paper, but I think watching your video provided me with more insights about this paper than reading it many more times would have.

Bestin

This intro to LLaMA is awesome ❤. Thank you for making such a great video.

cobaltl

The best machine learning videos I've ever watched. Thanks Umar!

librakevin

Amazing. Just wow. I cannot find this stuff anywhere else on the whole internet.

Engrbilal

Fantastic video! Your explanations are very clear, thank you!

nfoyzwn