Rotary Positional Embeddings: Combining Absolute and Relative

In this video, I explain RoPE - Rotary Positional Embeddings. Proposed in the 2021 RoFormer paper, this technique has swiftly made its way into prominent language models like Google's PaLM and Meta's LLaMA. I unpack the magic behind rotary embeddings and reveal how they combine the strengths of both absolute and relative positional encodings.

0:00 - Introduction
1:22 - Absolute positional embeddings
3:19 - Relative positional embeddings
5:51 - Rotary positional embeddings
7:56 - Matrix formulation
9:31 - Implementation
10:38 - Experiments and conclusion
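
For readers skimming before watching: below is a minimal NumPy sketch of the core trick under the standard RoPE formulation from the RoFormer paper (illustrative only, not code from the video). Each pair of feature dimensions is rotated by an angle proportional to the token's position, and the query-key dot product then depends only on the relative offset between the two tokens.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive feature pairs of x by angles pos * theta_i."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)        # one frequency per 2-D pair
    cos, sin = np.cos(pos * theta), np.sin(pos * theta)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin        # rotate (x1, x2), (x3, x4), ...
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# The score <rope(q, m), rope(k, n)> depends only on the offset n - m:
s1 = rope(q, 3) @ rope(k, 7)      # offset 4
s2 = rope(q, 10) @ rope(k, 14)    # offset 4 again, different absolute positions
print(np.isclose(s1, s2))         # True
```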

Comments

This is the clearest explanation of RoPE embeddings.

jeonghwankim

I've watched a few videos trying to wrap my head around this concept and yours is by far the best. Thanks!

theunconventionalenglishman

Thanks for creating and sharing this vid! Still confused on the math stuff though. So I read through the paper and wrote down some notes:

The rotation matrix R_m rotates a query vector q of the m-th token by mθ, while R_n rotates a key vector k of the n-th token by nθ. For any rotation (orthogonal) matrix R, R^T = R^-1 holds, so R_m^T is R_m's inverse: it rotates in the opposite direction, by -mθ. This means (R_m q)^T (R_n k) = q^T R_m^T R_n k = q^T R_(n-m) k, i.e. the score is q^T k rotated by (n-m)θ in total. This ties the interaction between the m-th query and the n-th key to their relative distance n - m, naturally and interpretably.

cmbbqrpb
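
The identity used in the comment above is easy to check numerically. A small sketch (not code from the paper) verifying R_m^T R_n = R_(n-m) for 2-D rotations:

```python
import numpy as np

def R(angle):
    """2x2 rotation matrix."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]])

theta, m, n = 0.3, 5, 9
lhs = R(m * theta).T @ R(n * theta)
rhs = R((n - m) * theta)
print(np.allclose(lhs, rhs))  # True, so q^T R_m^T R_n k depends only on n - m
```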

This is amazing, thank you!
I just wrapped my mind around sinusoidal embeddings, then came across RoPE and was really struggling to grasp it.
Definitely going to refer back to this video.
I love in-depth NLP content like this.

ItsRyanStudios

Thank you for such an intuitive explanation of a pretty complex paper.

kindness_mushroom

Finally! This was my 4th video on the topic and I was lost, but this one did the trick!

laurentiupetrea

Thank you so much. Your explanation is very clear and succinct.

hw

Amazing video, intuitive explanations with examples.

MrOnlineCoder

*Video Summary: Rotary Positional Embeddings: Combining Absolute and Relative*

- *Introduction*
- Discusses the importance of positional embeddings in Transformer models.

- *Absolute Positional Embeddings*
- Explains how absolute positional embeddings work.
- Highlights limitations like fixed sequence length and lack of relative context.

- *Relative Positional Embeddings*
- Introduces the concept of relative positional embeddings.
- Discusses the computational challenges and inefficiencies.

- *Rotary Positional Embeddings (RoPE)*
- Combines the advantages of both absolute and relative embeddings.
- Uses rotation to encode position, preserving relative distances.

- *Matrix Formulation*
- Explains the mathematical formulation behind RoPE.

- *Implementation*
- Shows how RoPE can be implemented efficiently in PyTorch.

- *Experiments and Conclusion*
- Shares results of experiments showing RoPE's effectiveness and efficiency compared to other methods.

The video provides a comprehensive overview of Rotary Positional Embeddings, a new method that combines the strengths of both absolute and relative positional embeddings. It delves into the mathematical details and practical implementation, concluding with experimental results that validate its effectiveness.

wolpumba
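
The Implementation point in the summary above mentions an efficient PyTorch implementation. Here is a hedged sketch of the widely used "rotate half" formulation (assuming torch; the helper names are mine and this is not necessarily the exact code shown in the video):

```python
import torch

def rotate_half(x):
    """Return (-x2, x1) where x = (x1, x2) along the last dimension."""
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope(x, positions, base=10000.0):
    """Apply rotary embeddings to x of shape (seq_len, head_dim)."""
    head_dim = x.shape[-1]
    inv_freq = base ** (-torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim)
    angles = positions.to(torch.float32)[:, None] * inv_freq[None, :]  # (seq_len, head_dim // 2)
    cos = torch.cat((angles.cos(), angles.cos()), dim=-1)              # (seq_len, head_dim)
    sin = torch.cat((angles.sin(), angles.sin()), dim=-1)
    return x * cos + rotate_half(x) * sin                              # rotates (x_i, x_{i + d/2}) pairs

q = torch.randn(16, 64)                  # 16 positions, head_dim = 64
q_rot = apply_rope(q, torch.arange(16))  # same shape; position is encoded purely by rotation
```

Note that no extra embedding vectors are added and no per-pair position term enters the attention computation, which is why this stays cheap compared to relative positional embeddings.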

Great explanation. Thank you for making this.

sammcj

You make it easy to learn even for a high school student

vixguy

Good work, I look 'forward' to the ReRoPE video. 😎

marshallmcluhan

Your explanation is amazing. Thank you for your work.

roomotime

Absolutely amazing explanation! Keep it up man

weekendwarrior

Nice video. Thanks for this. I could be wrong but one potential error I see:

In this video, you said that "You can't do KV cache because you change the embeddings with every token you add." I don't think this is necessarily true, at least not for decoder architectures like GPTs. The previous tokens don't attend to the new tokens -- they only attend to tokens to their left (there's a causal mask). When you add a new token, the relative positions between the previous tokens don't change. For example, if you add a 6th token to a sequence, the distance between token 1 and token 4 hasn't changed at all; therefore, the KV cache is still valid.

It seems to me that yes, relative positional embedding is inefficient, but not because it invalidates the KV cache; rather, it's because every time we add a new token, it needs to attend to all previous tokens twice: once for the regular attention calculation and once for the relative positional embedding.

garylai

Thanks for the in-depth explanation of RoPE. A couple of questions:

1. How is the KV cache used/built in the RoPE case? RoPE is applied to q and k. Does this change anything in how K and V are cached?
2. Where can I find the intuition behind why RoPE works? I usually find it hard to jump into the mathematical equations directly to find the proof.

SahilDua
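
Regarding the KV-cache questions in the two comments above: here is a rough sketch of how RoPE typically interacts with a KV cache in decoder-only models, reusing the hypothetical apply_rope helper from the earlier sketch (this reflects common practice in open-source implementations, not something taken from the video):

```python
import torch

cache_k, cache_v = [], []   # grows by one entry per generated token

def decode_step(q_t, k_t, v_t, t):
    """One decoding step for the token at position t (single head, no batching)."""
    pos = torch.tensor([t])
    # The new key is rotated once, by its own absolute position, then cached.
    # Entries already in the cache are never recomputed; values are not rotated.
    cache_k.append(apply_rope(k_t[None, :], pos)[0])
    cache_v.append(v_t)
    # The new query is rotated by the same position t; its score against a key
    # cached at position s then depends only on the offset t - s.
    q_rot = apply_rope(q_t[None, :], pos)[0]
    K, V = torch.stack(cache_k), torch.stack(cache_v)
    scores = (K @ q_rot) / K.shape[-1] ** 0.5
    return torch.softmax(scores, dim=0) @ V
```

Because every cached key was already rotated by its own absolute position, adding a new token never touches older cache entries, yet the attention scores still depend only on relative offsets.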

Thanks for the crisp explanation. But I'm curious to know the source of the information at 7:36; I couldn't find it in the paper. Can you share the source for more information?

manikantabandla

Gemini: The video is about a new method for positional embeddings in transformers called rotary positional embeddings.
The Transformer architecture is a neural network architecture commonly used for various natural language processing tasks. A key challenge for Transformer models is that they are invariant to the order of words by default. This means that the model would not be able to distinguish between a sentence and its scrambled version. To address this challenge, positional embeddings are added to the Transformer model. There are two main types of positional embeddings: absolute positional embeddings and relative positional embeddings.
Absolute positional embeddings assign a unique vector to each position in a sentence. This approach, however, cannot handle sequences longer than those seen during training. Relative positional embeddings, on the other hand, represent the relationship between two words. While this method can handle sequences of any length, it requires additional computations in the self-attention layer, making it less efficient.
Rotary positional embeddings address the limitations of both absolute and relative positional embeddings. The core idea is to rotate the word vector instead of adding a separate positional embedding vector. The amount of rotation is determined by the position of the word in the sentence. This way, rotary positional embeddings capture the absolute position of a word while also preserving the relative positions between words.
The video also mentions that rotary positional embeddings have been shown to improve the training speed of language models.

gemini_
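
The summary above says the amount of rotation is determined by the word's position. Concretely, the RoFormer paper uses one frequency per 2-D pair, theta_i = 10000^(-2(i-1)/d), so early pairs rotate quickly with position and later pairs rotate slowly. A tiny sketch of those angles:

```python
import numpy as np

d = 8
theta = 10000.0 ** (-np.arange(0, d, 2) / d)   # [1.0, 0.1, 0.01, 0.001] for d = 8
for position in (1, 2, 100):
    print(position, position * theta)          # rotation angle of each pair at this position
```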

Thanks @Bai for the great explanation.
I still have a question: mathematically, why do the positional embeddings of the other techniques (absolute, maybe?) change when more tokens are added to the sentence? This is discussed around the 7:00 mark of the video.
Thanks!

abdelrahmanhammad

Thanks for a great explanation! One thing I was curious about: from the initial explanation and the rotation equations, it looks like consecutive pairs of coordinates are rotated, i.e. (x_1, x_2), (x_3, x_4), ... are each rotated together. However, in most implementations, including the one suggested in the video, the code pairs coordinates not by adjacent indices but with an offset of half the dimension, i.e. (x_1, x_(d/2+1)), (x_2, x_(d/2+2)), ..., since the code splits the hidden dimension in half and swaps the two halves. Did I understand correctly, or am I missing something?

naubull
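
Regarding the pairing question in the last comment: both conventions implement the same rotation, just on different pairings of the coordinates, so they agree up to a fixed permutation of the feature dimensions (which the learned query/key projections can absorb). A small NumPy sketch (not the video's code) comparing the two:

```python
import numpy as np

def rope_interleaved(x, pos, base=10000.0):
    """Rotate adjacent pairs (x1, x2), (x3, x4), ..."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)
    c, s = np.cos(pos * theta), np.sin(pos * theta)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * c - x[1::2] * s
    out[1::2] = x[0::2] * s + x[1::2] * c
    return out

def rope_split_half(x, pos, base=10000.0):
    """Rotate pairs (x1, x_(d/2+1)), (x2, x_(d/2+2)), ... (the "rotate half" form)."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)
    c, s = np.cos(pos * theta), np.sin(pos * theta)
    x1, x2 = x[: d // 2], x[d // 2:]
    return np.concatenate([x1 * c - x2 * s, x1 * s + x2 * c])

# Re-ordering the dimensions so that index i sits next to index i + d/2
# makes the two versions coincide exactly:
x = np.random.default_rng(0).normal(size=8)
perm = np.arange(8).reshape(2, 4).T.reshape(-1)   # [0, 4, 1, 5, 2, 6, 3, 7]
a = rope_interleaved(x[perm], pos=5)
b = rope_split_half(x, pos=5)[perm]
print(np.allclose(a, b))  # True
```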