ChatGPT Position and Positional embeddings: Transformers & NLP 3

In natural language processing, understanding the order of words in a sentence is crucial for comprehending its meaning. This is where positional embeddings come in. Positional embeddings allow transformer models to understand the relative and absolute position of each word in a sentence, which improves the overall understanding and representation of the sentence.

In this video we look at several examples to demonstrate the importance of encoding position. One key example consists of different sentences with the same words, but with the word ‘only’ in different positions. The changed position of the word ‘only’ changes the meaning of each sentence. This illustrates how crucial the understanding of the order of words can be in comprehending the meaning of a sentence.

To encode position, we compare the use of a simple one-hot encoding with a more sophisticated, periodically varying function built from sine and cosine waves. One-hot encoding is a technique that represents the position of a word in a sentence as a unique vector; however, it does not take into account the relative positions of words with respect to each other. This is where the periodically varying function comes in: it uses trigonometric functions to encode the relative positions of words, greatly improving the overall representation of the sentence.
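For reference, the periodic encoding described in "Attention Is All You Need" is PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). Below is a minimal NumPy sketch of that formula, as an illustration only and not the exact code shown in the video:

import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) matrix of sine/cosine positional encodings."""
    positions = np.arange(seq_len)[:, np.newaxis]           # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]          # (1, d_model // 2)
    angle_rates = 1.0 / np.power(10000.0, dims / d_model)   # one frequency per pair of dimensions
    angles = positions * angle_rates                         # (seq_len, d_model // 2)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions use cosine
    return pe

# Each row is the encoding for one position; it is added element-wise to the word embedding.
print(sinusoidal_positional_encoding(seq_len=6, d_model=8).round(3))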

The use of trigonometric functions in positional embeddings is also similar to the way humans represent time. We use a vector to represent time, and the elements in that vector are periodic. Just as hours, minutes, and seconds repeat at different frequencies, so do the components of a positional encoding. This allows the transformer to understand the relative and absolute position of each word in the sentence, in much the same way as we understand the relative and absolute position of a moment in time.
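To make the analogy concrete, here is a small hypothetical sketch (not from the video) that encodes a time of day as (sin, cos) pairs at three different frequencies, one each for hours, minutes, and seconds, in the same spirit as the positional encoding above:

import numpy as np

def clock_encoding(hour: int, minute: int, second: int) -> np.ndarray:
    """Encode a time of day as periodic (sin, cos) pairs, one frequency per unit."""
    features = []
    for value, period in [(hour, 24), (minute, 60), (second, 60)]:
        angle = 2 * np.pi * value / period
        features += [np.sin(angle), np.cos(angle)]   # each unit wraps around smoothly
    return np.array(features)

# 23:59:59 is encoded close to 00:00:00, just as nearby positions in a sentence
# receive similar encodings.
print(clock_encoding(23, 59, 59).round(3))
print(clock_encoding(0, 0, 0).round(3))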

In conclusion, positional embeddings play a crucial role in natural language processing. They allow transformer models to understand the order of words in a sentence, improving overall understanding and representation of text & language.

=========================================================================
Link to introductory series on Neural networks:

Link to intro video on 'Backpropagation':

=========================================================================
Transformers are a type of artificial intelligence (AI) model used for natural language processing (NLP) tasks, such as translation and summarisation. They were introduced in 2017 by Google researchers seeking to address the limitations of recurrent neural networks (RNNs), which had traditionally been used for NLP tasks. RNNs were difficult to parallelize and tended to suffer from the vanishing/exploding gradient problem, making them hard to train on long input sequences.

Transformers address these limitations by using self-attention, a mechanism that allows the model to selectively choose which parts of the input to pay attention to. This makes the model much easier to parallelize and largely avoids the vanishing/exploding gradient problem.

Self-attention works by weighting the importance of different parts of the input, allowing the model to focus on the most relevant information and better handle input sequences of varying lengths. This is accomplished through three matrices: Query (Q), Key (K) and Value (V). The Query matrix can be interpreted as representing the words for which attention is being calculated, while the Key matrix can be interpreted as representing the words to which attention is paid. The scaled dot product of the Query and Key matrices, passed through a softmax, gives the attention scores, which are then used to weight the Value matrix.
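As a sketch of that mechanism, the standard scaled dot-product attention from the Transformer paper can be written in a few lines of NumPy. The random projection matrices below are placeholders for learned weights, not values from the video:

import numpy as np

def scaled_dot_product_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # how much each query attends to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)           # softmax over the keys
    return weights @ V                                        # weighted sum of the values

# Toy example: 4 tokens with 8-dimensional embeddings; random matrices stand in for learned projections.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv).shape)  # (4, 8)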

=========================================================================

#ai #artificialintelligence #deeplearning #chatgpt #gpt3 #neuralnetworks #attention #attentionisallyouneed

Comments

You are leagues apart when it comes to explaining complex concepts! Thanks and please never stop :)

atulbhardwaj

What an awesome video. Probably the best video on the internet for positional encodings. Loved every bit of it.

NithinDinesh-lh

One of the best videos on LLM I've seen so far.

jeremylee

I've probably watched 50+ transformer-related videos and this is the only one explaining positional encodings that makes sense to me.

gnorts_mr_alien

You're so good at explaining complex things. Thanks!

AzimkhanYerzhan

Honestly these videos are superb. I've seen appalling videos that, because of the stupid YouTube algorithm, have got far more views. I honestly think Google and YouTube choose to promote scams and bullshit artists rather than quality content like this. This channel will blow up, I'm sure of it; this is easily one of the best AI channels out there.

StoutProper

Best explanation of Positional embedding.

snehotoshbanerjee

very detailed explanation. Loved it!! You're an amazing teacher

syedtahsin

bravo. I definitely appreciate the craftsmanship of the graphics

jonclement

Hola, nice explanation. Waiting for the next video!!

hemanthyernagula

Thanks a lot for your video. There is one moment I didn't understand. If I understood you right, you are stating that the distance between two vectors before and after adding one-hot encoding won't change. However, if the original vectors, for example, are [1, 2, 3] and [10, 20, 30], and after encoding they change to [2, 2, 3] and [10, 20, 31], the distance will change.

VKMaxim

Great content! One question: the relative position between word 2 and word 3 might end up encoded the same as between, say, word 2 and word 50, since sine and cosine repeat. How does this add up?

ThinAirElon

Amazing content, thank you! I got a bit lost in the transitions from one-hot encoding, to the sin/cos formulas, to the graphs, and back to values in a matrix. It would be great to, for example, indicate particular values on a graph and show where they slot into the matrix. Also, it would be great to see a numerical example of how Euclidean distances between words change depending on position in the sequence.

greengraycolor

10:15 - This is not true, though.

Euclidean distance does not only depend on the lengths of the vectors added, but also on the angles between the added encoding vectors and the original embedding vectors, which won't be the same if words are swapped. That can easily be checked with a direct computation:

In the first case the distance between vectors corresponding to words "swaps" and "are" is equal to
√[(-35.65-19.66)² + (59.47+61.65)² + (35.25-34.55)² + (-21.78-88.36)² + (33.44-50.35)² ] = 173.627
while in the second case it equals
√[(-36.65-20.66)² + (60.47+62.65)² + (35.25-34.55)² + (-21.78-88.36)² + (33.44-50.35)² ] = 175.671

So with one-hot positional encoding the distances just as well depend on the positions of words in a sentence.
The reason for not using one-hot encodings for positions is actually a completely different one.

Eltaurus

Hi there. Thanks again for a lovely video. I have a question.
I have understood semantic encoding well. I read a bit about word2vec and know how it works conceptually. In a vast 300-dimension space, there are vectors representing each word. This is understood. Adding the values of the positional encoding here doesn't change the semantic relationships. This is fine and understood as well.

The question is: at what point in the training does positional encoding come into the picture? The positional encodings for the words in your example were different for different sentences. For example, the final vector representations for "are" and "swaps" were different in the different sentences... How are the positional encodings saved in the 300-dimension space for, say, word2vec? How is this positional info stored for each word so that the transformer calls upon it when needed? Are we saying that along with every word, we have smaller vectors representing its positions in the word2vec dataset?

Or is this something which is calculated only at the time of giving the transformer an input and while getting an output?

delhiboy

Intuitively, this seems like a potentially lossy/non-smooth embedding, as it seems that very different phrases might be placed very close to each other in such an embedding. Are there any scaling best practices (e.g. normalization of the semantic encoding before adding position)? Also, what if we just concatenate the semantic matrix with position (so effectively keep them separate)?

DdesideriaS

Could you please briefly explain your statement at 10:33 about why the distance between the vector representing the word 'swap' and the vector representing the word 'derivatives' will always be the same? Is the distance you're referring to the one calculated before adding the positional encoding, or after?

tipycalflow

8:43 Me: What seems strange about the following sentence? "We'll start with a simple sentence, swaps are interest rate derivatives.". ChatGPT: The sentence "We'll start with a simple sentence, swaps are interest rate derivatives." may seem strange because it is not a simple sentence, but rather a complex one. Swaps, which are interest rate derivatives, are not a simple subject to explain and require a more in-depth understanding of financial markets and financial instruments. Additionally, the sentence is structured in a way that makes it appear as though the speaker is trying to simplify the topic, but in reality, the subject matter is quite complex.

messapatingy

To the creator of this video:
Please tell me you have a 40-hour course on Udemy about AI/ML. If not, please make one.

Dxeus

Sir, bouncing animations don't look good!

varunahlawat