Stanford CS224N: NLP with Deep Learning | Winter 2019 | Lecture 14 – Transformers and Self-Attention


Professor Christopher Manning, Stanford University, Ashish Vaswani & Anna Huang, Google

Professor Christopher Manning
Thomas M. Siebel Professor in Machine Learning, Professor of Linguistics and of Computer Science
Director, Stanford Artificial Intelligence Laboratory (SAIL)


0:00 Introduction
2:07 Learning Representations of Variable Length Data
2:28 Recurrent Neural Networks
4:51 Convolutional Neural Networks?
14:06 Attention is Cheap!
16:05 Attention head: Who
16:26 Attention head: Did What?
16:35 Multihead Attention
17:34 Machine Translation: WMT-2014 BLEU
19:07 Frameworks
19:31 Importance of Residuals
23:26 Non-local Means
26:18 Image Transformer Layer
30:56 Raw representations in music and language
37:52 Attention: a weighted average
40:08 Closer look at relative attention
42:41 A Jazz sample from Music Transformer
44:42 Convolutions and Translational Equivariance
45:12 Relative positions Translational Equivariance
50:21 Sequential generation breaks modes.
50:32 Active Research Area

#naturallanguageprocessing #deeplearning
Comments

This is what makes Stanford great. The guy giving the lecture is the guy who actually invented the technique.

irenejenna

Oh my God! He is the father of modern AI and machine learning. He changed the world forever with the model used for ChatGPT.

ersandyu

Amazing... Vaswani delivered this lecture in 2019 and became a legend in the tech community in 2022. Given his influence, he should get much more credit. If there were a Nobel-style prize in this field, he would deserve it, given his universal impact. 💪🏅

ARATHI

This person changed the world. He is the one behind the AI revolution. ❤

oj_simpson

"inductive biases you actually care about" - what an understatement!

conscience

That's the legend standing right there, the guy who invented it!

ikichiziki

GPT's way of explaining Self-Attention - "Let's imagine you are working on a story, and you have a bunch of sentences or ideas written down on separate cards. Each card represents a word or a piece of information.

Now, to make your story more interesting, you want to see how each idea or word is related to the others. That's where self-attention comes in!

Self-attention is like having a group of friends who help you decide which parts of your story are most important and how they connect to each other.

Here's how it works: Each card has three special markers: a question marker, a key marker, and a value marker. These markers help your friends understand what each card is about.

First, your friends take one card and look at its question marker. It's like the friend is asking, "What's important about this word or idea?"

Then, they compare the question marker to the key markers on all the other cards. It's as if they are looking for similarities or connections between the different parts of the story.

When they find a match, they give that card a score based on how related it is to the question. The higher the score, the more important it is!

After scoring all the cards, your friends use the scores to decide how much attention each card should get. It's like giving each card a special weight or importance based on the connections it has with the others.

Finally, they take the value marker on each card and add up all the values, giving more weight to the cards with higher scores. This creates a new version of your story where each word or idea has been improved based on the attention it received.

This process is repeated several times, with your friends going through the cards and adjusting the attention each time. This helps your friends understand more about how the different parts of your story fit together.

So, self-attention is like having friends who pay attention to each part of your story, figure out how they are connected, and make sure everything flows smoothly. It helps make your story more interesting and helps the model understand how words and ideas relate to each other in a text."

gochaitu
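The question/key/value story in the comment above maps directly onto scaled dot-product self-attention. Here is a minimal NumPy sketch of that computation; the names, dimensions, and random weights are illustrative assumptions, not code from the lecture:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence X of shape (n, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv            # each "card" gets a question, key, and value
    scores = Q @ K.T / np.sqrt(K.shape[-1])     # compare every question to every key
    weights = softmax(scores, axis=-1)          # scores become attention weights (rows sum to 1)
    return weights @ V                          # weighted average of the values

rng = np.random.default_rng(0)
n, d = 4, 8                                     # 4 "cards", 8-dimensional representations
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8): one updated representation per card
```

Multi-head attention, listed in the timestamps above, simply runs several such attention functions in parallel with different learned projections and concatenates the results.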

I would be scared for my life if I was Ashish. Someone from the future might travel back in time and kill me.

grownupgaming

Using MIDI notes as input to the same kind of model and generating actual notes pleasing to humans is an insane capability. This is so next level.

Ashish

Here is the summary:
1:39 Looking for structure in the data set
2:10 Variable-length data
2:43 The primary workhorse to date is the RNN
(How many people know RNNs? Laughter... "If you don't know what an RNN is, how can you follow this lecture?")
3:49 Recommends Oliver Selfridge's Pandemonium
4:00 RNNs limit parallelization
5:00 Precursor to self-attention: convolutional sequence models
7:10 Comparing words and making comparisons (explaining self-attention)
8:10 Multiplicative interactions are needed
9:19 Uses of attention within the confines of RNNs
9:53 Transformer model explanation
10:43 Encoder
11:59 Encoder/decoder mechanism
12:29 Encoder self-attention explanation
14:24 The attention mechanism is quadratic; it involves two matrix multiplications
14:50 Attention is attractive when the dimension is larger than the length
15:09 Attention is faster
15:22 Convolutions
16:00 Why one attention layer is not enough (multi-head attention)
16:50 Attention layer as a feature detector
21:50 Self-attention for images
24:28 Image Transformer: replacing words with pixels
27:00 Rasterization
30:00 Next lecture: music

RARa
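The cost claims in the summary above (attention is quadratic in length; attractive when the dimension exceeds the length) can be illustrated with back-of-the-envelope FLOP counts. The formulas below are simplified assumptions that ignore constant factors and projection costs:

```python
def attention_flops(n, d):
    # Two matrix multiplications: scores = Q K^T is (n, d) x (d, n),
    # and output = weights V is (n, n) x (n, d). Each costs n * n * d.
    return 2 * n * n * d

def recurrent_flops(n, d):
    # An RNN applies a (d, d) recurrence matrix at each of n time steps.
    return n * d * d

# For a typical sentence (n = 70) and model width (d = 512), n < d,
# so the self-attention layer is cheaper than the recurrent one.
n, d = 70, 512
print(attention_flops(n, d) < recurrent_flops(n, d))  # True
```

For very long sequences (n much larger than d) the comparison flips, which is why the quadratic cost of attention becomes the bottleneck for images and raw audio.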

key points:

- **[00:04]** Introduction of two invited speakers, Ashish Vaswani and Anna Huang, who will discuss self-attention in generative models and its applications, especially in music.

- **[01:00]** Ashish Vaswani discusses self-attention, focusing on its application beyond specific models, and its role in understanding the structure and symmetries in datasets.

- **[01:58]** The talk shifts to learning representations of variable-length data, underlining the importance of representation learning in deep learning.

- **[02:26]** Discussion of recurrent neural networks (RNNs) as the traditional models for sequence data, and their limitations in terms of parallelization and capturing long-term dependencies.

- **[04:17]** Examination of the advantages of self-attention over RNNs, particularly in handling large datasets and efficiently summarizing information.

- **[05:14]** Comparison between self-attention and convolutional sequence models, highlighting the parallelization benefits and efficient handling of local dependencies in the latter.

- **[06:36]** Introduction of the idea of using attention for representation learning, leading to the development of the transformer model.

- **[07:34]** Explanation of how self-attention properties aid text generation, particularly in machine translation.

- **[08:58]** Background on previous works related to self-attention and its evolution leading to the transformer model.

- **[10:23]** Description of the transformer model architecture, emphasizing its components such as the encoder, decoder, and positional representations.

- **[12:17]** Technical breakdown of the attention mechanism and its computational advantages, including speed and simplicity.

- **[13:43]** Discussion of the efficiency of the attention mechanism and its comparative performance against RNNs and convolutions.

- **[15:02]** Analysis of how attention mechanisms can simulate convolutions, and their application in language processing, particularly in understanding hierarchical structures.

- **[17:24]** Results of applying the transformer model to machine translation, demonstrating significant improvements over previous models.

- **[19:18]** Introduction of the concept of residual connections in transformers and their role in maintaining positional information.

- **[21:12]** Exploration of self-attention for modeling repeating structures in images and music, showcasing its versatility beyond text.

- **[25:27]** Discussion of adapting self-attention for image modeling, addressing the challenges and solutions for handling large image datasets.

- **[30:43]** Transition to Anna Huang's segment on applying self-attention to music generation, explaining the methodology and underlying principles.

- **[34:53]** Demonstration of the Music Transformer's capabilities, highlighting its effectiveness in maintaining coherence over longer sequences.

- **[38:19]** Discussion of the limitations of traditional attention models and the introduction of positional sinusoids for maintaining sequence structure.

- **[40:16]** In-depth explanation of relative attention and its benefits in handling long sequences, particularly in translation and music.

- **[44:35]** Insights into the applications of relative attention to images, focusing on its ability to achieve translational equivariance, a key property in image processing.

- **[46:30]** Exploration of relative attention in graph-based problems and its connection to message passing neural networks.

- **[48:52]** Summary of the benefits of self-attention, including modeling self-similarity, translational equivariance, and applications to graphs.

- **[51:42]** Insights into the application of self-attention in transfer learning, scaling up models, and their utility in self-supervised learning and multitasking.

This summary encapsulates the key topics and insights from the lecture.

labsanta
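The "positional sinusoids" mentioned around [38:19] are the fixed position encodings from the Transformer paper: each position gets a vector of sines and cosines at geometrically spaced frequencies. A minimal NumPy sketch (the function name and test sizes are my own, hypothetical choices):

```python
import numpy as np

def sinusoidal_positions(n, d):
    """Fixed positional encodings: PE[pos, 2i] = sin(pos / 10000^(2i/d)),
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    pos = np.arange(n)[:, None]            # (n, 1) positions
    i = np.arange(0, d, 2)[None, :]        # (1, d/2) even dimension indices (2i)
    angles = pos / (10000 ** (i / d))      # geometrically spaced frequencies
    pe = np.zeros((n, d))
    pe[:, 0::2] = np.sin(angles)           # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)           # odd dimensions: cosine
    return pe

pe = sinusoidal_positions(50, 16)
print(pe.shape)  # (50, 16)
```

Relative attention, discussed at [40:16], instead attends over learned embeddings of pairwise distances rather than absolute positions, which is what lets the Music Transformer generalize repeating structure across a piece.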

Interesting how the high note of this presentation was music; little did they know that all they needed was to scale the LM more... a lot more.

But in all honesty, I don't think any of the authors of the paper believe that the "self-attention" mechanism is the only missing piece of the puzzle (AGI or whatever you call it), or that any amount of data fed to the model will supply the rest.

juanandrade

He is the true creator of ChatGPT. Leave it to the Americans to rebrand something and put a dollar sign on it... smh

laodrofotic

He is from BIT Mesra (Birla Institute of Technology Mesra), Ranchi!!! A proud alumnus of our college.

NishantSharmachannel

This is cool: the inventor of Transformers.

danilo_

Damn, the guy who created modern AI. Damn.

aimatters

He is Indian by origin. Very sad that the main author of "Attention Is All You Need" was never widely known!

GenAIWithNandakishor

I am listening to Ashish Vaswani’s fake / imitation accent. It’s amusing in a way. There are still huge gaps in how he pronounces words that belie his true (Indian) accent. Pay attention to how words are spoken here Ashish. You can’t learn without paying attention!

TrollMeister_

His explanation isn't clear, but his mind is.

omarrandoms