NLP Demystified 15: Transformers From Scratch + Pre-training and Transfer Learning With BERT/GPT

CORRECTION:
00:34:47: that should be "each a dimension of 12x4"

Transformers have revolutionized deep learning. In this module, we'll learn how they work in detail and build one from scratch. We'll then explore how to leverage state-of-the-art models for our projects through pre-training and transfer learning. We'll learn how to fine-tune models from Hugging Face and explore the capabilities of GPT from OpenAI. Along the way, we'll tackle a new task for this course: question answering.
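As a small taste of the question-answering task tackled in the demo, here is a minimal sketch using the Hugging Face pipeline API. This is not the video's demo code, and the checkpoint named below is only one plausible SQuAD-fine-tuned choice.

# Minimal sketch (not the video's demo code): extractive question answering
# with a Hugging Face pipeline. The checkpoint name is an assumption; any
# SQuAD-style fine-tuned model works.
from transformers import pipeline

qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

result = qa(
    question="What do transformers replace recurrence with?",
    context="Transformers replace recurrence with self-attention, letting every "
            "token attend directly to every other token in the sequence.",
)
print(result["answer"], result["score"])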

Timestamps
00:00:00 Transformers from scratch
00:01:05 Subword tokenization
00:04:27 Subword tokenization with byte-pair encoding (BPE)
00:06:53 The shortcomings of recurrent-based attention
00:07:55 How Self-Attention works
00:14:49 How Multi-Head Self-Attention works
00:17:52 The advantages of multi-head self-attention
00:18:20 Adding positional information
00:20:30 Adding a non-linear layer
00:22:02 Stacking encoder blocks
00:22:30 Dealing with side effects using layer normalization and skip connections
00:26:46 Input to the decoder block
00:27:11 Masked Multi-Head Self-Attention
00:29:38 The rest of the decoder block
00:30:39 [DEMO] Coding a Transformer from scratch
00:56:29 Transformer drawbacks
00:57:14 Pre-Training and Transfer Learning
00:59:36 The Transformer families
01:01:05 How BERT works
01:09:38 GPT: Language modelling at scale
01:15:13 [DEMO] Pre-training and transfer learning with Hugging Face and OpenAI
01:51:48 The Transformer is a "general-purpose differentiable computer"

This video is part of Natural Language Processing Demystified, a free, accessible course on NLP.

Comments

Why did it take so long for YouTube to show this channel when I searched transformers? The YouTube algorithm really needs to get better. This is really high-quality content, well structured and clearly explained.

wryltxw

Honestly, the best explanation ever. I'm a data scientist (5 years' experience) and I was struggling to understand in depth how transformers are trained. Came across this video and boom, problem solved. Cheers, mate. I'll propose that the whole company watch this video.

johnmakris

What was provided: A high-quality, easily digestible, and calm introduction to Transformers that could take almost anyone from zero to GPT in a single video.
What I got: It will probably take me longer than I'd like to get good at martial arts.

novantha

I spent 2 days trying to understand the paper “Attention Is All You Need”, but lots of things were implicit in the article. Thank you for making it crystal clear. This is the best video I've seen about transformers.

id-icou

God knows how many times I've banged my head against the wall just to understand it through different videos... this is the best one so far. 🙏🏻

kaustubhkapare

Thank you so much! It covered transformers and beyond at several different levels, not just coding but also fine-tuning, usage, and more. That's really helpful. Thank you!

mtlee

I want to express my sincere gratitude for your excellent teaching and guidance in this state-of-the-art NLP course. Thank you, sir.

mahmoudreda

This is really high-quality content. Why did it take so long for YouTube to recommend this?

wryltxw

I usually don’t comment on YouTube videos but couldn’t skip this. This is the BEST NLP course I’ve seen anywhere online. THANK YOU. ❤

abc

Fantastic explanation. Very detailed, slow-paced, and straightforward.

JBoya

Thank you, legend, for your exceptional teaching style!! 👏👏👏

If anyone is looking for a bit more explanation of how the Q, K, and V matrices are passed to the multi-head cross-attention layer in the decoder module:

Specifically, the key vectors are obtained by multiplying the encoder outputs with a learnable weight matrix, which transforms the encoder outputs into a matrix with a shape of (sequence_length, d_model). The value vectors are obtained by applying another learnable weight matrix to the encoder outputs, resulting in a matrix of the same shape.

The resulting key and value matrices can then be used as input to the multi-head cross-attention layer in the decoder module. The query vectors, which come from the previous layer in the decoder, are also transformed using another learnable weight matrix to ensure compatibility with the key and value matrices.

The attention mechanism then computes attention scores between the query vectors and the key vectors, and these scores are normalized into attention weights. The attention weights are used to compute a weighted sum of the value vectors, which is then used as input to the subsequent layers in the decoder.

In summary, the key and value vectors are obtained by applying learnable weight matrices to the encoder outputs, and are used in the multi-head cross-attention mechanism of the decoder to compute attention scores and generate the output sequence.
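To make that flow concrete, here is a minimal single-head sketch in NumPy with random stand-in weights; the shapes and names below are assumptions for illustration, not the video's implementation.

# Illustrative sketch of the cross-attention flow described above
# (single head, NumPy, random weights). Shapes and names are assumptions.
import numpy as np

d_model = 8
src_len, tgt_len = 5, 3                          # encoder / decoder sequence lengths

rng = np.random.default_rng(0)
enc_out = rng.normal(size=(src_len, d_model))    # encoder outputs
dec_in = rng.normal(size=(tgt_len, d_model))     # decoder states from the previous sublayer

# Learnable projection matrices (random stand-ins here)
W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

Q = dec_in @ W_q                                 # queries come from the decoder
K = enc_out @ W_k                                # keys come from the encoder outputs
V = enc_out @ W_v                                # values come from the encoder outputs

scores = Q @ K.T / np.sqrt(d_model)              # (tgt_len, src_len) attention scores
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # softmax over source positions

context = weights @ V                            # weighted sum of values -> next sublayer
print(context.shape)                             # (tgt_len, d_model)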

RajkumarDarbar

WOW, your level of comprehension and presentation of your subject is the best I've ever seen. You are the best. Thank you very much ❤❤❤

mazenlahham

Definitely my go-to video for understanding how Transformers work and for referring anyone to! Thanks, Nitin!

anrichvanderwalt

This is amazing; I can't thank you enough. I only wish this had been around sooner. Keep up the great work!

AIShipped

Just completed the entire playlist. It was an absolute delight to watch; this last lecture was a favorite of mine because you explained it in the form of a story. Thank you so much for sharing this knowledge with us, and I hope to learn more from you :D

weeb

Thank you. I had problems visualising this concept before watching the video because not many explanations or reasons were given for why things were done the way they were.

mage

Perfect video for understanding Transformers. It's just perfect!!! 👌👌👌👌👌👏👏👏👏👏

priyalgeorge

This is a remarkable piece of work. Beyond excellent!

srinathkumar