NLP Demystified 15: Transformers From Scratch + Pre-training and Transfer Learning With BERT/GPT

CORRECTION:
00:34:47: that should be "each a dimension of 12x4"

Transformers have revolutionized deep learning. In this module, we'll learn how they work in detail and build one from scratch. We'll then explore how to leverage state-of-the-art models for our projects through pre-training and transfer learning. We'll learn how to fine-tune models from Hugging Face and explore the capabilities of GPT from OpenAI. Along the way, we'll tackle a new task for this course: question answering.
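As a small taste of the question-answering task tackled in the demo, here is a minimal sketch using the Hugging Face pipeline API. This is not the video's demo code, and the checkpoint named below is only one plausible SQuAD-fine-tuned choice.

# Minimal sketch (not the video's demo code): extractive question answering
# with a Hugging Face pipeline. The checkpoint name is an assumption; any
# SQuAD-style fine-tuned model works.
from transformers import pipeline

qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

result = qa(
    question="What do transformers replace recurrence with?",
    context="Transformers replace recurrence with self-attention, letting every "
            "token attend directly to every other token in the sequence.",
)
print(result["answer"], result["score"])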

Timestamps
00:00:00 Transformers from scratch
00:01:05 Subword tokenization
00:04:27 Subword tokenization with byte-pair encoding (BPE)
00:06:53 The shortcomings of recurrent-based attention
00:07:55 How Self-Attention works
00:14:49 How Multi-Head Self-Attention works
00:17:52 The advantages of multi-head self-attention
00:18:20 Adding positional information
00:20:30 Adding a non-linear layer
00:22:02 Stacking encoder blocks
00:22:30 Dealing with side effects using layer normalization and skip connections
00:26:46 Input to the decoder block
00:27:11 Masked Multi-Head Self-Attention
00:29:38 The rest of the decoder block
00:30:39 [DEMO] Coding a Transformer from scratch
00:56:29 Transformer drawbacks
00:57:14 Pre-Training and Transfer Learning
00:59:36 The Transformer families
01:01:05 How BERT works
01:09:38 GPT: Language modelling at scale
01:15:13 [DEMO] Pre-training and transfer learning with Hugging Face and OpenAI
01:51:48 The Transformer is a "general-purpose differentiable computer"

This video is part of Natural Language Processing Demystified, a free, accessible course on NLP.

Comments

Why did it take so long for YouTube to show this channel when I searched transformers? The YouTube algorithm really needs to get better. This is really high-quality content, well structured and clearly explained.

wryltxw

Honestly, the best explanation ever. I'm a data scientist (5 years' experience) and I was struggling to understand in depth how transformers are trained. Came across this video and boom, problem solved. Cheers, mate. I'll propose that the whole company watch this video.

johnmakris

What was provided: A high-quality, easily digestible, and calm introduction to Transformers that could take almost anyone from zero to GPT in a single video.
What I got: It will probably take me longer than I'd like to get good at martial arts.

novantha

I spent 2 days trying to understand the paper “Attention Is All You Need”, but lots of things were implicit in the article. Thank you for making it crystal clear. This is the best video I've seen about transformers.

id-icou

God knows how many times I've banged my head against the wall just to understand it through different videos... this is the best one so far. 🙏🏻

kaustubhkapare

Thank you so much! It covered transformers and beyond at several different levels, not just coding but also fine-tuning, usage, and more. That's really helpful. Thank you!

mtlee

I want to express my sincere gratitude for your excellent teaching and guidance in this state-of-the-art NLP course. Thank you, sir.

mahmoudreda

This is really high-quality content. Why did it take so long for YouTube to recommend this?

wryltxw

I usually don’t comment on YouTube videos but couldn’t skip this. This is the BEST NLP course I’ve seen anywhere online. THANK YOU. ❤

abc

Fantastic explanation. Very detailed, slow-paced, and straightforward.

JBoya

Thank you, legend, for your exceptional teaching style!! 👏👏👏

If anyone is looking for a bit more explanation of how the Q, K, and V matrices are passed to the multi-head cross-attention layer in the decoder module:

Specifically, the key vectors are obtained by multiplying the encoder outputs with a learnable weight matrix, which transforms the encoder outputs into a matrix with a shape of (sequence_length, d_model). The value vectors are obtained by applying another learnable weight matrix to the encoder outputs, resulting in a matrix of the same shape.

The resulting key and value matrices can then be used as input to the multi-head cross-attention layer in the decoder module. The query vectors, which come from the previous layer in the decoder, are also transformed using another learnable weight matrix to ensure compatibility with the key and value matrices.

The attention mechanism then computes attention scores between the query vectors and the key vectors, and these scores are normalized into attention weights. The attention weights are used to compute a weighted sum of the value vectors, which is then used as input to the subsequent layers in the decoder.

In summary, the key and value vectors are obtained by applying learnable weight matrices to the encoder outputs, and are used in the multi-head cross-attention mechanism of the decoder to compute attention scores and generate the output sequence.
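To make that flow concrete, here is a minimal single-head sketch in NumPy with random stand-in weights; the shapes and names below are assumptions for illustration, not the video's implementation.

# Illustrative sketch of the cross-attention flow described above
# (single head, NumPy, random weights). Shapes and names are assumptions.
import numpy as np

d_model = 8
src_len, tgt_len = 5, 3                          # encoder / decoder sequence lengths

rng = np.random.default_rng(0)
enc_out = rng.normal(size=(src_len, d_model))    # encoder outputs
dec_in = rng.normal(size=(tgt_len, d_model))     # decoder states from the previous sublayer

# Learnable projection matrices (random stand-ins here)
W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

Q = dec_in @ W_q                                 # queries come from the decoder
K = enc_out @ W_k                                # keys come from the encoder outputs
V = enc_out @ W_v                                # values come from the encoder outputs

scores = Q @ K.T / np.sqrt(d_model)              # (tgt_len, src_len) attention scores
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # softmax over source positions

context = weights @ V                            # weighted sum of values -> next sublayer
print(context.shape)                             # (tgt_len, d_model)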

RajkumarDarbar

WOW, your level of comprehension and presentation of your subject is the best I've ever seen. You are the best. Thank you very much ❤❤❤

mazenlahham

Definitely my go-to video for understanding how Transformers work and for referring anyone to! Thanks, Nitin!

anrichvanderwalt

This is amazing; I can't thank you enough. I only wish this had been around sooner. Keep up the great work!

AIShipped

Just completed the entire playlist. It was an absolute delight to watch; this last lecture was a favorite of mine because you explained it in the form of a story. Thank you so much for sharing this knowledge with us, and I hope to learn more from you :D

weeb

Thank you. I had problems visualising this concept before watching the video because not many explanations or reasons were given for why things were done the way they were.

mage

Perfect video for understanding Transformers. It's just perfect!!! 👌👌👌👌👌👏👏👏👏👏

priyalgeorge

This is a remarkable piece of work. Beyond excellent!

srinathkumar