(ViT) An Image Is Worth 16x16 Words | Paper Explained

New video about the Vision Transformer (ViT) on my channel. As a more flexible architecture, Transformers have completely overtaken the NLP field, but because of the quadratic cost of the attention mechanism, their application to Computer Vision has remained limited. "An Image is Worth 16x16 Words" is the first paper to apply Transformers to images successfully enough to beat the previous state-of-the-art results on the image classification task. In this video, I explain:
- the motivation behind Transformers in Computer Vision
- how ViT works
- the results they achieved
- the effect of model and dataset size on ViT's performance
- the future directions and limitations of this model
In the next video, I will walk through the PyTorch code with pre-trained weights, so stay tuned for that!
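To give a rough preview of the paper's core idea before the PyTorch video: an image is split into fixed-size patches, and each flattened patch becomes one "word" (token) for the Transformer. Here is a minimal sketch of that patch-splitting step, using NumPy (the function name and the projection-free simplification are mine; the paper additionally applies a learned linear projection and adds position embeddings):

```python
import numpy as np

def image_to_patches(image, patch_size=16):
    """Split an (H, W, C) image into N flattened patches of length P*P*C."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    return (
        image.reshape(h // patch_size, patch_size, w // patch_size, patch_size, c)
             .transpose(0, 2, 1, 3, 4)  # group the two patch-grid axes together
             .reshape(-1, patch_size * patch_size * c)
    )

# A 224x224 RGB image yields a 14x14 grid = 196 patches ("words"),
# each flattened to 16*16*3 = 768 numbers -- hence "16x16 words".
img = np.zeros((224, 224, 3))
tokens = image_to_patches(img)
print(tokens.shape)  # (196, 768)
```

The resulting (196, 768) sequence is what ViT feeds to a standard Transformer encoder, after prepending a class token.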
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
Paper:
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
Medium article about Weight Standardisation and Group Normalization:
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
Connect with me on:
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
Timestamps:
0:00 Introduction
1:20 Transformer Pros/Cons
2:34 CNNs Pros/Cons
3:36 Related work
4:49 How ViT works
6:24 Training process
7:48 Results
9:21 Insights from the results
12:20 Conclusions