(ViT) An Image Is Worth 16x16 Words | Paper Explained

New video about the Vision Transformer (ViT) on my channel. As a more flexible architecture, Transformers completely overtook the NLP field, but because of the quadratic cost of the attention mechanism, their application to Computer Vision remained limited. "An Image is Worth 16x16 Words" is the first paper to successfully apply Transformers and beat the previous state-of-the-art results on the image classification task. So in this video, I will explain concepts like:
- the motivation behind transformers in Computer Vision
- how the ViT works
- results they achieved
- the effect of model and dataset size on the performance of ViT
- the future directions and limitations of this model
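To give a feel for the "16x16 words" idea before the code video: ViT cuts an image into fixed-size patches and flattens each one into a vector, which then plays the role of a word token. Below is a minimal NumPy sketch of just that patching step (the function name `image_to_patch_tokens` is mine, and this omits the learned linear projection, class token, and position embeddings from the actual model):

```python
import numpy as np

def image_to_patch_tokens(image, patch_size=16):
    """Split an (H, W, C) image into flattened patch vectors.

    Each patch_size x patch_size patch becomes one "word" (token),
    so a 224x224x3 image yields (224/16)^2 = 196 tokens, each of
    length 16*16*3 = 768.
    """
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    n_h, n_w = h // patch_size, w // patch_size
    # Carve the image into a grid of patches, then flatten each patch.
    patches = image.reshape(n_h, patch_size, n_w, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)  # (n_h, n_w, p, p, c)
    return patches.reshape(n_h * n_w, patch_size * patch_size * c)

tokens = image_to_patch_tokens(np.zeros((224, 224, 3)))
print(tokens.shape)  # (196, 768)
```

In the real model these token vectors are linearly projected and fed to a standard Transformer encoder, exactly like word embeddings in NLP.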

In the next video, I will show the PyTorch code with pre-trained weights, so stay tuned for that!

▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬

Paper:

▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬

Medium article about Weight Standardisation and Group Normalization:

▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬

Connect with me on:

▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬

Timestamps:
0:00 Introduction
1:20 Transformer Pros/Cons
2:34 CNNs Pros/Cons
3:36 Related work
4:49 How ViT works
6:24 Training process
7:48 Results
9:21 Insights from the results
12:20 Conclusions
Comments:

Great video.
Have you heard about "Simplifying Transformer Blocks"? They claim to achieve the same performance with 15% fewer parameters. It would be cool if you covered that.

Asmonix