(ViT) An Image Is Worth 16x16 Words | Paper Explained

New video about the Vision Transformer (ViT) on my channel. As a more flexible architecture, Transformers completely overtook the NLP field, but because of the quadratic cost of the attention mechanism, their application to Computer Vision remained limited. "An Image is Worth 16x16 Words" is the first paper to successfully apply Transformers and beat the previous state-of-the-art results on the image classification task. So in this video, I will explain concepts like:
- the motivation behind transformers in Computer Vision
- how the ViT works
- results they achieved
- the effect of model and dataset size on the performance of ViT
- the future directions and limitations of this model
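To give a feel for the "16x16 words" idea before the code video: ViT cuts an image into fixed-size patches and flattens each one into a vector, which then plays the role of a word token. Below is a minimal NumPy sketch of just that patching step (the function name `image_to_patch_tokens` is mine, and this omits the learned linear projection, class token, and position embeddings from the actual model):

```python
import numpy as np

def image_to_patch_tokens(image, patch_size=16):
    """Split an (H, W, C) image into flattened patch vectors.

    Each patch_size x patch_size patch becomes one "word" (token),
    so a 224x224x3 image yields (224/16)^2 = 196 tokens, each of
    length 16*16*3 = 768.
    """
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    n_h, n_w = h // patch_size, w // patch_size
    # Carve the image into a grid of patches, then flatten each patch.
    patches = image.reshape(n_h, patch_size, n_w, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)  # (n_h, n_w, p, p, c)
    return patches.reshape(n_h * n_w, patch_size * patch_size * c)

tokens = image_to_patch_tokens(np.zeros((224, 224, 3)))
print(tokens.shape)  # (196, 768)
```

In the real model these token vectors are linearly projected and fed to a standard Transformer encoder, exactly like word embeddings in NLP.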

In the next video, I will show the PyTorch code with pre-trained weights, so stay tuned for that!

▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬

Paper:

▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬

Medium article about Weight Standardisation and Group Normalization:

▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬

Connect with me on:

▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬

Timestamps:
0:00 Introduction
1:20 Transformer Pros/Cons
2:34 CNNs Pros/Cons
3:36 Related work
4:49 How ViT works
6:24 Training process
7:48 Results
9:21 Insights from the results
12:20 Conclusions
Comments:

Great video.
Have you heard about "Simplifying Transformer Blocks"? They claim to achieve the same performance with 15% fewer parameters. It would be cool if you covered that.

Asmonix