Classify Images with a Vision Transformer (ViT): PyTorch Deep Learning Tutorial

TIMESTAMPS
00:00 Introduction
00:28 Overview of Vision Transformers
00:43 Reference to "An Image is Worth 16x16 Words" Paper
01:50 Comparison with CNNs
03:00 Explanation of Transformer Blocks
04:41 Network Implementation
05:18 Forward Pass
07:43 Model Instantiation
08:19 Training Process
08:52 Training Results
09:12 Significance of Vision Transformers
09:31 Visualization of Positional Embeddings
10:30 Future Directions and Conclusion

In this PyTorch tutorial video I introduce the Vision Transformer model! By simply splitting an image into patches, we can use an encoder-only Transformer to perform image classification!
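The patch-splitting idea described above can be sketched in a few lines of PyTorch. This is a minimal, hypothetical example (not the video's exact code, and the sizes here are illustrative): a strided convolution chops the image into non-overlapping patches and embeds each one, a learnable CLS token and positional embeddings are added, and a standard transformer encoder processes the token sequence before a linear head classifies from the CLS token.

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """A minimal Vision Transformer sketch for illustration."""

    def __init__(self, image_size=32, patch_size=8, in_channels=3,
                 embed_dim=64, num_heads=4, num_layers=2, num_classes=10):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # A conv with kernel == stride == patch_size splits the image into
        # non-overlapping patches and projects each patch to embed_dim.
        self.patch_embed = nn.Conv2d(in_channels, embed_dim,
                                     kernel_size=patch_size, stride=patch_size)
        # Learnable classification token and positional embeddings.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                           dim_feedforward=embed_dim * 4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        b = x.shape[0]
        x = self.patch_embed(x)            # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)   # (B, N, D): one token per patch
        cls = self.cls_token.expand(b, -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)
        return self.head(x[:, 0])          # classify from the CLS token

model = TinyViT()
logits = model(torch.randn(2, 3, 32, 32))  # batch of two 32x32 RGB images
print(logits.shape)                        # torch.Size([2, 10])
```

Note that with a 32x32 image and 8x8 patches we get 16 patch tokens plus the CLS token; the paper's title, "An Image is Worth 16x16 Words", refers to using 16x16-pixel patches on larger images.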

An Image is Worth 16x16 words:

Donations, Help Support this work!

The corresponding code is available here! (Section 14)

Discord Server:
Comments

Hi, where can I find the original paper this was based on? Thank you!

laodrofotic
Hey! Great tutorial! I want to train an AI to play Subway Surfers. Do you think a Vision Transformer would be the right approach? I would write a Python program to label each frame of the game with the input action I took, then feed that data into a ViT. Maybe I could incorporate some sort of frame stacking so the AI can pick up temporal information?

Or do you think an RL approach would be better? Implementation would be tough, though, since I don't see how I can use RL without recreating the whole game. I'd love your advice!

kushaagra