Reading Swin Transformer source code - Image Recognition with Transformers

This video goes through the source code of the PyTorch "torchvision" implementation of the Swin image recognition model.
This is not the original implementation from the paper, but rather a torchvision reimplementation that follows the original as closely as possible and achieves the same results.
Important links:

00:00 - Intro
01:44 - Model Lineage and Versions
04:13 - Data Loading and Augmentations
11:51 - Overall Model Structure
25:32 - Stochastic Depth
32:56 - Shifted Window Attention
50:56 - Patch Merging Block
54:08 - Next Up
Comments

I spent a week trying to understand the underlying implementation of how Swin Transformers work. I have learned so much from you. Thanks so much.

ahmedbahgat

Nice! The roll and masking operations are basically the most important ones in Swin - very useful concepts, given how hard it is to map the rolled windows back to each other. It can be very confusing, especially keeping track of the outer windows, which partly end up in completely different regions of the image after the roll.

It would also be cool if you could use the pre-trained weights so you can actually show the meaning of intermediate and final model outputs (like attention heatmaps or class probabilities) - this sometimes helps to capture a module's functionality🙂

davidro
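The roll the comment above describes can be seen on a toy tensor. A minimal sketch (plain `torch.roll`, not the torchvision module itself): the cyclic shift moves the feature map so the new windows straddle the old window boundaries, and the wrapped border rows/columns are exactly why Swin needs an attention mask.

```python
import torch

x = torch.arange(16).reshape(1, 4, 4)  # toy 4x4 one-channel "feature map"
shift = 1                              # Swin typically uses window_size // 2

# Cyclic shift: rows/cols at the top-left wrap around to the bottom-right.
shifted = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))
print(shifted[0])
# tensor([[ 5,  6,  7,  4],
#         [ 9, 10, 11,  8],
#         [13, 14, 15, 12],
#         [ 1,  2,  3,  0]])

# The wrapped pixels (original row 0 / column 0) now share windows with pixels
# they are not spatially adjacent to, so those attention pairs get masked out.
# The reverse roll restores the original layout exactly:
restored = torch.roll(shifted, shifts=(shift, shift), dims=(1, 2))
assert torch.equal(restored, x)
```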

Awesome! Learned about stochastic depth from the video.

vslaykovsky
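For readers who, like the commenter, are meeting stochastic depth here for the first time: a minimal sketch of the idea (a hand-rolled `drop_path` helper for illustration; torchvision ships its own `torchvision.ops.StochasticDepth`). During training, each residual branch is dropped for a whole sample with probability `p` and rescaled otherwise; at eval time it is a no-op.

```python
import torch

def drop_path(x: torch.Tensor, p: float, training: bool) -> torch.Tensor:
    """Stochastic depth in "row" mode: one Bernoulli draw per sample."""
    if not training or p == 0.0:
        return x
    keep = 1.0 - p
    # Mask shape (N, 1, 1, ...): the draw broadcasts over all non-batch dims,
    # so a sample's entire branch output is either kept (scaled) or zeroed.
    mask_shape = (x.shape[0],) + (1,) * (x.ndim - 1)
    mask = torch.empty(mask_shape, device=x.device).bernoulli_(keep)
    return x * mask / keep

x = torch.ones(4, 3)
print(drop_path(x, p=0.5, training=False))  # eval: returned unchanged
```

Scaling by `1/keep` keeps the expected value of the branch output the same with and without dropping, so no correction is needed at inference time.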

my bro, my man. The series keeps being on fire 🔥

anhduy

Hi bro, can you please explain the paper "MaxViT: Multi-Axis Vision Transformer" and its code? Thank you in advance.

akramsalim