Vision Transformer Quick Guide - Theory and Code in (almost) 15 min

▬▬ Papers / Resources ▬▬▬

▬▬ Support me if you like 🌟

▬▬ Used Music ▬▬▬▬▬▬▬▬▬▬▬
Music from #Uppbeat (free for Creators!):
License code: SMTWRWLNGHZHH0OC

▬▬ Used Icons ▬▬▬▬▬▬▬▬▬▬

▬▬ Timestamps ▬▬▬▬▬▬▬▬▬▬▬
00:00 Introduction
00:16 ViT Intro
01:12 Input embeddings
01:50 Image patching
02:54 Einops reshaping
04:13 [CODE] Patching
05:35 CLS Token
06:40 Positional Embeddings
08:09 Transformer Encoder
08:30 Multi-head attention
08:50 [CODE] Multi-head attention
09:12 Layer Norm
09:30 [CODE] Layer Norm
09:55 Feed Forward Head
10:05 [CODE] Feed Forward Head
10:21 Residuals
10:45 [CODE] final ViT
13:10 CNN vs. ViT
14:45 ViT Variants

▬▬ My equipment 💻
Comments

I've changed the output layer a bit... to this:
self.head_ln = nn.LayerNorm(emb_dim)
self.head = nn.Linear(int((1 + self.height/self.patch_size * self.width/self.patch_size) * emb_dim), out_dim)

Then in forward:

x = x.view(x.shape[0], int((1 + self.height/self.patch_size * self.width/self.patch_size) * x.shape[-1]))
out = self.head(x)

The downside is that you'll likely get a lot more overfitting, but without it the network was not really training at all.
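
For readers, a minimal self-contained sketch of the head modification described above (the class and argument names are my own, and the token count assumes a 28x28 input with 4x4 patches):

import torch
import torch.nn as nn

class FlattenedHead(nn.Module):
    # Classify from ALL token embeddings instead of only the CLS token.
    def __init__(self, n_tokens, emb_dim, out_dim):
        super().__init__()
        self.ln = nn.LayerNorm(emb_dim)
        self.head = nn.Linear(n_tokens * emb_dim, out_dim)

    def forward(self, x):                  # x: (batch, n_tokens, emb_dim)
        x = self.ln(x)
        x = x.view(x.shape[0], -1)         # flatten tokens: (batch, n_tokens * emb_dim)
        return self.head(x)

head = FlattenedHead(n_tokens=1 + 7 * 7, emb_dim=64, out_dim=10)  # CLS + 7x7 patches
logits = head(torch.randn(8, 50, 64))      # -> (8, 10)

As the commenter notes, this head has far more parameters than a CLS-only head, which is why it overfits more easily.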

JessSightler

This is a very underrated channel. You deserve way more viewers!!

geekyprogrammer

Keep making content like this; I am sure you will get very good recognition in the future. Thanks for such amazing content.

betabias

The best part of Vision Transformers is the built-in support for interpretability, compared to CNNs, where we had to compute saliency maps.
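
Concretely, the "built-in interpretability" usually means reading out the attention weights directly; a hedged sketch of the idea (the module and sizes are illustrative, not from the video):

import torch
import torch.nn as nn

att = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
x = torch.randn(1, 1 + 8 * 8, 64)            # CLS token + an 8x8 grid of patch tokens
_, weights = att(x, x, x)                    # weights: (batch, tokens, tokens), averaged over heads
cls_attn = weights[0, 0, 1:].reshape(8, 8)   # CLS row over patches -> 8x8 heatmap to overlay on the image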

VoltVipin_VS

You're awesome, man!!! I clicked on your video so fast; you're one of my favorite AI YouTubers. I work in the field, and I think you have a wonderful ability to explain complex concepts in your videos.

hmind

Really great explanation. Nice visuals

florianhonicke

There was an error in your published code, but not in the video:
attn_output, attn_output_weights = self.att(x, x, x)
It should be:
attn_output, attn_output_weights = self.att(q, k, v)

Anyway, thanks for sharing the video and codebase. It helped me a lot while learning ViT.
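
For context, a minimal sketch of the fixed block (layer names are assumptions, not necessarily the repo's exact code); passing x three times would silently bypass the q/k/v projection layers:

import torch
import torch.nn as nn

class AttentionBlock(nn.Module):
    def __init__(self, emb_dim, n_heads):
        super().__init__()
        self.q = nn.Linear(emb_dim, emb_dim)
        self.k = nn.Linear(emb_dim, emb_dim)
        self.v = nn.Linear(emb_dim, emb_dim)
        self.att = nn.MultiheadAttention(emb_dim, n_heads, batch_first=True)

    def forward(self, x):                    # x: (batch, tokens, emb_dim)
        q, k, v = self.q(x), self.k(x), self.v(x)
        attn_output, attn_output_weights = self.att(q, k, v)  # projected tensors, not x, x, x
        return attn_output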

gayanpathirage

Awesome man!! You code and explain with such simplicity.

hemanthvemuluri

This channel is amazing. Please continue making videos!

tenma

Nice video! However, I think it's incorrect that you would get separate vectors for the three channels. This is not how they do it in the paper; there they say that the number of patches is N = HW/P^2, where H and W are the height and width of the original image and (P, P) is the resolution of each patch, so the number of color channels doesn't affect the number of patches you get.
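
The paper's patching (Dosovitskiy et al., "An Image is Worth 16x16 Words") folds the channels into each patch vector, which a one-line einops rearrange makes explicit; a small sketch with assumed sizes:

import torch
from einops import rearrange

img = torch.randn(1, 3, 32, 32)   # (batch, channels, H, W)
P = 4                             # patch resolution (P, P)
patches = rearrange(img, 'b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1=P, p2=P)
print(patches.shape)              # torch.Size([1, 64, 48]): N = 32*32 / 4^2 = 64 patches, each 4*4*3 = 48 values

So the channels enlarge each patch vector (P*P*C values) but leave the patch count N = HW/P^2 unchanged, exactly as the comment says.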

kristoferkrus

Thank you! Very clear and informative.

netanelmad

Awesome! Thanks for the excellent explanation!

romanlyskov

Awesome video! But I wonder if you reversed the order of LayerNorm and Multi-Head Attention? I think the LayerNorm should be applied after Multi-Head Attention, but your implementation applies it before.
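
For reference, the ViT paper applies LayerNorm before every attention and MLP block (pre-norm), so LN-before-attention matches the paper; it's the original Transformer that used post-norm. A minimal pre-norm sketch (names are illustrative):

import torch
import torch.nn as nn

class PreNormAttention(nn.Module):
    def __init__(self, emb_dim, n_heads):
        super().__init__()
        self.ln = nn.LayerNorm(emb_dim)
        self.att = nn.MultiheadAttention(emb_dim, n_heads, batch_first=True)

    def forward(self, x):
        h = self.ln(x)            # pre-norm: normalize first...
        h, _ = self.att(h, h, h)  # ...then attend...
        return x + h              # ...then add the residual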

kitgary

Why are the positional embeddings learnable? It doesn't make sense to me
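
For context, "learnable" just means the positional embedding is an ordinary parameter trained by backprop instead of a fixed sine/cosine table; a tiny sketch with assumed sizes:

import torch
import torch.nn as nn

n_tokens, emb_dim = 65, 64                  # e.g. 64 patches + 1 CLS token
pos_emb = nn.Parameter(torch.randn(1, n_tokens, emb_dim) * 0.02)
tokens = torch.randn(8, n_tokens, emb_dim)  # patch (+ CLS) embeddings
tokens = tokens + pos_emb                   # broadcasts over the batch; pos_emb
                                            # receives gradients like any other weight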

cosminpetrescu

Why use dropout with GeLU? Didn’t the GeLU paper specifically say one motivation for GeLU was to replace ReLU+dropout with a single GeLU layer?

xxyyzz

Isn't the embedding layer redundant? I mean, we then have the projection matrices, meaning that embedding + projection is a composition of two linear layers.
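
The underlying linear-algebra point: two linear layers with no nonlinearity between them compose into a single linear map, as this small check shows (bias-free for brevity):

import torch
import torch.nn as nn

a = nn.Linear(16, 32, bias=False)
b = nn.Linear(32, 8, bias=False)
combined = nn.Linear(16, 8, bias=False)
with torch.no_grad():
    combined.weight.copy_(b.weight @ a.weight)          # W = W_b @ W_a

x = torch.randn(4, 16)
print(torch.allclose(b(a(x)), combined(x), atol=1e-6))  # True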

adosar

Hello, first of all, great tutorial video. I've tried running the provided training code, but after ~400 epochs the loss is still the same (~3.61) and the model always predicts the same class. Do you have an idea what the problem might be?

KacperPaszkowski-sb

Can you please make a video on how to perform inference with ViT, like Google's open-source Vision Transformer?

efexzium

Ah, tough to understand; I guess I'll have to read more on this to fully understand it.

newbie

I hope you could explain Swin Transformer object detection in a new video, please.

murphy