Vision Transformer Quick Guide - Theory and Code in (almost) 15 min

▬▬ Papers / Resources ▬▬▬

▬▬ Support me if you like 🌟

▬▬ Used Music ▬▬▬▬▬▬▬▬▬▬▬
Music from #Uppbeat (free for Creators!):
License code: SMTWRWLNGHZHH0OC

▬▬ Used Icons ▬▬▬▬▬▬▬▬▬▬

▬▬ Timestamps ▬▬▬▬▬▬▬▬▬▬▬
00:00 Introduction
00:16 ViT Intro
01:12 Input embeddings
01:50 Image patching
02:54 Einops reshaping
04:13 [CODE] Patching
05:35 CLS Token
06:40 Positional Embeddings
08:09 Transformer Encoder
08:30 Multi-head attention
08:50 [CODE] Multi-head attention
09:12 Layer Norm
09:30 [CODE] Layer Norm
09:55 Feed Forward Head
10:05 [CODE] Feed Forward Head
10:21 Residuals
10:45 [CODE] final ViT
13:10 CNN vs. ViT
14:45 ViT Variants

▬▬ My equipment 💻
Comments

I've changed the output layer a bit... to this:
self.head_ln = nn.LayerNorm(emb_dim)
self.head = nn.Linear(int((1 + self.height/self.patch_size * self.width/self.patch_size) * emb_dim), out_dim)

Then in forward:

x = x.view(x.shape[0], int((1 + self.height/self.patch_size * self.width/self.patch_size) * x.shape[-1]))
out = self.head(x)

The downside is that you'll likely get a lot more overfitting, but without it the network was not really training at all.
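
For readers, a minimal self-contained sketch of the head modification described above (the class and argument names are my own, and the token count assumes a 28x28 input with 4x4 patches):

import torch
import torch.nn as nn

class FlattenedHead(nn.Module):
    # Classify from ALL token embeddings instead of only the CLS token.
    def __init__(self, n_tokens, emb_dim, out_dim):
        super().__init__()
        self.ln = nn.LayerNorm(emb_dim)
        self.head = nn.Linear(n_tokens * emb_dim, out_dim)

    def forward(self, x):                  # x: (batch, n_tokens, emb_dim)
        x = self.ln(x)
        x = x.view(x.shape[0], -1)         # flatten tokens: (batch, n_tokens * emb_dim)
        return self.head(x)

head = FlattenedHead(n_tokens=1 + 7 * 7, emb_dim=64, out_dim=10)  # CLS + 7x7 patches
logits = head(torch.randn(8, 50, 64))      # -> (8, 10)

As the commenter notes, this head has far more parameters than a CLS-only head, which is why it overfits more easily.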

JessSightler

This is a very underrated channel. You deserve way more viewers!!

geekyprogrammer

Keep making content like this; I am sure you will get very good recognition in the future. Thanks for such amazing content.

betabias

The best part of Vision Transformers is the built-in support for interpretability, compared to CNNs, where we had to compute saliency maps.
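
Concretely, the "built-in interpretability" usually means reading out the attention weights directly; a hedged sketch of the idea (the module and sizes are illustrative, not from the video):

import torch
import torch.nn as nn

att = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
x = torch.randn(1, 1 + 8 * 8, 64)            # CLS token + an 8x8 grid of patch tokens
_, weights = att(x, x, x)                    # weights: (batch, tokens, tokens), averaged over heads
cls_attn = weights[0, 0, 1:].reshape(8, 8)   # CLS row over patches -> 8x8 heatmap to overlay on the image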

VoltVipin_VS

You're awesome, man!!! I clicked on your video so fast; you're one of my favorite AI YouTubers. I work in the field, and I think you have a wonderful ability to explain complex concepts in your videos.

hmind

Really great explanation. Nice visuals

florianhonicke

There was an error in your published code, but not in the video:
attn_output, attn_output_weights = self.att(x, x, x)
It should be:
attn_output, attn_output_weights = self.att(q, k, v)

Anyway, thanks for sharing the video and codebase. It helped me a lot while learning ViT.
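
For context, a minimal sketch of the fixed block (layer names are assumptions, not necessarily the repo's exact code); passing x three times would silently bypass the q/k/v projection layers:

import torch
import torch.nn as nn

class AttentionBlock(nn.Module):
    def __init__(self, emb_dim, n_heads):
        super().__init__()
        self.q = nn.Linear(emb_dim, emb_dim)
        self.k = nn.Linear(emb_dim, emb_dim)
        self.v = nn.Linear(emb_dim, emb_dim)
        self.att = nn.MultiheadAttention(emb_dim, n_heads, batch_first=True)

    def forward(self, x):                    # x: (batch, tokens, emb_dim)
        q, k, v = self.q(x), self.k(x), self.v(x)
        attn_output, attn_output_weights = self.att(q, k, v)  # projected tensors, not x, x, x
        return attn_output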

gayanpathirage

Awesome man!! You code and explain with such simplicity.

hemanthvemuluri

This channel is amazing. Please continue making videos!

tenma

Nice video! However, I think it's incorrect that you would get separate vectors for the three channels. This is not how they do it in the paper; there they say that the number of patches is N = HW/P^2, where H and W are the height and width of the original image and (P, P) is the resolution of each patch, so the number of color channels doesn't affect the number of patches you get.
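
The paper's patching (Dosovitskiy et al., "An Image is Worth 16x16 Words") folds the channels into each patch vector, which a one-line einops rearrange makes explicit; a small sketch with assumed sizes:

import torch
from einops import rearrange

img = torch.randn(1, 3, 32, 32)   # (batch, channels, H, W)
P = 4                             # patch resolution (P, P)
patches = rearrange(img, 'b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1=P, p2=P)
print(patches.shape)              # torch.Size([1, 64, 48]): N = 32*32 / 4^2 = 64 patches, each 4*4*3 = 48 values

So the channels enlarge each patch vector (P*P*C values) but leave the patch count N = HW/P^2 unchanged, exactly as the comment says.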

kristoferkrus

Thank you! Very clear and informative.

netanelmad

Awesome! Thanks for the excellent explanation!

romanlyskov

Awesome video! But I wonder if you reversed the order of LayerNorm and Multi-Head Attention? I think the LayerNorm should be applied after Multi-Head Attention, but your implementation applies it before.
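
For reference, the ViT paper applies LayerNorm before every attention and MLP block (pre-norm), so LN-before-attention matches the paper; it's the original Transformer that used post-norm. A minimal pre-norm sketch (names are illustrative):

import torch
import torch.nn as nn

class PreNormAttention(nn.Module):
    def __init__(self, emb_dim, n_heads):
        super().__init__()
        self.ln = nn.LayerNorm(emb_dim)
        self.att = nn.MultiheadAttention(emb_dim, n_heads, batch_first=True)

    def forward(self, x):
        h = self.ln(x)            # pre-norm: normalize first...
        h, _ = self.att(h, h, h)  # ...then attend...
        return x + h              # ...then add the residual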

kitgary

Why are the positional embeddings learnable? It doesn't make sense to me
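
For context, "learnable" just means the positional embedding is an ordinary parameter trained by backprop instead of a fixed sine/cosine table; a tiny sketch with assumed sizes:

import torch
import torch.nn as nn

n_tokens, emb_dim = 65, 64                  # e.g. 64 patches + 1 CLS token
pos_emb = nn.Parameter(torch.randn(1, n_tokens, emb_dim) * 0.02)
tokens = torch.randn(8, n_tokens, emb_dim)  # patch (+ CLS) embeddings
tokens = tokens + pos_emb                   # broadcasts over the batch; pos_emb
                                            # receives gradients like any other weight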

cosminpetrescu

Why use dropout with GeLU? Didn’t the GeLU paper specifically say one motivation for GeLU was to replace ReLU+dropout with a single GeLU layer?

xxyyzz

Isn't the embedding layer redundant? I mean, we then have the projection matrices, meaning that embedding + projection is a composition of two linear layers.
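
The underlying linear-algebra point: two linear layers with no nonlinearity between them compose into a single linear map, as this small check shows (bias-free for brevity):

import torch
import torch.nn as nn

a = nn.Linear(16, 32, bias=False)
b = nn.Linear(32, 8, bias=False)
combined = nn.Linear(16, 8, bias=False)
with torch.no_grad():
    combined.weight.copy_(b.weight @ a.weight)          # W = W_b @ W_a

x = torch.randn(4, 16)
print(torch.allclose(b(a(x)), combined(x), atol=1e-6))  # True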

adosar

Hello, first of all, great tutorial video. I've tried running the provided training code, but after ~400 epochs the loss is still the same (~3.61) and the model always predicts the same class. Do you have an idea what the problem might be?

KacperPaszkowski-sb

Can you please make a video on how to perform inference with ViT, like Google's open-source Vision Transformer?

efexzium

Ah, tough to understand; I guess I'll have to read more on this to fully understand it.

newbie

I hope you could explain Swin Transformer object detection in a new video, please.

murphy