Vision Transformer - Keras Code Examples!!

This video walks through the Keras Code Example implementation of Vision Transformers! I see this as a huge opportunity for graduate students and researchers because this architecture has serious room for improvement. I predict that Attention will outperform CNN models like ResNets, EfficientNets, etc.; it will just take the discovery of complementary priors, e.g. custom data augmentations or pre-training tasks. I hope you find this video useful. Please check out the rest of the Keras Code Examples playlist!

Chapters
0:00 Welcome to the Keras Code Examples!
0:45 Vision Transformer Explained
2:47 TensorFlow Add-Ons
3:29 Hyperparameters
7:04 Data Augmentations
8:30 Patch Construction
11:52 Patch Embeddings
14:01 ViT Classifier
16:30 Compile and Run
19:02 Analysis of Final Performance
Comments

Amazing, few people can even explain this line by line. A great contribution to democratizing AI knowledge!

artukikemty

-1 inside reshape is a handy trick. Let's say you want to flatten a tensor of shape (batch_size, 512, 16). You can easily do that with something like tf.reshape(your_tensor, (batch_size, -1)). You don't need to explicitly specify the flattened dimension.
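A minimal sketch of this trick (TensorFlow assumed; the tensor values are illustrative):

```python
import tensorflow as tf

batch_size = 4
your_tensor = tf.random.normal((batch_size, 512, 16))

# -1 tells tf.reshape to infer that dimension from the remaining sizes:
# here 512 * 16 = 8192, so there is no need to spell it out.
flat = tf.reshape(your_tensor, (batch_size, -1))
print(flat.shape)  # (4, 8192)
```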

sayakpaul

It's so easy to implement ViT. Before, I was afraid of using those big models because I thought they would be hard to implement, but Keras and PyTorch both have MultiHeadAttention as a built-in layer!
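For reference, a minimal self-attention sketch with Keras' built-in layer (tf.keras.layers.MultiHeadAttention); the shapes are illustrative, matching the ViT example's patch sequence:

```python
import tensorflow as tf

# Self-attention over a sequence of patch embeddings.
mha = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=64)
x = tf.random.normal((2, 144, 64))  # (batch, num_patches, projection_dim)
out = mha(query=x, value=x, key=x)
print(out.shape)  # (2, 144, 64)
```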

sz

When you specify `from_logits=True`, the loss applies softmax to the raw logits internally and then computes the cross-entropy.

sayakpaul

Hi, thanks for the video :)

At 10:35, I guess the -1 comes from the number of patches. For example, if batch_size=2, the output of the tf.reshape call will have shape 2x144x108, since there are 144 patches in a 72x72 image (patch_size=6). Also, in the plotting loop we loop over the second dimension, which has 144 elements.
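A minimal sketch checking these shapes, assuming the patch extraction used in the Keras example (tf.image.extract_patches):

```python
import tensorflow as tf

batch_size, patch_size = 2, 6
images = tf.random.normal((batch_size, 72, 72, 3))  # 72x72 RGB images

patches = tf.image.extract_patches(
    images=images,
    sizes=[1, patch_size, patch_size, 1],
    strides=[1, patch_size, patch_size, 1],
    rates=[1, 1, 1, 1],
    padding="VALID",
)  # -> (2, 12, 12, 108): a 12x12 grid of patches, each 6*6*3 = 108 values

# -1 infers the number of patches: (72 / 6) ** 2 = 144.
patches = tf.reshape(patches, (batch_size, -1, patch_size * patch_size * 3))
print(patches.shape)  # (2, 144, 108)
```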

sinancalsr

Since TF 2.0, you can use the regular plus (+) operator instead of the Add layer.
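A minimal sketch of the equivalence, on a hypothetical residual connection:

```python
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(144, 64))
x = layers.Dense(64)(inputs)

# Since TF 2.0 these two lines are equivalent:
residual = layers.Add()([x, inputs])
residual = x + inputs
model = tf.keras.Model(inputs, residual)
```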

CristianGarcia

Cool job! Regarding the `from_logits=True` part: the loss expects only the logits (without the softmax activation); SparseCategoricalCrossentropy will apply softmax for you with that option.
Just be careful: if you set from_logits=True and still apply a softmax at the end of your network, the loss function will apply softmax (inside the loss) to what is already a probability distribution.
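A minimal sketch of both behaviors (the values are illustrative):

```python
import tensorflow as tf

logits = tf.constant([[2.0, 0.5, -1.0]])  # raw outputs, no softmax applied
labels = tf.constant([0])

# from_logits=True: the loss applies softmax internally.
loss_logits = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# from_logits=False (the default): inputs must already be probabilities.
probs = tf.nn.softmax(logits)
loss_probs = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)

# Both print the same value; mixing the two up silently skews the loss.
print(loss_logits(labels, logits).numpy())
print(loss_probs(labels, probs).numpy())
```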

DiogoSanti

Thank you very much for these amazing videos. Your contributions are key to applying these methods.

NehadHirmiz

Hello, thanks! I want to ask a question:
in the input section (the extra learnable [class] embedding),
what is the zero (0) index used for and what information does it contain?

abdurrahmansefer

Can we implement this ViT on our own dataset?

yaswanth

Hello! Please, can you do a video on using a Swin Transformer in an autoencoder architecture? Thank you in advance. I'm having difficulty restoring the patches back into an image (for the decoder part).

annicetrazafindratovolahy

I second your thoughts on complementary priors. In fact, BoTNets, IMO, are a step in that direction. DeiT as well.

sayakpaul

Can you share the link to this notebook?

sakibulislam

Can anybody explain this paragraph to me:

Unlike the technique described in the paper, which prepends a learnable embedding to the sequence of encoded patches to serve as the image representation, all the outputs of the final Transformer block are reshaped with layers.Flatten() and used as the image representation input to the classifier head.
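In code, the contrast looks roughly like this (a sketch; `encoded` stands in for the final Transformer block's output of shape (batch, num_patches, projection_dim)):

```python
import tensorflow as tf
from tensorflow.keras import layers

encoded = tf.keras.Input(shape=(144, 64))  # stand-in for the final block output

# Paper (ViT): prepend a learnable [class] token to the sequence and use only
# its output as the image representation:
#   representation = encoded[:, 0]

# Keras example: no class token; flatten ALL patch outputs instead.
representation = layers.Flatten()(encoded)  # (batch, 144 * 64)
logits = layers.Dense(100)(representation)  # classifier head (e.g. CIFAR-100)
model = tf.keras.Model(encoded, logits)
```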

pakistanproud

Please, I have a custom dataset with 3 folders, i.e. 3 classes. How can I use the ViT to do classification?

khalladisofiane

Maybe it's a silly question, but does ViT work on grayscale pictures?

Bomerang

Your explanation is amazing, thank you very much! But I want to ask a question: what is the projection dimension, and why is it 64 when there are 144 patches per image and the index runs from 0 to 143? Thank you again for your attention.
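For context, the PatchEncoder from the Keras example separates those two numbers: 144 is the sequence length (one entry per patch, indices 0 to 143), while 64 is the width (projection_dim) each 108-value raw patch is projected into. A sketch close to the example's code:

```python
import tensorflow as tf
from tensorflow.keras import layers

class PatchEncoder(layers.Layer):
    def __init__(self, num_patches=144, projection_dim=64):
        super().__init__()
        self.num_patches = num_patches
        # Linear projection: 108-dim raw patch -> 64-dim embedding.
        self.projection = layers.Dense(units=projection_dim)
        # One learnable position embedding per patch index 0..143.
        self.position_embedding = layers.Embedding(
            input_dim=num_patches, output_dim=projection_dim
        )

    def call(self, patch):
        positions = tf.range(start=0, limit=self.num_patches, delta=1)
        return self.projection(patch) + self.position_embedding(positions)
```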

وذكرفإنالذكرىتنفعالمؤمنين-قز

Hi,
Thank you for the explanation.
I have a question regarding the variable `position_dim`: how was it chosen? If I change the patch size, do I need to change it too?

sendjasniabderrezzaq

Guys, how do I modify the code so I can use a dataset from Kaggle?

jason-ybqk

Might be a stupid question, but how do I visualize the attention? I'm honestly confused about extracting the attention scores.

billiartag