Vision Transformer - Keras Code Examples!!

This video walks through the Keras Code Example implementation of Vision Transformers! I see this as a huge opportunity for graduate students and researchers because this architecture has serious room for improvement. I predict that Attention will outperform CNN models like ResNets, EfficientNets, etc.; it will just take the discovery of complementary priors, e.g. custom data augmentations or pre-training tasks. I hope you find this video useful. Please check out the rest of the Keras Code Examples playlist!

Chapters
0:00 Welcome to the Keras Code Examples!
0:45 Vision Transformer Explained
2:47 TensorFlow Add-Ons
3:29 Hyperparameters
7:04 Data Augmentations
8:30 Patch Construction
11:52 Patch Embeddings
14:01 ViT Classifier
16:30 Compile and Run
19:02 Analysis of Final Performance
Comments

Amazing, few people can even explain this line by line. A great contribution to democratizing AI knowledge!

artukikemty

-1 inside reshape is a handy trick. Let's say you want to flatten a tensor of shape (batch_size, 512, 16). You can easily do that with something like tf.reshape(your_tensor, (batch_size, -1)). You don't need to explicitly specify the flattened dimension.
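A minimal sketch of this trick (TensorFlow assumed; the tensor values are illustrative):

```python
import tensorflow as tf

batch_size = 4
your_tensor = tf.random.normal((batch_size, 512, 16))

# -1 tells tf.reshape to infer that dimension from the remaining sizes:
# here 512 * 16 = 8192, so there is no need to spell it out.
flat = tf.reshape(your_tensor, (batch_size, -1))
print(flat.shape)  # (4, 8192)
```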

sayakpaul

It's so easy to implement ViT. Before, I was afraid of using those big models because I thought they would be hard to implement, but Keras and PyTorch both have MultiHeadAttention as a built-in layer!
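For reference, a minimal self-attention sketch with Keras' built-in layer (tf.keras.layers.MultiHeadAttention); the shapes are illustrative, matching the ViT example's patch sequence:

```python
import tensorflow as tf

# Self-attention over a sequence of patch embeddings.
mha = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=64)
x = tf.random.normal((2, 144, 64))  # (batch, num_patches, projection_dim)
out = mha(query=x, value=x, key=x)
print(out.shape)  # (2, 144, 64)
```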

sz

When you specify `from_logits=True`, the loss applies softmax to the raw logits internally and then computes the cross-entropy.

sayakpaul

Hi, thanks for the video :)

At 10:35, I guess the -1 comes from the number of patches. For example, if batch_size=2, the output of the tf.reshape call will have shape 2x144x108, since there are 144 patches in a 72x72 image (patch_size=6). Also, in the plotting loop we loop over the second dimension, which has 144 elements.
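A minimal sketch checking these shapes, assuming the patch extraction used in the Keras example (tf.image.extract_patches):

```python
import tensorflow as tf

batch_size, patch_size = 2, 6
images = tf.random.normal((batch_size, 72, 72, 3))  # 72x72 RGB images

patches = tf.image.extract_patches(
    images=images,
    sizes=[1, patch_size, patch_size, 1],
    strides=[1, patch_size, patch_size, 1],
    rates=[1, 1, 1, 1],
    padding="VALID",
)  # -> (2, 12, 12, 108): a 12x12 grid of patches, each 6*6*3 = 108 values

# -1 infers the number of patches: (72 / 6) ** 2 = 144.
patches = tf.reshape(patches, (batch_size, -1, patch_size * patch_size * 3))
print(patches.shape)  # (2, 144, 108)
```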

sinancalsr

Since TF 2.0, you can use the regular plus (+) operator instead of the Add layer.
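A minimal sketch of the equivalence, on a hypothetical residual connection:

```python
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(144, 64))
x = layers.Dense(64)(inputs)

# Since TF 2.0 these two lines are equivalent:
residual = layers.Add()([x, inputs])
residual = x + inputs
model = tf.keras.Model(inputs, residual)
```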

CristianGarcia

Cool job! Regarding the `from_logits=True` part: the loss expects only the logits (without the softmax activation); SparseCategoricalCrossentropy will apply softmax for you with that option.
Just be careful: if you set from_logits=True and still apply a softmax at the end of your network, the loss function will apply softmax (inside the loss) to what is already a probability distribution.
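A minimal sketch of both behaviors (the values are illustrative):

```python
import tensorflow as tf

logits = tf.constant([[2.0, 0.5, -1.0]])  # raw outputs, no softmax applied
labels = tf.constant([0])

# from_logits=True: the loss applies softmax internally.
loss_logits = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# from_logits=False (the default): inputs must already be probabilities.
probs = tf.nn.softmax(logits)
loss_probs = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)

# Both print the same value; mixing the two up silently skews the loss.
print(loss_logits(labels, logits).numpy())
print(loss_probs(labels, probs).numpy())
```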

DiogoSanti

Thank you very much for these amazing videos. Your contributions are key to applying these methods.

NehadHirmiz

Hello, thanks! I want to ask a question:
in the input section (the extra learnable [class] embedding),
what is the zero (0) index used for and what information does it contain?

abdurrahmansefer

Can we implement this ViT on our own dataset?

yaswanth

Hello! Please, can you do a video on using a Swin Transformer in an autoencoder architecture? Thank you in advance. I'm having difficulty restoring the patches back into an image (for the decoder part).

annicetrazafindratovolahy

I second your thoughts on complementary priors. In fact, BoTNets, IMO, are a step in that direction. DeiT as well.

sayakpaul

Can you share the link to this notebook?

sakibulislam

Can anybody explain this paragraph to me:

Unlike the technique described in the paper, which prepends a learnable embedding to the sequence of encoded patches to serve as the image representation, all the outputs of the final Transformer block are reshaped with layers.Flatten() and used as the image representation input to the classifier head.
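In code, the contrast looks roughly like this (a sketch; `encoded` stands in for the final Transformer block's output of shape (batch, num_patches, projection_dim)):

```python
import tensorflow as tf
from tensorflow.keras import layers

encoded = tf.keras.Input(shape=(144, 64))  # stand-in for the final block output

# Paper (ViT): prepend a learnable [class] token to the sequence and use only
# its output as the image representation:
#   representation = encoded[:, 0]

# Keras example: no class token; flatten ALL patch outputs instead.
representation = layers.Flatten()(encoded)  # (batch, 144 * 64)
logits = layers.Dense(100)(representation)  # classifier head (e.g. CIFAR-100)
model = tf.keras.Model(encoded, logits)
```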

pakistanproud

Please, I have a custom dataset with 3 folders, i.e. 3 classes. How can I use the ViT to do classification?

khalladisofiane

Maybe it's a silly question, but does ViT work on grayscale pictures?

Bomerang

Your explanation is amazing, thank you very much! But I want to ask a question: what is the projection dimension, and why is it 64 when there are 144 patches per image and the index runs from 0 to 143? Thank you again for your attention.
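For context, the PatchEncoder from the Keras example separates those two numbers: 144 is the sequence length (one entry per patch, indices 0 to 143), while 64 is the width (projection_dim) each 108-value raw patch is projected into. A sketch close to the example's code:

```python
import tensorflow as tf
from tensorflow.keras import layers

class PatchEncoder(layers.Layer):
    def __init__(self, num_patches=144, projection_dim=64):
        super().__init__()
        self.num_patches = num_patches
        # Linear projection: 108-dim raw patch -> 64-dim embedding.
        self.projection = layers.Dense(units=projection_dim)
        # One learnable position embedding per patch index 0..143.
        self.position_embedding = layers.Embedding(
            input_dim=num_patches, output_dim=projection_dim
        )

    def call(self, patch):
        positions = tf.range(start=0, limit=self.num_patches, delta=1)
        return self.projection(patch) + self.position_embedding(positions)
```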

وذكرفإنالذكرىتنفعالمؤمنين-قز

Hi,
Thank you for the explanation.
I have a question regarding the variable `position_dim`: how was it chosen? If I change the patch size, do I need to change it too?

sendjasniabderrezzaq

Guys, how do I modify the code so I can use a dataset from Kaggle?

jason-ybqk

Might be a stupid question, but how do I visualize the attention? I'm honestly confused about extracting the attention scores.

billiartag