An image is worth 16x16 words: ViT | Vision Transformer explained

Mom, it's the Transformers again! They have come to ruin my CNN building blocks! 🥺 An Image is Worth 16x16 Words: paper explained. Is this the extinction of CNNs? Long live the Transformer?

Outline:
* 00:00 Pure Transformer for vision
* 01:17 How does it work?
* 03:58 The CNN Armageddon?

📄 Paper (not anonymous anymore): "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby

#AICoffeeBreak #MsCoffeeBean #ComputerVision #ICLR2021 #MachineLearning #AI #research

Video contains emojis designed by OpenMoji – the open-source emoji and icon project. License: CC BY-SA 4.0

Comments

I recently found this channel and I've been binge-watching your videos ever since. Great Job!

yemiyesufu

Love them humors. Keep up the good work

JohnDoe-ftmq

Great video! Especially relevant for me because I was just talking with a professor about how transformers seem to dominate everything in NLP these days. And I think I have an inkling of who these anonymous authors are: looking at you, TPUs 😂

dianai

The first layer of this model is still a convolution.

jonatani

Great job lady! Watching your videos while in the gym :-)

sarahjamal

The realm dominated for centuries by CNNs. Lol. :P
Nice video Letitia!
Do you make your own animations for the explanations of the algorithm?

ShubhamYadav-xrtw

Nice video. However, I did not understand one thing: at 3:45, when you said that "the given pattern can be a limitation", were you talking about the transformer or the CNNs?

keroldjoumessi

But once trained, can it be used as part of transfer learning?

orjihvy

What can I say other than a simple "Thank you!"... 🙂

bartlomiejkubica

By the way, the first linear projection on patches of 16x16 pixels is mathematically equivalent to a convolution with kernel size 16 and stride 16. So the anonymous authors are not proposing anything new :P; it is essentially very similar to non-local neural networks.

speedmph
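
The equivalence claimed in the comment above can be checked numerically. What follows is only a minimal sketch, not the paper's implementation: it assumes a single-channel image, no bias term, and toy sizes, and the function names `patch_embed_linear` and `patch_embed_conv` are made up for illustration. It computes patch tokens once by flattening each patch and applying a linear projection, and once by sliding a patch-sized kernel with stride equal to the kernel size, then compares the two results.

```python
import math
import random


def patch_embed_linear(image, weight, patch):
    """Cut the image into non-overlapping patches, flatten each, project linearly."""
    height, width = len(image), len(image[0])
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = [image[top + i][left + j]
                    for i in range(patch) for j in range(patch)]
            tokens.append([sum(w * x for w, x in zip(row, flat)) for row in weight])
    return tokens


def patch_embed_conv(image, weight, patch):
    """Slide one patch-sized kernel per output channel with stride == kernel size."""
    height, width = len(image), len(image[0])
    # Reshape each weight row (length patch * patch) into a patch x patch kernel.
    kernels = [[row[i * patch:(i + 1) * patch] for i in range(patch)]
               for row in weight]
    tokens = []
    for top in range(0, height, patch):          # vertical stride = patch
        for left in range(0, width, patch):      # horizontal stride = patch
            tokens.append([
                sum(k[i][j] * image[top + i][left + j]
                    for i in range(patch) for j in range(patch))
                for k in kernels
            ])
    return tokens


# Toy values chosen for this sketch (assumptions, not taken from the video).
rng = random.Random(0)
PATCH, SIZE, EMBED_DIM = 4, 8, 3
image = [[rng.random() for _ in range(SIZE)] for _ in range(SIZE)]
weight = [[rng.uniform(-1, 1) for _ in range(PATCH * PATCH)]
          for _ in range(EMBED_DIM)]

tokens_linear = patch_embed_linear(image, weight, PATCH)
tokens_conv = patch_embed_conv(image, weight, PATCH)

# Both routes should agree token by token, up to floating-point tolerance.
same = all(math.isclose(a, b, abs_tol=1e-12)
           for tl, tc in zip(tokens_linear, tokens_conv)
           for a, b in zip(tl, tc))
print(len(tokens_linear), same)
```

With the same arithmetic scaled up, a 224x224 image cut into 16x16 patches gives 14 x 14 = 196 tokens. The check covers only the patch embedding step; it says nothing about the attention layers that follow it.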

It took ViT about 300M images (JFT-300M) to achieve just about what a CNN does on ImageNet's roughly 1M images, and the CNN needs only 10-20M params, while ViT takes an order of magnitude more. Simply put, in NLP there are at most a few hundred thousand words. In imaging, you can guess how wildly diverse images are, and that is why CNNs work.

nguyenanhnguyen

With a small custom dataset of ultrasound images, how can we achieve state-of-the-art performance?

NasirAlipro

Thanks, your explanation is amazing.
But can you explain it in more detail?

xwcpfon

Can this vision transformer be used on audio spectrograms, and for my specific related task?

HaiderAli-nmoh

Nice video!! :) However, the last sentence of the abstract is: "Vision Transformer attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train", but you seem to say the opposite in the video. Did I miss something? Thanks.

Freeak
