An image is worth 16x16 words: ViT | Vision Transformer explained

Mom, it's the Transformers again! They have come to ruin my CNN building blocks! 🥺 An Image is Worth 16x16 Words: paper explained. Is this the extinction of CNNs? Long live the Transformer?

▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
🔥 Optionally, pay us a coffee to boost our Coffee Bean production! ☕
▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀

Outline:
* 00:00 Pure Transformer for vision
* 01:17 How does it work?
* 03:58 The CNN Armageddon?

📄 Paper (not anonymous anymore): "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby

-----------------------------------
🔗 Links:

#AICoffeeBreak #MsCoffeeBean #ComputerVision #ICLR2021 #MachineLearning #AI #research

Video contains emojis designed by OpenMoji – the open-source emoji and icon project. License: CC BY-SA 4.0
Comments

I recently found this channel and I've been binge-watching your videos ever since. Great job!

yemiyesufu

Love the humor. Keep up the good work!

JohnDoe-ftmq

Great video! Especially relevant for me because I was just talking with a professor about how Transformers seem to dominate everything in NLP these days. And I think I have an inkling of who these anonymous authors are (looking at you, TPUs 😂).

dianai

The first layer of this model is still a convolution.

jonatani

Great job, lady! Watching your videos while at the gym :-)

sarahjamal

The realm dominated for centuries by CNNs. Lol. :P
Nice video, Letitia!
Do you make your own animations for the explanations of the algorithm?

ShubhamYadav-xrtw

Nice video! However, there is something I didn't quite understand: at 3:45, when you said that "the given pattern can be a limitation", are you talking about the Transformer or the CNNs?

keroldjoumessi

But once trained, can it be used for transfer learning?
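
It can; pretraining on a large dataset and then fine-tuning on smaller downstream tasks is exactly the regime the paper studies. A hedged sketch of what that might look like with the timm library (the checkpoint name and the 10-class head are illustrative assumptions, not from the video):

```python
import timm
import torch

# Load a ViT pretrained on a large dataset and swap in a new head
# for a hypothetical 10-class downstream task.
model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=10)

# Optionally freeze the Transformer backbone and train only the new head.
for name, param in model.named_parameters():
    if "head" not in name:
        param.requires_grad = False

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)
# ...followed by a standard training loop over the small target dataset.
```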

orjihvy

What can I say other than a simple "Thank you!"... 🙂

bartlomiejkubica

Btw, the first linear projection on patches of 16x16 pixels is mathematically a convolution with kernel size 16 and stride 16. So the anonymous authors are not proposing anything new :P, it is essentially very similar to non-local neural networks.
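
This equivalence is easy to check numerically. A minimal PyTorch sketch (not from the paper's code; ViT-Base-like sizes are assumed) comparing a shared linear projection over flattened 16x16 patches with a Conv2d using kernel size 16 and stride 16:

```python
import torch
import torch.nn as nn

# Illustrative sizes (assumed, roughly ViT-Base): 16x16 patches, embedding dim 768.
B, C, H, W, D, P = 2, 3, 224, 224, 768, 16
x = torch.randn(B, C, H, W)

# Patch embedding as described in the paper: flatten each 16x16 patch
# and apply one shared linear projection.
proj = nn.Linear(C * P * P, D, bias=False)
patches = x.unfold(2, P, P).unfold(3, P, P)             # (B, C, H/P, W/P, P, P)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * P * P)
emb_linear = proj(patches)                              # (B, num_patches, D)

# The same map written as a strided convolution with kernel size 16, stride 16.
conv = nn.Conv2d(C, D, kernel_size=P, stride=P, bias=False)
conv.weight.data = proj.weight.data.view(D, C, P, P)    # reuse the same weights
emb_conv = conv(x).flatten(2).transpose(1, 2)           # (B, num_patches, D)

print(torch.allclose(emb_linear, emb_conv, atol=1e-4))  # True (up to float error)
```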

speedmph

It took ViT 400M images to achieve roughly what a CNN does on ImageNet's 1M images with only 10-20M parameters, and ViT needed an order of magnitude more parameters on top of that. Simply put, in NLP there are at most a few hundred thousand words, while in vision you can imagine the far wilder diversity of images; that is why CNNs work.

nguyenanhnguyen

With a small custom dataset of ultrasound images, how can we achieve state-of-the-art performance?

NasirAlipro

Thanks, your explanation is amazing!
But can you explain it in more detail?

xwcpfon

Can this Vision Transformer be used on an audio spectrogram, and for my specific related task?

HaiderAli-nmoh

Nice video!! :) However, the last sentence of the abstract is: "Vision Transformer attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train", but you seem to say the opposite in the video. Did I miss something? Thanks.

Freeak