VirTex: Learning Visual Representations from Textual Annotations (Paper Explained)

Pre-training a CNN backbone for visual transfer learning has recently seen a big push in the direction of incorporating more data at the cost of less supervision. This paper investigates the opposite: visual transfer learning by pre-training on very few, but very high-quality, samples via an image captioning task.

OUTLINE:
0:00 - Intro & Overview
1:00 - Pre-Training for Visual Tasks
3:40 - Quality-Quantity Tradeoff
5:50 - Image Captioning
8:35 - VirTex Method
14:30 - Linear Classification
20:30 - Ablations
22:05 - Fine-Tuning
25:45 - Attention Visualization
27:30 - Conclusion & Remarks

Abstract:
The de-facto approach to many vision tasks is to start from pretrained visual representations, typically learned via supervised training on ImageNet. Recent methods have explored unsupervised pretraining to scale to vast quantities of unlabeled images. In contrast, we aim to learn high-quality visual representations from fewer images. To this end, we revisit supervised pretraining, and seek data-efficient alternatives to classification-based pretraining. We propose VirTex -- a pretraining approach using semantically dense captions to learn visual representations. We train convolutional networks from scratch on COCO Captions, and transfer them to downstream recognition tasks including image classification, object detection, and instance segmentation. On all tasks, VirTex yields features that match or exceed those learned on ImageNet -- supervised or unsupervised -- despite using up to ten times fewer images.
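The following is a minimal PyTorch sketch of that setup, under my own assumptions (it is not the authors' code; the actual VirTex uses bidirectional captioning and more careful training details): a randomly initialized ResNet-50 produces a spatial feature grid, a small Transformer decoder predicts caption tokens conditioned on it, and after pretraining only the visual backbone is kept for downstream transfer. The vocabulary size, hidden size, and toy batch are illustrative.

import torch
import torch.nn as nn
import torchvision

class VirTexSketch(nn.Module):
    # Caption-based pretraining sketch: visual backbone + textual head (assumed sizes).
    def __init__(self, vocab_size=10000, hidden=512, num_layers=3, num_heads=8):
        super().__init__()
        # Visual backbone: ResNet-50 trained from scratch (no ImageNet weights).
        resnet = torchvision.models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # keep the 7x7 feature grid
        self.visual_proj = nn.Linear(2048, hidden)
        # Textual head: token embeddings + Transformer decoder over caption tokens.
        self.token_emb = nn.Embedding(vocab_size, hidden)
        layer = nn.TransformerDecoderLayer(d_model=hidden, nhead=num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.lm_head = nn.Linear(hidden, vocab_size)

    def forward(self, images, captions):
        feats = self.backbone(images)                                  # (B, 2048, 7, 7)
        memory = self.visual_proj(feats.flatten(2).transpose(1, 2))    # (B, 49, hidden)
        tgt = self.token_emb(captions)                                 # (B, T, hidden)
        T = captions.size(1)
        causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        out = self.decoder(tgt, memory, tgt_mask=causal_mask)
        return self.lm_head(out)                                       # (B, T, vocab_size)

# One pretraining step on a toy batch: predict the next caption token from the image.
model = VirTexSketch()
images = torch.randn(2, 3, 224, 224)
captions = torch.randint(0, 10000, (2, 12))
logits = model(images, captions[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                   captions[:, 1:].reshape(-1))
loss.backward()

# After pretraining, the textual head is discarded and model.backbone is transferred
# to downstream classification, detection, or segmentation.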

Authors: Karan Desai, Justin Johnson

Comments

I really like the general direction in which the world is moving, towards general AI models that handle both images and text with a similar architecture. It's amazing.

pulkitgera

Wonderful explanation. As a starting point while reading the paper, it's very helpful for getting an overall idea. Thanks for all your help to the community.

sankarshanmridha

Do you actually not sleep? Thanks though! Your explanations are quite helpful!
But still!

arkamitra

I think the authors of the paper wanted to make a state-of-the-art text annotation model. When they failed, they pivoted and improvised :D

Regardless, pretty cool paper.

28:50 Yes. One more thing we should keep in mind is that these images contain multiple objects per image, whereas in the ImageNet dataset there is only one object per image. So calling it a smaller dataset is technically correct, but it's not the complete story.

herp_derpingson

Thanks for bringing this fresh new content every day!

JinayShah

Working on its implementation in Keras. It's so
Thank you

parmarsuraj

I have a doubt. In this video, at around 6:30, you said that datasets with captions are more expensive to get than classification datasets like ImageNet, because the annotations are descriptive. But in the paper they say the reverse: page 1, last paragraph, beginning with "Another benefit of textual annotations is simplified data collection...". Also in Section 4.1, under the subheading "Annotation cost efficiency", the authors state and argue that "using captions is appealing due to a simple and cost-efficient collection pipeline." Please clear my doubt and correct me if my understanding is wrong. Thank you.

debolenabasak

It's all about multi-modality and recognising that knowledge can be represented differently through different features: text features can simply be easier to learn from, while also being grounded in visual features. These are all complementary to each other, and our models need to use and navigate between all these representations to actually learn good knowledge.

nikolaiilinykh

*text has entered the computer-vision-pretraining chat*

DistortedV

The attention areas over images make me wonder whether you could use an image segmentation dataset to teach the attention weights where to focus.

andres_pq

Another great one! I would have loved it if the authors had also included a comparison with SimCLR.

sayakpaul

One could filter image captions from the internet based on their similarity to a caption generated by a state-of-the-art image captioning model, so as to retain only the higher-quality ones.

INLF
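A rough sketch of how that filtering might look, as an illustration of the comment above rather than anything from the paper: embed each web caption and a model-generated caption for the same image, and keep only pairs whose cosine similarity clears a threshold. The embedding model, the 0.6 threshold, and the toy captions are all assumptions; in practice the generated captions would come from running a captioning model on each image.

from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed choice of text embedding model

def filter_captions(web_captions, generated_captions, threshold=0.6):
    # Keep web captions that roughly agree with the model-generated caption for the same image.
    web_emb = encoder.encode(web_captions, convert_to_tensor=True)
    gen_emb = encoder.encode(generated_captions, convert_to_tensor=True)
    sims = util.cos_sim(web_emb, gen_emb).diagonal()  # per-image similarity
    return [c for c, s in zip(web_captions, sims) if s >= threshold]

# Toy usage: the spam-like caption is likely to be dropped.
kept = filter_captions(
    ["a dog playing in the park", "buy cheap watches online"],
    ["a brown dog running on grass", "a person standing in a kitchen"],
)
print(kept)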

Hi Yannic, thank you for such a wonderful commentary on an insightful paper. I have a query: since they are comparing with ImageNet, which is a high-quality dataset, does the quality in the quality-quantity trade-off refer to the quality of the pre-training task rather than necessarily the annotation quality of the dataset? After all, image captioning is a higher-quality task than classification.

abhilashnandy

This is the direction to go for general AI. It brings together text and images/videos. Next step: train a network to produce an image or a video from text, then train the network to do your general bidding.

victorrielly

How do you record your videos? What tools?

hafezfarazi

Do you think that even if the accuracy doesn't hold up with lower-quality captions, it would still be higher than that of regular supervised learning? Maybe there is a quality-quantity tradeoff for image captioning as well.

siyn