VirTex: Learning Visual Representations from Textual Annotations (Paper Explained)

Pre-training a CNN backbone for visual transfer learning has recently seen a big push in the direction of incorporating more data at the cost of less supervision. This paper investigates the opposite: visual transfer learning by pre-training on very few, but very high-quality, samples via an image captioning task.

OUTLINE:
0:00 - Intro & Overview
1:00 - Pre-Training for Visual Tasks
3:40 - Quality-Quantity Tradeoff
5:50 - Image Captioning
8:35 - VirTex Method
14:30 - Linear Classification
20:30 - Ablations
22:05 - Fine-Tuning
25:45 - Attention Visualization
27:30 - Conclusion & Remarks

Abstract:
The de-facto approach to many vision tasks is to start from pretrained visual representations, typically learned via supervised training on ImageNet. Recent methods have explored unsupervised pretraining to scale to vast quantities of unlabeled images. In contrast, we aim to learn high-quality visual representations from fewer images. To this end, we revisit supervised pretraining, and seek data-efficient alternatives to classification-based pretraining. We propose VirTex -- a pretraining approach using semantically dense captions to learn visual representations. We train convolutional networks from scratch on COCO Captions, and transfer them to downstream recognition tasks including image classification, object detection, and instance segmentation. On all tasks, VirTex yields features that match or exceed those learned on ImageNet -- supervised or unsupervised -- despite using up to ten times fewer images.
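The following is a minimal PyTorch sketch of that setup, under my own assumptions (it is not the authors' code; the actual VirTex uses bidirectional captioning and more careful training details): a randomly initialized ResNet-50 produces a spatial feature grid, a small Transformer decoder predicts caption tokens conditioned on it, and after pretraining only the visual backbone is kept for downstream transfer. The vocabulary size, hidden size, and toy batch are illustrative.

import torch
import torch.nn as nn
import torchvision

class VirTexSketch(nn.Module):
    # Caption-based pretraining sketch: visual backbone + textual head (assumed sizes).
    def __init__(self, vocab_size=10000, hidden=512, num_layers=3, num_heads=8):
        super().__init__()
        # Visual backbone: ResNet-50 trained from scratch (no ImageNet weights).
        resnet = torchvision.models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # keep the 7x7 feature grid
        self.visual_proj = nn.Linear(2048, hidden)
        # Textual head: token embeddings + Transformer decoder over caption tokens.
        self.token_emb = nn.Embedding(vocab_size, hidden)
        layer = nn.TransformerDecoderLayer(d_model=hidden, nhead=num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.lm_head = nn.Linear(hidden, vocab_size)

    def forward(self, images, captions):
        feats = self.backbone(images)                                  # (B, 2048, 7, 7)
        memory = self.visual_proj(feats.flatten(2).transpose(1, 2))    # (B, 49, hidden)
        tgt = self.token_emb(captions)                                 # (B, T, hidden)
        T = captions.size(1)
        causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        out = self.decoder(tgt, memory, tgt_mask=causal_mask)
        return self.lm_head(out)                                       # (B, T, vocab_size)

# One pretraining step on a toy batch: predict the next caption token from the image.
model = VirTexSketch()
images = torch.randn(2, 3, 224, 224)
captions = torch.randint(0, 10000, (2, 12))
logits = model(images, captions[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                   captions[:, 1:].reshape(-1))
loss.backward()

# After pretraining, the textual head is discarded and model.backbone is transferred
# to downstream classification, detection, or segmentation.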

Authors: Karan Desai, Justin Johnson

Comments

I really like the general direction in which the world is moving, towards general AI models that handle both images and text with a similar architecture. It's amazing.

pulkitgera

Wonderful explanation. As a starting point while reading the paper, it's very helpful for getting an overall idea. Thanks for all your help to the community.

sankarshanmridha

Do you actually not sleep? Thanks though! Your explanations are quite helpful!
But still!

arkamitra

I think the authors of the paper wanted to make a state-of-the-art text annotation model. When they failed, they pivoted and improvised :D

Regardless, pretty cool paper.

28:50 Yes. One more thing we should keep in mind is that these images contain multiple objects per image, whereas in the ImageNet dataset there is only one object per image. So calling it a smaller dataset is technically correct, but it's not the complete story.

herp_derpingson

Thanks for bringing this fresh new content every day!

JinayShah

Working on its implementation in Keras. It's so
Thank you

parmarsuraj

I have a doubt. In this video, at around 6:30, you said that datasets with captions are more expensive to get than classification datasets like ImageNet, because the annotations are descriptive. But in the paper they say the reverse: page 1, last paragraph, beginning with "Another benefit of textual annotations is simplified data collection...". Also in Section 4.1, under the subheading "Annotation cost efficiency", the authors state and argue that "using captions is appealing due to a simple and cost-efficient collection pipeline." Please clear my doubt and correct me if my understanding is wrong. Thank you.

debolenabasak

It's all about multi-modality and recognising that knowledge can be represented differently through different features: text features can simply be easier to learn from, while also being grounded in visual features. These are all complementary to each other, and our models need to use and navigate between all these representations to actually learn good knowledge.

nikolaiilinykh

*text has entered the computer-vision-pretraining chat*

DistortedV

The attention areas over images make me wonder whether you could use an image segmentation dataset to teach the attention weights where to focus.

andres_pq

Another great one! I would have loved it if the authors had also included a comparison with SimCLR.

sayakpaul

One could filter image captions from the internet based on their similarity to a caption generated by a state-of-the-art image captioning model, so as to retain only the higher-quality ones.

INLF
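A rough sketch of how that filtering might look, as an illustration of the comment above rather than anything from the paper: embed each web caption and a model-generated caption for the same image, and keep only pairs whose cosine similarity clears a threshold. The embedding model, the 0.6 threshold, and the toy captions are all assumptions; in practice the generated captions would come from running a captioning model on each image.

from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed choice of text embedding model

def filter_captions(web_captions, generated_captions, threshold=0.6):
    # Keep web captions that roughly agree with the model-generated caption for the same image.
    web_emb = encoder.encode(web_captions, convert_to_tensor=True)
    gen_emb = encoder.encode(generated_captions, convert_to_tensor=True)
    sims = util.cos_sim(web_emb, gen_emb).diagonal()  # per-image similarity
    return [c for c, s in zip(web_captions, sims) if s >= threshold]

# Toy usage: the spam-like caption is likely to be dropped.
kept = filter_captions(
    ["a dog playing in the park", "buy cheap watches online"],
    ["a brown dog running on grass", "a person standing in a kitchen"],
)
print(kept)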

Hi Yannic, thank you for such a wonderful commentary on an insightful paper. I have a query: since they are comparing with ImageNet, which is a high-quality dataset, does the quality in the quality-quantity trade-off refer to the quality of the pre-training task rather than necessarily the annotation quality of the dataset? After all, image captioning is a higher-quality task than classification.

abhilashnandy

This is the direction to go for general AI. It brings together text and images/videos. Next step: train a network to produce an image or a video from text, then train the network to do your general bidding.

victorrielly

How do you record your videos? What tools?

hafezfarazi

Do you think that even if the accuracy doesn't hold up with lower-quality captions, it would still be higher than that of regular supervised learning? Maybe there is a quality-quantity tradeoff for image captioning as well.

siyn