OpenAI CLIP: Connecting Text and Images

CLIP is a model that connects text and images. It was pre-trained on 400 million (image, text) pairs for the task of predicting which caption goes with which image. CLIP can be applied to any visual classification benchmark by simply providing the names of the visual categories to be recognized, similar to the “zero-shot” capabilities of GPT-2 and GPT-3.
It has been tested on 30+ computer vision tasks such as OCR, action recognition in videos, and geo-localization. Zero-shot CLIP is often competitive with a fully supervised baseline; for example, on ImageNet, zero-shot CLIP matches the accuracy of a ResNet-50 trained on the 1.28M-image training set. The eight CLIP models trained show smooth accuracy improvements with scale.
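
The pretraining task above is a symmetric contrastive objective over a batch of matched (image, text) pairs. Below is a minimal PyTorch sketch of that loss, loosely following the pseudocode in the paper; the function name and the fixed temperature value are illustrative (CLIP actually learns the temperature as a parameter):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    # L2-normalize both sets of embeddings so the dot product is a cosine similarity
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # [batch, batch] similarity matrix: entry (i, j) scores image i against caption j
    logits = image_embeds @ text_embeds.t() / temperature

    # the matching pairs lie on the diagonal
    targets = torch.arange(logits.size(0), device=logits.device)

    # symmetric cross-entropy: pick the right caption per image and the right image per caption
    loss_images = F.cross_entropy(logits, targets)
    loss_texts = F.cross_entropy(logits.t(), targets)
    return (loss_images + loss_texts) / 2
```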

In this video, I will briefly provide an overview of CLIP, its pretraining data, and its pretraining architecture. We will also talk about its zero-shot performance, robustness to distribution shifts, and comparison to human performance.

Here is the agenda:

00:00:00 What is OpenAI CLIP?
00:02:09 What is contrastive pretraining? And why?
00:05:20 What dataset was used for contrastive pretraining?
00:06:30 What is the architecture of CLIP models?
00:08:38 How is CLIP used for zero-shot classification? (a usage sketch follows this agenda)
00:12:02 How does zero-shot CLIP perform compared to an equivalent supervised classifier?
00:17:36 How do CLIP representations perform compared to other ImageNet trained representations?
00:19:46 CLIP’s robustness to Natural Distribution Shifts
00:21:23 Comparison to Human Performance
00:23:58 Bias
00:27:38 Image classification examples.
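
As a companion to the zero-shot classification segment, here is a minimal sketch of zero-shot inference with a released CLIP checkpoint. It uses the Hugging Face transformers library; the library choice, checkpoint name, labels, and image path are assumptions for illustration, not part of the video or the paper:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# assumed checkpoint name; any released CLIP checkpoint works the same way
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# class names become captions via a prompt template, e.g. "a photo of a {label}"
labels = ["cat", "dog", "car"]
prompts = [f"a photo of a {label}" for label in labels]

image = Image.open("example.jpg")  # hypothetical input image
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them into class probabilities
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```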

Radford, Alec, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry et al. "Learning transferable visual models from natural language supervision." In International Conference on Machine Learning, pp. 8748-8763. PMLR, 2021.
Comments

Explained it very well, Sir. Thank you so much, and keep posting content like this.

sibik

1. Linear probing means taking the features from the model and just putting a dense output layer on top of them. In this case, we take the visual encoder of CLIP and put a softmax output layer on top. We freeze the encoder layers; that is why it is called probing.
2. Also, CLIP is not truly a multimodal model. It is a great zero-shot image classifier that makes use of natural-language descriptions of the class labels.

dlByManish
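
For readers who want to see what the linear probing described in the comment above looks like, here is a minimal sketch that freezes a CLIP image encoder and trains only a linear classifier on its features. It assumes PyTorch and the Hugging Face transformers CLIP implementation; the checkpoint name, class count, and training-step helper are illustrative:

```python
import torch
import torch.nn as nn
from transformers import CLIPModel

# assumed checkpoint; the probe works the same way for any CLIP image encoder
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
for param in clip.parameters():
    param.requires_grad = False  # freeze the pretrained encoder: only the probe is trained

num_classes = 10  # hypothetical downstream task
probe = nn.Linear(clip.config.projection_dim, num_classes)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)

def train_step(pixel_values, labels):
    """One training step of the linear probe on frozen CLIP image features."""
    with torch.no_grad():
        features = clip.get_image_features(pixel_values=pixel_values)
    logits = probe(features)  # softmax is folded into cross_entropy
    loss = nn.functional.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```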

Great video, Manish sir! Could we get these slides for future reference?

adityay