OpenAI CLIP: Connecting Text and Images

CLIP is a model that connects text and images. It was pre-trained on 400 million (image, text) pairs for the task of predicting which caption goes with which image. CLIP can be applied to any visual classification benchmark by simply providing the names of the visual categories to be recognized, similar to the “zero-shot” capabilities of GPT-2 and GPT-3.
It has been tested on 30+ computer vision tasks such as OCR, action recognition in videos, and geo-localization. Zero-shot CLIP is often competitive with a fully supervised baseline; for example, on ImageNet, zero-shot CLIP matches the accuracy of a ResNet-50 trained on the 1.28M-image training set. The eight CLIP models trained show smooth accuracy improvements with scale.
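
The pretraining task above is a symmetric contrastive objective over a batch of matched (image, text) pairs. Below is a minimal PyTorch sketch of that loss, loosely following the pseudocode in the paper; the function name and the fixed temperature value are illustrative (CLIP actually learns the temperature as a parameter):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    # L2-normalize both sets of embeddings so the dot product is a cosine similarity
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # [batch, batch] similarity matrix: entry (i, j) scores image i against caption j
    logits = image_embeds @ text_embeds.t() / temperature

    # the matching pairs lie on the diagonal
    targets = torch.arange(logits.size(0), device=logits.device)

    # symmetric cross-entropy: pick the right caption per image and the right image per caption
    loss_images = F.cross_entropy(logits, targets)
    loss_texts = F.cross_entropy(logits.t(), targets)
    return (loss_images + loss_texts) / 2
```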

In this video, I will briefly provide an overview of CLIP, its pretraining data, and its pretraining architecture. We will also talk about its zero-shot performance, robustness to distribution shifts, and comparison to human performance.

Here is the agenda:

00:00:00 What is OpenAI CLIP?
00:02:09 What is contrastive pretraining? And why?
00:05:20 What dataset was used for contrastive pretraining?
00:06:30 What is the architecture of CLIP models?
00:08:38 How is CLIP used for zero-shot classification? (a usage sketch follows this agenda)
00:12:02 How does zero-shot CLIP perform compared to an equivalent supervised classifier?
00:17:36 How do CLIP representations perform compared to other ImageNet trained representations?
00:19:46 CLIP’s robustness to Natural Distribution Shifts
00:21:23 Comparison to Human Performance
00:23:58 Bias
00:27:38 Image classification examples.
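
As a companion to the zero-shot classification segment, here is a minimal sketch of zero-shot inference with a released CLIP checkpoint. It uses the Hugging Face transformers library; the library choice, checkpoint name, labels, and image path are assumptions for illustration, not part of the video or the paper:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# assumed checkpoint name; any released CLIP checkpoint works the same way
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# class names become captions via a prompt template, e.g. "a photo of a {label}"
labels = ["cat", "dog", "car"]
prompts = [f"a photo of a {label}" for label in labels]

image = Image.open("example.jpg")  # hypothetical input image
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them into class probabilities
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```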

Radford, Alec, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry et al. "Learning transferable visual models from natural language supervision." In International Conference on Machine Learning, pp. 8748-8763. PMLR, 2021.
Comments

Explained it very well, Sir. Thank you so much, and keep posting content like this.

sibik

1. Linear probing means taking the features from the model and just putting a dense output layer on top of them. In this case, we take the visual encoder of CLIP and put a softmax output layer on top. We freeze the encoder layers; that is why it is called probing.
2. Also, CLIP is not truly a multimodal model. It is a great zero-shot image classifier that makes use of natural-language descriptions of the class labels.

dlByManish
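
For readers who want to see what the linear probing described in the comment above looks like, here is a minimal sketch that freezes a CLIP image encoder and trains only a linear classifier on its features. It assumes PyTorch and the Hugging Face transformers CLIP implementation; the checkpoint name, class count, and training-step helper are illustrative:

```python
import torch
import torch.nn as nn
from transformers import CLIPModel

# assumed checkpoint; the probe works the same way for any CLIP image encoder
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
for param in clip.parameters():
    param.requires_grad = False  # freeze the pretrained encoder: only the probe is trained

num_classes = 10  # hypothetical downstream task
probe = nn.Linear(clip.config.projection_dim, num_classes)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)

def train_step(pixel_values, labels):
    """One training step of the linear probe on frozen CLIP image features."""
    with torch.no_grad():
        features = clip.get_image_features(pixel_values=pixel_values)
    logits = probe(features)  # softmax is folded into cross_entropy
    loss = nn.functional.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```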

Great video, Manish sir! Could we get these slides for future reference?

adityay