OpenAI CLIP Explained | Multi-modal ML


OpenAI's CLIP explained simply and intuitively with visuals and code. Language models (LMs) cannot rely on language alone. That is the idea behind the "Experience Grounds Language" paper, which proposes a framework to measure LMs' current and future progress. A key idea is that, beyond a certain threshold, LMs need other forms of data, such as visual input.

The next step beyond well-known language models such as BERT, GPT-3, and T5 is "World Scope 3". In World Scope 3, we move from large text-only datasets to large multi-modal datasets. That is, datasets containing information from multiple forms of media, like *both* images and text.

The world, both digital and real, is multi-modal. We perceive the world as an orchestra of language, imagery, video, smell, touch, and more. This chaotic ensemble produces an inner state, our "model" of the outside world.

AI must move in the same direction. Even specialist models that focus on language or vision must, at some point, have input from the other modalities. How can a model fully understand the concept of the word "person" without *seeing* a person?

OpenAI's Contrastive Language-Image Pretraining (CLIP) is a World Scope 3 model. It can comprehend concepts in both text and images and even connect concepts between the two modalities. In this video we will learn about multi-modality, how CLIP works, and how to use CLIP for different use cases like encoding, classification, and object detection.
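
To make the "connect concepts between the two modalities" idea concrete, here is a minimal sketch of the scoring step used for zero-shot classification. It does not load CLIP itself. The three-dimensional embeddings, the label prompts, and the `scale` value are all made-up stand-ins for what the real text and image encoders would produce. It also shows that once vectors are L2-normalized, dot-product similarity and cosine similarity give the same number.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_classify(image_emb, text_embs, labels, scale=100.0):
    """Pick the label whose text embedding is closest to the image embedding.

    `scale` is an illustrative temperature applied before the softmax;
    the value here is an assumption, not taken from the video.
    """
    image_unit = l2_normalize(image_emb)
    sims = [dot(image_unit, l2_normalize(t)) for t in text_embs]
    probs = softmax([scale * s for s in sims])
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs

# Hypothetical toy embeddings standing in for real encoder outputs.
labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
text_embs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
image_emb = [1.8, 0.4, 0.1]

prediction, probs = zero_shot_classify(image_emb, text_embs, labels)
print(prediction)

# On unit-length vectors the dot product equals the cosine similarity.
same = dot(l2_normalize(image_emb), l2_normalize(text_embs[0]))
print(abs(same - cosine_similarity(image_emb, text_embs[0])) < 1e-12)
```

With a real model, the toy lists would be replaced by the outputs of the text and image encoders. The comparison and softmax steps stay the same.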



Comments

Thank you so much for this great walkthrough! Looking forward to more

konichiwatanabi

Thanks for reporting, explaining and lastly opening up recent ML!

I found CLIP to be very interesting since I always frowned at the lost potential of two different embeddings being arbitrarily and methodically separate. This is huge!

ricardojung

This was really excellent - some of the pieces are starting to make sense

mszak

Great video! Looking forward to your next video diving more into using CLIP for zero-shot classification!

DallanQuass

This is amazing James. Thanks for the detailed explanation. I am excited for the future CLIP videos 🙂.

ismailashraq

Thank you for the good explanation. If we have two different embeddings, like text and 3D images, can we use CLIP to predict images?

AdeleHaghighatHoseiniA

Great video. I think you may be plotting the same graph twice though (cos sim). In practice it is almost the same though it would seem.

justinmiller

Really liked the content...thanks for sharing

debashisghosh

Nice video and explanation! I think on min 28:45 you plotted cos_sim instead of dot_sim!

adrianarroyo

Is there a hosted API for CLIP where you can provide your image data and get the vectors instead of having to host it yourself, kind of like how you give an input to `ada-002`?

abdirahmann

Thanks. It is very informative. Can you please explain and teach us how to do fine-tuning on a custom dataset?

txblsfq

Thanks James, very good video about CLIP. Funny thing is that you display cos_sim twice, so the second time it is not dot_sim that is displayed. And you struggled to find any difference between the two similarity matrices. LOL 🤣

Great video, really! I have just one thing to say: you should leave the images on screen longer. I had to pause the video multiple times to be able to understand them.

Gabriel-eyky

Excellent explanation! We can build a YouTube video search engine powered by clip, perhaps you can iterate on the Nlp YouTube search video you did?

mvrdara

Excellent content! As a suggestion, can you please keep the images/diagrams a bit longer? They move pretty fast in the video, which means I'll have to rewind the video every now and then.

behnamplays

10:23 I believe CLIP is an abbreviation of Contrastive Language Image Pretraining

anantzen

Please post Deep Reinforcement Learning tutorials and projects with Python!

pyalgoGPT

Transitions are too flashy and triggering to my eyes. Good explainer however.

mackenzieclarkson

