How AI 'Understands' Images (CLIP) - Computerphile

With the explosion of AI image generators, AI images are everywhere, but how do they 'know' how to turn text strings into plausible images? Dr Mike Pound expands on his explanation of Diffusion models.

This video was filmed and edited by Sean Riley.

Comments

As people have correctly noted: When I talk about the way we train at 9:50, I should say we maximise the similarity on the diagonal, not the distance :) Brain failed me!

michaelpound
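
For anyone who wants the correction in code: the objective being described is CLIP's symmetric contrastive loss. Below is a minimal PyTorch sketch; the function name, shapes, and fixed temperature value are illustrative (the real CLIP learns its temperature), not the actual training code.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of N matched image/text pairs.

    image_embs, text_embs: (N, D) outputs of the image and text encoders.
    Entry (i, j) of the similarity matrix compares image i with caption j,
    so the matched pairs sit on the diagonal.
    """
    # L2-normalise so dot products are cosine similarities
    image_embs = F.normalize(image_embs, dim=-1)
    text_embs = F.normalize(text_embs, dim=-1)

    # (N, N) similarity matrix, sharpened by the temperature
    logits = image_embs @ text_embs.t() / temperature

    # For row i the "correct class" is column i, so cross-entropy
    # maximises similarity on the diagonal and minimises it elsewhere
    targets = torch.arange(len(logits), device=logits.device)
    loss_images = F.cross_entropy(logits, targets)
    loss_texts = F.cross_entropy(logits.t(), targets)
    return (loss_images + loss_texts) / 2
```

Driving the diagonal entries up via the cross-entropy is exactly the "maximise the similarity on the diagonal" in the correction above.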

Thank you for "if you want to unlock your face with a phone"... I needed that in my life.

adfaklsdjf

This guy is one of the best teachers I have ever seen.

pyajudeme

Legend has it that the series will only end when the last sheet of continuous printing paper has been written on.

orange-vlcybpd

Dr. Pound's videos are on another level! He explains things with a passion and clarity rarely found on the web! Cheers

edoardogribaldo

I'm a simple guy. I see a Mike Pound video, I click.

uneasy_steps

That cat got progressively more turtle-like with each drawing.

chloupichloupa

Thanks for taking us to Pound town. Great explanation!

aprilmeowmeow

The man, the myth, the legend, Dr. Pound. The best lecturer on Computerphile.

bluekeybo

These guys are so watchable, and somehow they make an inherently inaccessible subject interesting and easy to follow.

skf

The biggest problem with unlocking a face with your phone is that you'll laugh too hard to hear the video for a minute or so.

beardmonster

"There's a lot of stuff on the internet, not all of it good, I should add" - Dr Mike Pound, 2024

eholloway

9:45 - wasn't it supposed to be "minimize the distance on the diagonal, maximize it elsewhere"?

MichalKottman

6:09
"There isn't red cats"
Mike is hilarious and a great teacher lol

rigbyb

So, if we want to break AI, we just have to pollute the internet with a couple billion pictures of red cats with the caption “blue dog”.

lucianoag

Excellent. Could listen to him all day and even understand stuff.

Shabazza

A very important bit that was skipped over is how you get an LLM to talk about an image (a multimodal LLM)!
After you get your embedding from the vision encoder, you train a simple projection layer that aligns the image embedding with the semantic space of the LLM. The projection layer is trained so that the vision encoder's embedding produces the desired text output describing the image (and/or executing the instructions in the image+prompt).

You basically project the "thoughts" of the part that sees (the vision encoder) into the part that speaks (the massive LLM).

TheRealWarrior
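
A minimal sketch of the projection layer described above, assuming a PyTorch setup; the class name and dimensions are hypothetical (e.g. 768-d vision features mapped into a 4096-d LLM embedding space), and in this alignment stage only the projector is trained while the vision encoder and LLM stay frozen.

```python
import torch.nn as nn

class VisionProjector(nn.Module):
    """Maps frozen vision-encoder features into the LLM's token-embedding space.

    Hypothetical dimensions: 768-d patch features in, 4096-d "soft tokens" out.
    """
    def __init__(self, vision_dim=768, llm_dim=4096):
        super().__init__()
        # A small MLP is usually enough; this is the only trainable part
        # while the vision encoder and the LLM are kept frozen.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features):    # (batch, num_patches, vision_dim)
        return self.proj(patch_features)  # (batch, num_patches, llm_dim)
```

The projected features are prepended to the prompt's token embeddings, so the LLM reads the image as a short sequence of soft tokens.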

I love these encoder models, and I have seen these methods implemented in practice, usually as part of a recommender system handling unstructured free-text queries. Embeddings are so cool.

wouldntyaliktono

Love seeing AI problems explained on fanfold paper. Classy!

musikdoktor

Computerphile and Dr. Pound ♥️✨ I've been learning AI myself these past few months so this is just wonderful. Thanks a ton! :)

codegallant