Open-Source Vision AI - SURPRISING Results! (Phi3 Vision vs LLaMA 3 Vision vs GPT4o)

Phi3 Vision, LLaMA 3 Vision, and GPT4o Vision are all put to the test!

Join My Newsletter for Regular AI Updates 👇🏼

Need AI Consulting? 📈

My Links 🔗

Media/Sponsorship Inquiries ✅

Disclosures:
I'm an investor in LM Studio
Comments

For future tests:
1 - Ask an unrelated question about an image - [Image of a car] Tell me what's wrong with my bicycle
2 - Gradually zoom out on a big chunk of text in an image to see how many words the model can read
3 - A dense detection task: describe each element of the image in JSON format with a predefined structure (see the sketch after this list)
4 - If possible, multiple frames from a video to get a glimpse of action understanding
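
A minimal sketch of what test idea 3 could look like with the OpenAI Python client against GPT-4o; the schema, image filename, and prompt wording here are all invented for illustration:

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical target structure for the dense-detection test.
schema = """{
  "objects": [
    {"label": "string", "color": "string", "position": "left|center|right"}
  ]
}"""

with open("scene.jpg", "rb") as f:  # placeholder image
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Describe every element of this image as JSON matching "
                     f"exactly this structure:\n{schema}"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```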

citaman

On the question of the size of the Photos app, GPT noted that 133 GB is larger than the max size of your phone's storage and thus indicates that it's possibly using cloud storage and isn't the actual amount used by Photos on your phone. That was a really perceptive answer, so bonus points to GPT for that 😊, and perhaps that discrepancy is why the other AIs seemed to be ignoring the Photos app.

philipashane

For future vision tests consider things like these (a scripting sketch follows the list):
1) Finding Objects - Where is Waldo in this picture?
2) Counting Objects - How many bicycles are there in this picture?
3) Identifying Abnormal Objects - How many eggs in this box are broken?
4) Identifying Partially Obscured Objects - Imagine a hand holding cards - What cards are in this poker hand?
5) Identifying Misplaced Objects - Which of these dishes is upside down?
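
For anyone who wants to script tests like these against a local model, a rough sketch using the Ollama Python client; the model name and image filenames are placeholders, and any vision-capable model pulled into Ollama would work the same way:

```python
import ollama

# Placeholder image/question pairs mirroring the test ideas above.
tests = [
    ("waldo.jpg", "Where is Waldo in this picture?"),
    ("street.jpg", "How many bicycles are there in this picture?"),
    ("eggs.jpg", "How many eggs in this box are broken?"),
]

for image_path, question in tests:
    reply = ollama.chat(
        model="llava",  # placeholder: any local vision model
        messages=[{"role": "user", "content": question, "images": [image_path]}],
    )
    print(image_path, "->", reply["message"]["content"])
```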

murraymacdonald

For the captcha, GPT-4o is clearly the winner. It understands what you mean given the context and doesn't just repeat all the letters it sees in the image.

bertobertoberto

8:40 is up for interpretation.
"Photos" isn't really a standalone "app" per se, and it's not the app itself that is taking up the space; it's the individual JPEG photos, which would take up the same amount of space even if you somehow didn't have the "Photos" app installed anymore.
If a person asked ME that same question, I'd also answer WhatsApp, since that's something you can tangibly uninstall.
If they asked "what is taking up the most space?" the correct answer is "your photos". But if the question is "what APP is taking up the most space?", it's WhatsApp.

Baleur

I think if you just consider output quality, GPT-4o is the best.
But if you also take speed and the fact that Phi-3 Vision is local and open-source into account, Phi-3 Vision is the most impressive one.

fabiankliebhan

Pro-tip: Try uploading a photograph you've taken or a work of art to GPT-4o and ask it to behave like an art critic (works great vanilla, but even better with custom instructions).

GPT-4o's ability to dissect the minutiae of photography is absolutely wild... even to the point of giving suggestions for improvement.

I wonder how long it will be until photographers realize what kind of tool they have available here. I just get a kick out of posting photographs and art and asking for critiques and ratings. It's so, so good.

CuratedCountenance

This is really awesome quality content mate, really love your work 😊🚀👍

iamachs

Nice video. These kinds of videos I immediately add to my playlist forever.

yotubecreators

Your channel is great, actually useful information, cheers

superjaykramer

I love your work. I never miss an episode. I love how you test the LLMs.

truepilgrimm

The LLaMA 3 Vision model was probably fine-tuned for providing verbose descriptions of images. There are other fine-tuned models that focus on OCR or image labelling.

MrKrzysiek

Photos and Apps seem to be distinguished in the storage section of the iPhone, so if you ask about the largest apps, the LLM ignores the photos...

donmiguel

Great channel, love your work. FYI - to read a QR code, the QR code must have white space around the outside. This allows the model to pick up on the 3 big position-marker squares.
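
The same quiet-zone requirement applies to classical decoders, which makes it easy to sanity-check; a quick OpenCV illustration (the filename and padding width are arbitrary):

```python
import cv2

# A QR code cropped too tightly often fails to decode; padding it with a
# white border (the "quiet zone") restores the position-marker contrast.
img = cv2.imread("qr_tight_crop.png")

padded = cv2.copyMakeBorder(img, 40, 40, 40, 40,
                            cv2.BORDER_CONSTANT, value=(255, 255, 255))

detector = cv2.QRCodeDetector()
text, points, _ = detector.detectAndDecode(padded)
print(text if text else "decode failed")
```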

topmandan

Awesome video! I was wondering how Phi-3-Vision fares compared to other vision-capable LLMs. I watched your video while I was working on my own Phi-3-Vision tests using Web UI screenshots (my hope was that it could be used for automated Web UI testing). However, Phi-3 turned out to be horrible at Web UI testing (you can see the video of my tests on my YouTube channel, if you are interested). It's nice to see that it fares much better with normal photos! Thanks for making this video - it saved me some time on testing it myself :)

tomtom_videos

I have tried dropping a simple software architecture diagram on them and asking them to extract the entities, hierarchy, and connections into a JSON file, which usually works quite well.
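
Roughly what that prompt looks like with the OpenAI Python client; the filename and JSON key names are illustrative, and JSON mode keeps the reply machine-parseable:

```python
import base64
from openai import OpenAI

client = OpenAI()

with open("architecture.png", "rb") as f:  # placeholder diagram
    b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_object"},  # force a strict JSON reply
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Extract the entities, their hierarchy, and the "
                     "connections in this architecture diagram as JSON with "
                     "keys 'entities', 'hierarchy', and 'connections'."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```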

MarkDurbin

I have to wonder, in the case of the AI messing up the Photos app taking the most space, whether it recognises that your photos are separate from the actual app.

tepafray

GPT-4o doing the analyzing on the CSV prompt wasn't calling up Python to look at the image, but actually using Python to generate a CSV output of the image, since you asked it to turn the image data into CSV.
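
For context, the script the code interpreter writes for a request like that is tiny; a made-up example, with placeholder rows standing in for whatever values the model read off the screenshot:

```python
import csv

# Hypothetical rows the model might have extracted from the storage screenshot.
rows = [
    ["app", "size_gb"],
    ["WhatsApp", 12.4],
    ["Photos", 133.0],
]

with open("output.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```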

AlexLuthore

LM Studio is infamously bad for vision.

In order to get it to work you have to follow these rules:

1. Start a new chat for each photo question.
2. Reboot LM Studio for every photo question.

It's tedious, but otherwise it can start hallucinating after the initial question.
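
Rule 1 can be approximated programmatically: LM Studio serves an OpenAI-compatible API (default http://localhost:1234/v1), so you can build a fresh single-turn chat per image instead of reusing one conversation. A sketch, with the model name and filename as placeholders (rule 2, the reboot, still has to be done by hand):

```python
import base64
from openai import OpenAI

# LM Studio's local OpenAI-compatible endpoint; the api_key is not checked.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

def ask_about_image(path: str, question: str) -> str:
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="local-vision-model",  # placeholder for the loaded model
        messages=[{  # a fresh, single-turn chat for every photo question
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

print(ask_about_image("photo1.jpg", "What is in this image?"))
```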

EvanLefavor

Good stuff, thanks. Results are very prompt-dependent.

ronbridegroom