Open-Source Vision AI - SURPRISING Results! (Phi3 Vision vs LLaMA 3 Vision vs GPT4o)

Phi3 Vision, LLaMA 3 Vision, and GPT4o Vision are all put to the test!

Join My Newsletter for Regular AI Updates 👇🏼

Need AI Consulting? 📈

My Links 🔗

Media/Sponsorship Inquiries ✅

Disclosures:
I'm an investor in LM Studio
Comments

For future tests:
1 - Ask an unrelated question about an image - [Image of a car] Tell me what's wrong with my bicycle
2 - Gradually zoom out on a big chunk of text in an image to see how many words the model can read
3 - A dense detection task: describe each element of the image in JSON format with a predefined structure (see the sketch after this list)
4 - If possible, multiple frames from a video to get a glimpse of action understanding
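
A minimal sketch of what test idea 3 could look like with the OpenAI Python client against GPT-4o; the schema, image filename, and prompt wording here are all invented for illustration:

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical target structure for the dense-detection test.
schema = """{
  "objects": [
    {"label": "string", "color": "string", "position": "left|center|right"}
  ]
}"""

with open("scene.jpg", "rb") as f:  # placeholder image
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Describe every element of this image as JSON matching "
                     f"exactly this structure:\n{schema}"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```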

citaman

On the question of the size of the Photos app, GPT noted that 133 GB is larger than the max size of your phone's storage and thus indicates that it's possibly using cloud storage and isn't the actual amount used by Photos on your phone. That was a really perceptive answer, so bonus points to GPT for that 😊, and perhaps that discrepancy is why the other AIs seemed to be ignoring the Photos app.

philipashane

For future vision tests consider things like these (a scripting sketch follows the list):
1) Finding Objects - Where is Waldo in this picture?
2) Counting Objects - How many bicycles are there in this picture?
3) Identifying Abnormal Objects - How many eggs in this box are broken?
4) Identifying Partially Obscured Objects - Imagine a hand holding cards - What cards are in this poker hand?
5) Identifying Misplaced Objects - Which of these dishes is upside down?
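
For anyone who wants to script tests like these against a local model, a rough sketch using the Ollama Python client; the model name and image filenames are placeholders, and any vision-capable model pulled into Ollama would work the same way:

```python
import ollama

# Placeholder image/question pairs mirroring the test ideas above.
tests = [
    ("waldo.jpg", "Where is Waldo in this picture?"),
    ("street.jpg", "How many bicycles are there in this picture?"),
    ("eggs.jpg", "How many eggs in this box are broken?"),
]

for image_path, question in tests:
    reply = ollama.chat(
        model="llava",  # placeholder: any local vision model
        messages=[{"role": "user", "content": question, "images": [image_path]}],
    )
    print(image_path, "->", reply["message"]["content"])
```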

murraymacdonald

For the captcha, GPT-4o is clearly the winner. It understands what you mean given the context and doesn't just repeat all the letters it sees in the image.

bertobertoberto

8:40 is up for interpretation.
"Photos" isn't really a standalone "app" per se, and it's not the app itself that is taking up the space; it's the individual JPEG photos, which would take up the same amount of space even if you somehow didn't have the "Photos" app installed anymore.
If a person asked ME that same question, I'd also answer WhatsApp, since that's something you can tangibly uninstall.
If they asked "what is taking up the most space?" the correct answer is "your photos". But if the question is "what APP is taking up the most space?", it's WhatsApp.

Baleur

I think if you just consider output quality, GPT-4o is the best.
But if you also take speed and the fact that Phi-3 Vision is local and open-source into account, Phi-3 Vision is the most impressive one.

fabiankliebhan

Pro-tip: Try uploading a photograph you've taken or a work of art to GPT-4o and ask it to behave like an art critic (works great vanilla, but even better with custom instructions).

GPT-4o's ability to dissect the minutiae of photography is absolutely wild... even to the point of giving suggestions for improvement.

I wonder how long it will be until photographers realize what kind of tool they have available here. I just get a kick out of posting photographs and art and asking for critiques and ratings. It's so, so good.

CuratedCountenance

This is really awesome quality content mate, really love your work 😊🚀👍

iamachs

Nice video. These kinds of videos I immediately add to my playlist forever.

yotubecreators

Your channel is great, actually useful information, cheers

superjaykramer

I love your work. I never miss an episode. I love how you test the LLMs.

truepilgrimm

The LLaMA 3 Vision model was probably fine-tuned for providing verbose descriptions of images. There are other fine-tuned models that focus on OCR or image labelling.

MrKrzysiek

Photos and Apps seem to be distinguished in the storage section of the iPhone, so if you ask about the largest apps, the LLM ignores the photos...

donmiguel

Great channel, love your work. FYI - to read a QR code, the QR code must have white space around the outside. This allows the model to pick up on the 3 big position-marker squares.
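
The same quiet-zone requirement applies to classical decoders, which makes it easy to sanity-check; a quick OpenCV illustration (the filename and padding width are arbitrary):

```python
import cv2

# A QR code cropped too tightly often fails to decode; padding it with a
# white border (the "quiet zone") restores the position-marker contrast.
img = cv2.imread("qr_tight_crop.png")

padded = cv2.copyMakeBorder(img, 40, 40, 40, 40,
                            cv2.BORDER_CONSTANT, value=(255, 255, 255))

detector = cv2.QRCodeDetector()
text, points, _ = detector.detectAndDecode(padded)
print(text if text else "decode failed")
```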

topmandan

Awesome video! I was wondering how Phi-3-Vision fares compared to other vision-capable LLMs. I watched your video while I was working on my own Phi-3-Vision tests using Web UI screenshots (my hope was that it could be used for automated Web UI testing). However, Phi-3 turned out to be horrible at Web UI testing (you can see the video of my tests on my YouTube channel, if you are interested). It's nice to see that it fares much better with normal photos! Thanks for making this video - it saved me some time on testing it myself :)

tomtom_videos

I have tried dropping a simple software architecture diagram on them and asking them to extract the entities, hierarchy, and connections into a JSON file, which usually works quite well.
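
Roughly what that prompt looks like with the OpenAI Python client; the filename and JSON key names are illustrative, and JSON mode keeps the reply machine-parseable:

```python
import base64
from openai import OpenAI

client = OpenAI()

with open("architecture.png", "rb") as f:  # placeholder diagram
    b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_object"},  # force a strict JSON reply
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Extract the entities, their hierarchy, and the "
                     "connections in this architecture diagram as JSON with "
                     "keys 'entities', 'hierarchy', and 'connections'."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```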

MarkDurbin

I have to wonder, in the case of the AI messing up the Photos app taking the most space, whether it recognises that your photos are separate from the actual app.

tepafray

GPT-4o doing the analyzing on the CSV prompt wasn't calling up Python to look at the image, but actually using Python to generate a CSV output of the image, since you asked it to turn the image data into CSV.
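
For context, the script the code interpreter writes for a request like that is tiny; a made-up example, with placeholder rows standing in for whatever values the model read off the screenshot:

```python
import csv

# Hypothetical rows the model might have extracted from the storage screenshot.
rows = [
    ["app", "size_gb"],
    ["WhatsApp", 12.4],
    ["Photos", 133.0],
]

with open("output.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```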

AlexLuthore

LM Studio is infamously bad for vision.

In order to get it to work you have to follow these rules:

1. Start a new chat for each photo question.
2. Reboot LM Studio for every photo question.

It's tedious, but otherwise it can start hallucinating after the initial question.
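
Rule 1 can be approximated programmatically: LM Studio serves an OpenAI-compatible API (default http://localhost:1234/v1), so you can build a fresh single-turn chat per image instead of reusing one conversation. A sketch, with the model name and filename as placeholders (rule 2, the reboot, still has to be done by hand):

```python
import base64
from openai import OpenAI

# LM Studio's local OpenAI-compatible endpoint; the api_key is not checked.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

def ask_about_image(path: str, question: str) -> str:
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="local-vision-model",  # placeholder for the loaded model
        messages=[{  # a fresh, single-turn chat for every photo question
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

print(ask_about_image("photo1.jpg", "What is in this image?"))
```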

EvanLefavor

Good stuff, thanks. Results are very prompt-dependent.

ronbridegroom