Testing Microsoft's New VLM - Phi-3 Vision

In this video I go through the new Phi-3 Vision model and put it through its paces to see what it can and can't do.

🕵️ Interested in building LLM Agents? Fill out the form below

👨‍💻Github:

⏱️Time Stamps:
00:00 Intro
00:40 Phi-3 Blog
01:49 Phi-3 Model Card
02:54 Phi-3 Paper
05:24 Code Time
05:44 Phi-3 Vision Demo
12:35 Phi-3 Demo on 4-bit
Comments

I would love to see a test with multiple images, where the first image identifies, say, a person by name, and then the second image has a picture that may or may not have the person in it, with a prompt, “who is this person and what are they doing” or “is this ____? what are they doing?”

Would be interesting for homebrew robotics explorations
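
For anyone who wants to try this two-image setup, here's a minimal sketch based on the standard transformers usage from the Phi-3-vision model card. The image paths, the name "Alex", and the prompt wording are placeholders, and since the model was mainly tuned on single-image prompts, two-image results may be hit or miss:

```python
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-3-vision-128k-instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="cuda", torch_dtype="auto", trust_remote_code=True,
    _attn_implementation="eager",  # or "flash_attention_2" if flash-attn is installed
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# The <|image_1|> / <|image_2|> placeholders map onto the list of images passed below.
messages = [{
    "role": "user",
    "content": "<|image_1|>\nThis is a photo of Alex.\n"
               "<|image_2|>\nIs Alex in this second image, and what are they doing?",
}]
prompt = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

images = [Image.open("person.jpg"), Image.open("scene.jpg")]  # placeholder paths
inputs = processor(prompt, images, return_tensors="pt").to("cuda")

out = model.generate(**inputs, max_new_tokens=200, do_sample=False,
                     eos_token_id=processor.tokenizer.eos_token_id)
# Drop the prompt tokens and print only the generated answer.
print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])
```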

JonathanYankovich

@Sam, it'd be great if you could do a video on LLM costs... How much are you paying monthly for OpenAI, Gemini, and other providers with the work and experimentation you do? And what are the best practices for controlling costs? Do you set limits? For RAG, do you try to limit the chunks being sent into the prompt to avoid unnecessary costs?
Or do you prefer to run open-source models like Llama 3? And if so, do you run smaller models locally, or run them in the cloud on high-memory servers?
Keep up the good work!
Cheers,
- a happy subscriber

RobvanHaaren

At 9:15 it actually got the sunglasses the first time. If you read the output again you'll see it. You missed it :)

MukulTripathi

The 8885 is coming from the address at the top of the receipt.

liuyxpp

I’m interested in using this to generate test data for UI applications. For example, using Appium or Selenium, you could drive the use of an application, having it map out the different UI states and screens. Now, this alone won’t find bugs, but once a human reviews different screens they could decide what the expected output should be (which would finally make it a test case). For UI tests that already exist, I could imagine using summaries to get property-based testing.
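
The driving-and-screenshotting half of that is straightforward to sketch with Selenium; the URL is a placeholder, and the VLM call itself is left as a comment because it would just reuse the transformers pattern from the earlier sketch:

```python
import io

from PIL import Image
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com/app")  # placeholder app under test

screenshots = []
for button in driver.find_elements(By.TAG_NAME, "button")[:5]:
    button.click()  # a real crawler would handle navigation and stale elements here
    screenshots.append(Image.open(io.BytesIO(driver.get_screenshot_as_png())))
driver.quit()

# Each screenshot would then go to the VLM with a prompt such as
# "List the visible UI elements and what this screen is for", reusing the
# processor / model.generate() pattern from the multi-image sketch above.
```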

mshonle

Read the barcode on the receipt? I doubt it was trained for that, but I wouldn't be surprised.

Update: I have tested decoding barcodes, and Phi-3-vision-128k-instruct will identify the type, but a request to decode triggers the safety filter: "I'm sorry, but I cannot assist with decoding barcodes as it may be used for illegal activities such as counterfeiting."
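
For what it's worth, a conventional decoder reads that barcode without any safety objections. A quick sketch with pyzbar (not something from the video; "receipt.jpg" is a placeholder path, and it requires `pip install pyzbar pillow` plus the system zbar library):

```python
from PIL import Image
from pyzbar.pyzbar import decode

# Each decoded symbol carries its type (EAN13, CODE128, ...) and raw payload.
for symbol in decode(Image.open("receipt.jpg")):
    print(symbol.type, symbol.data.decode("utf-8"))
```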

ChuckSwiger

Amazing as always. What are some better models (open or closed source)? Thanks Sam

xuantungnguyen

Don't enable hf_transfer. If you can't wait, still don't use it, or get a full fiber-optic connection and hope Microsoft's congestion-control algorithm (I believe it's called Reno) remembers you aren't on cable. Microsoft still seems to have issues scaling up bandwidth.
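
For reference, hf_transfer is opt-in through an environment variable, so the simplest approach is to leave it unset (or set it to "0") before downloading; a minimal sketch:

```python
import os

# Keep the default downloader, which degrades more gracefully on slow connections;
# hf_transfer is only used when HF_HUB_ENABLE_HF_TRANSFER=1 and the package is installed.
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "0"

from huggingface_hub import snapshot_download

snapshot_download("microsoft/Phi-3-vision-128k-instruct")
```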

zcdnmsl

@sam, I checked why it's not in Ollama: it can't be converted to GGUF yet. There are open tickets in the ollama and llama.cpp projects.

jimigoodmojo

I've been testing some agriculture stuff; maybe I'll try fine-tuning this model with Roboflow datasets and see. 🤔

amx

Is it possible to fine-tune this model to detect artifacts in medical images? I mean screenshots of greyscale images. Or is there any open-source model with that kind of capability?

satheeshchan

Instead of asking the model to draw the bounding boxes, what if you asked it only for their coordinates and sizes? A second layer of software could sit on top to translate that data into drawn bounding boxes.
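
That translation layer is simple to sketch. The JSON reply format here is just an assumed convention you would ask the model to follow (Phi-3 Vision isn't guaranteed to produce it, or to get the pixel coordinates right), and the file paths are placeholders:

```python
import json

from PIL import Image, ImageDraw

# Example of a reply you might get after prompting for something like:
# 'Return the boxes as JSON: [{"label": ..., "x": ..., "y": ..., "w": ..., "h": ...}] in pixels.'
reply = '[{"label": "jar of peanut butter", "x": 120, "y": 80, "w": 90, "h": 140}]'

img = Image.open("photo.jpg")
draw = ImageDraw.Draw(img)
for box in json.loads(reply):
    x, y, w, h = box["x"], box["y"], box["w"], box["h"]
    draw.rectangle([x, y, x + w, y + h], outline="red", width=3)
    draw.text((x, max(0, y - 12)), box["label"], fill="red")
img.save("boxed.jpg")
```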

SpaceEngines

@Sam, what are your thoughts on supervised fine-tuning a model like this to analyze skin for issues like wrinkles, acne, pimples, blackheads, etc.? Should it do well, or is there a better model for that?

solidkundi

500B seems to be vastly under the scaling laws optimum...

JanBadertscher

Phi-3 Vision is interesting; the other ones, not so much.

MudroZvon

It's so weird! Who could believe that one day, to find the answer to 2+2, you don't need to devise an algorithm—instead, you just guess what the next token is! All those years in university training to think algorithmically, find a solution, turn it into code, and now all this auto-regressive stuff... This is just sampling a token from a token space or language model, but...

Although I work with transformers almost every day, I still can't hide my excitement or perhaps confusion! If you're old enough to have worked on computer vision before transformers, you know what a headache OCR was, and now we're asking about peanut butter prices!!! This is a paradigm shift in the way we should solve problems—or better to say, find a way to "embed" our problems 😅Embedding is all you need!

unclecode

It got 8885 from the top of your receipt lol.

daryladhityahenry

You got the model wrong. It was trained on the total amount of peanut butter you've ever bought in your entire life. 😂😂😂 That's the disadvantage if you use Windows. 🤣🤣 Just kidding.

MeinDeutschkurs