Pixtral is REALLY Good - Open-Source Vision Model

Let's test Pixtral, the newest vision model from MistralAI.

Join My Newsletter for Regular AI Updates 👇🏼

My Links 🔗

Media/Sponsorship Inquiries ✅
Comments

We need AI doctors for everyone on earth

LoveLifePD

Show it a tech sheet for a simple device, like a dryer, and ask it what it is. Ask it to outline the circuit for the heater. Give it a symptom, like "the dryer will not start," then ask it to reason out the step-by-step troubleshooting procedure using the wiring diagram and a multimeter with live voltage.

YOGiiZA

I'd be really interested to see more tests of how well it handles positionality, since vision models have tended to struggle with that. As I understand it, that's one of the biggest barriers to having models operate UIs for us.

justtiredthings

"Mistral" is (English/American-ised) pronounced with an "el" sound. Pixtral would be similar. So "Pic-strel" would be appropriate. However the French pronunciation is with an "all" sound. Since mistral is a French word for a cold wind that blows across France, I would go with that for correctness. It's actually more like "me-strall", so in this case "pic-strall" should be correct.

At any rate, I look forward to a mixture of agents/experts scenario where pixtral gets mixed in with other low/mid weight models for fast responses.

Feynt

Matt, you made a point about decent smaller models being used for specialized tasks. That reminds me of agents, each with its own specialized model for its task and a facilitator that delegates work to them. I think most of us want to see smaller and smaller open-source models getting better and better on benchmarks.

Transforming-AI

I'd love to see some examples of problems where two or more models are used together. Maybe Pixtral describes a chess board, then an economical model like Llama translates that into standard chess notation, and then o1 does the deep thinking to come up with the next move. (I know o1 probably doesn't need the help from Llama in this scenario, but maybe doing it this way would be less expensive than having o1 do all the work.)
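
A rough sketch of that three-stage hand-off, assuming the Mistral and OpenAI Python clients; the model names, file path, and prompts are placeholders, and gpt-4o-mini stands in here for the comment's "economical model":

import base64
from mistralai import Mistral
from openai import OpenAI

mistral = Mistral(api_key="MISTRAL_API_KEY")
oai = OpenAI(api_key="OPENAI_API_KEY")

def b64(path):
    # Read a local image and base64-encode it for the data URL.
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

# Stage 1: Pixtral describes the position from a photo of the board.
description = mistral.chat.complete(
    model="pixtral-12b-2409",
    messages=[{"role": "user", "content": [
        {"type": "text", "text": "List every piece and its square on this chess board."},
        {"type": "image_url", "image_url": f"data:image/jpeg;base64,{b64('board.jpg')}"},
    ]}],
).choices[0].message.content

# Stage 2: a cheap text model normalises the description into FEN.
fen = oai.chat.completions.create(
    model="gpt-4o-mini",  # stand-in for the "economical model"
    messages=[{"role": "user", "content": f"Convert to FEN. Output the FEN only.\n{description}"}],
).choices[0].message.content

# Stage 3: the reasoning model proposes the next move.
move = oai.chat.completions.create(
    model="o1-preview",
    messages=[{"role": "user", "content": f"Best move for the side to play in {fen}? Answer in SAN."}],
).choices[0].message.content
print(move)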

ChrisAdaline

I think you should try giving it a photo of the word "Strawberry" and then asking it how many letter r's are in the word.
Maybe vision is all we needed to solve the disconnect caused by tokenization?
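
A minimal sketch of that test, assuming Pillow to render the word and the Mistral Python client for the Pixtral call; the model name and prompt are placeholders:

import base64, io
from PIL import Image, ImageDraw
from mistralai import Mistral

# Render the word as an image so the model has to see the letters,
# not receive them as tokens.
img = Image.new("RGB", (400, 100), "white")
ImageDraw.Draw(img).text((20, 40), "Strawberry", fill="black")

buf = io.BytesIO()
img.save(buf, format="PNG")
data_url = "data:image/png;base64," + base64.b64encode(buf.getvalue()).decode()

client = Mistral(api_key="MISTRAL_API_KEY")
reply = client.chat.complete(
    model="pixtral-12b-2409",
    messages=[{"role": "user", "content": [
        {"type": "text", "text": "How many letter r's are in the word shown in this image?"},
        {"type": "image_url", "image_url": data_url},
    ]}],
)
print(reply.choices[0].message.content)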

sleepingbag

Where is the GPT-4o live screen-share option?

DK.CodeVenom

Great video, Matthew! Just a suggestion for testing vision models based on what we do internally. We feed images from the James Webb telescope into the model and ask it to identify what we can already see. One thing to keep in mind is that if something's tough for you to spot, the AI will likely struggle too. Vision models are great at "seeing outside the box," but sometimes miss what's right in front of them. Hope that makes sense!
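
One way to make that kind of check repeatable: loop images with known contents past the model and flag whether each expected feature is mentioned. A sketch assuming the Mistral Python client; the filenames, expected labels, and pass criterion are made up:

import base64, pathlib
from mistralai import Mistral

client = Mistral(api_key="MISTRAL_API_KEY")

# Images paired with a feature a human can already see in each one.
expected = {"carina_nebula.png": "nebula", "smacs_0723.png": "galaxy"}

for name, feature in expected.items():
    data = base64.b64encode(pathlib.Path(name).read_bytes()).decode()
    answer = client.chat.complete(
        model="pixtral-12b-2409",
        messages=[{"role": "user", "content": [
            {"type": "text", "text": "What astronomical objects do you see in this image?"},
            {"type": "image_url", "image_url": f"data:image/png;base64,{data}"},
        ]}],
    ).choices[0].message.content
    # Crude pass criterion: the expected feature is mentioned by name.
    print(name, "PASS" if feature in answer.lower() else "FAIL")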

idontexist-satoshi

When you next test vision models, you should try giving them architectural floor plans to describe, and also asking them to correlate different drawings of the same building, like a perspective rendering or photo vs. a floor plan, which requires a lot of visual understanding. I did that with Claude 3.5 and it was extremely impressive.

hypertectonics

I foresee a time when the AI makes captchas for humans to keep us from meddling in important things:

"Oh, you want to look at the code used to calculate your state governance protocols? Sure, solve this quantum equation in under 3 seconds!"

jeffsmith

Nonchalantly says "Captcha is done." That was good.

opita

For the Bill Gates one, you put in an image with "bill gates" in the filename! Doesn't that give the model a huge hint about the content of the photo?
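
A quick way to rule that out on a re-run: copy the image to a neutral filename before uploading it (the paths below are hypothetical):

import shutil

# Copy to a neutral name so the filename can't leak the answer.
shutil.copy("bill gates.jpg", "image_001.jpg")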

timtim

A picture of a spreadsheet with questions about it would be gold for real use cases.

josephflowers

You should add an OCR test on handwritten text when testing the image models.

nginnnginn

The big question for me is when Pixtral will be available on Ollama, which is my interface of choice... If it works on Ollama, it opens up a world of possibilities.
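
For reference, the ollama Python package already drives vision models such as llava this way, so a Pixtral call would presumably look the same; the "pixtral" model tag below is hypothetical until the model actually lands in the Ollama library:

import ollama

# Ask a local vision model about an image file on disk.
response = ollama.chat(
    model="pixtral",  # hypothetical tag; "llava" works today
    messages=[{
        "role": "user",
        "content": "What is in this image?",
        "images": ["./photo.jpg"],  # local file path; the client handles encoding
    }],
)
print(response["message"]["content"])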

whitneydesignlabs

Do you think an AGI would basically be these specialised use-case LLMs working as agents for a master LLM?

JustaSprigofMint

I just signed up with Vultr and was wondering if you were going to do any videos on this? Does anyone know of training for this? I want to run Llama on it.

lordjamescbeeson

I am trying LM Studio, but the model that is available is text-only. Is there a way to get the vision model loaded into LM Studio?
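
One possible route, assuming a vision-capable model can be loaded at all: LM Studio exposes an OpenAI-compatible local server (port 1234 by default), so the openai Python client can talk to it; the model name below is a placeholder for whatever model LM Studio has loaded:

import base64
from openai import OpenAI

# LM Studio's local server speaks the OpenAI API; the api_key value
# is ignored but required by the client.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

with open("photo.jpg", "rb") as f:
    data_url = "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

reply = client.chat.completions.create(
    model="local-vision-model",  # placeholder for the loaded model
    messages=[{"role": "user", "content": [
        {"type": "text", "text": "Describe this image."},
        {"type": "image_url", "image_url": {"url": data_url}},
    ]}],
)
print(reply.choices[0].message.content)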

Pauluz_The_Web_Gnome

7:50 My iPhone could not read that QR code.

JoelSapp