Mistral OCR - Multimodal & Multilingual OCR

In this video, I look at the latest release from Mistral AI, which is their Mistral OCR model. I look at how it works and how it compares to other models, as well as how you can get started using it with code.

For more tutorials on using LLMs and building agents, check out my Patreon

🕵️ Interested in building LLM Agents? Fill out the form below

👨‍💻Github:

⏱️Time Stamps:
00:00 Intro
00:17 Other models
00:35 Mistral OCR Blog
05:45 Mistral OCR Demo
13:47 Mistral OCR Batch inference
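As a rough companion to the "getting started with code" part of the video, here is a minimal sketch of preparing an image for an OCR request. It only builds the request body and does not call any service. The model name `mistral-ocr-latest`, the `image_url` document type, and the data-URI form are assumptions based on Mistral's public API documentation, not taken from the video; `build_ocr_request` is a hypothetical helper name.

```python
import base64


def build_ocr_request(image_bytes: bytes,
                      mime_type: str = "image/jpeg",
                      model: str = "mistral-ocr-latest") -> dict:
    """Build a JSON-serializable body for an OCR request (assumed shape)."""
    # Encode the raw image bytes as base64 text.
    encoded = base64.b64encode(image_bytes).decode("ascii")
    # Wrap the encoded image in a data URI so it can travel inside JSON.
    data_uri = f"data:{mime_type};base64,{encoded}"
    return {
        "model": model,  # assumed model identifier
        "document": {"type": "image_url", "image_url": data_uri},
    }


payload = build_ocr_request(b"fake image bytes", "image/png")
print(payload["document"]["type"])
```

The resulting dictionary would then be sent with an API key through whichever HTTP client or SDK you prefer; check the current API reference before relying on the field names.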
Comments

I follow you just because of your honest review without any false hypes. Love your content man!

chhabiacharya

Nice, it's something we can use at our company, and probably also for some personal use cases... Thanks for covering this. Our company processes a ton of legal Arabic docs and has been blocked for some time due to OCR quality issues.

akayx

Extra useful. I think OCR is an area that still doesn't have a clear winner, especially in more obscure languages. And I agree that it seems like a feasible strategy for smaller companies to develop nifty tools like this.

However, if I were to use it for my hypothetical company, I'd struggle with my security concerns. If I understand this correctly, you'll be sharing your vital data with one more AI company through their API. I'd probably use a local LLM for data analysis, which makes it really hard to concede to a Mistral API just for the sake of OCR ... unless that's my only option.

Dr.UldenWascht

clear, straight to the point, very cool video

brunosavoca

I have tried uploading manual bills to Mistral's Le Chat and ChatGPT, and the difference in performance was obvious. Le Chat was not able to extract complicated types of writing. I hope this new model will go further.

Bellevillezogataga

It's very good for well-structured PDFs and images, which we have at work. For another (large) batch of more unstructured, handwritten, and messier content, it's still better to do computer vision with Google OCR (with precise bboxes).

alchemication

Any testing with handwriting? A lot of OCR use cases end up processing documents that are a mix of print and someone "annotating" with pen afterward.

Aberger

Great vid, and good to see this tech improve. Now, how to get the OCR data into a multimodal vector embedding? That's the next missing piece for me. VoyageAI, maybe. The base64 output could be used in a multimodal embedding, maybe?

BrandonFoltz

To try it out you can upload an image through Le Chat, without going through the API (or having to set up payments).

So I have a scan of an old Polish book in which every line is missing half of the last word because the scan was cut off. I told it to try to guess what the last word was (something a native speaker is able to do with high accuracy). Unfortunately, it failed in most of those guess-the-word cases, but it did a pretty good job on the words that were fully visible, without any preprocessing of the scanned low-contrast text on my part. It made a couple of errors on the visible words, too, but no more than I've seen other OCR packages make in the past on random prose. A second pass, e.g. asking it to fix the incoherent words it produced just by proofreading the text, did not bring any improvement.

clray

This would be good for building out a GraphRAG-type system. Get this model to extract all the documents and then send the output to another cheap model that can start doing more processing before throwing everything into a graph database. Especially with it only being a dollar per thousand pages, it would be dirt cheap to have this working alongside something like Gemini.

pin
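The price quoted in the comment above (about one dollar per thousand pages) makes a back-of-the-envelope estimate easy. The sketch below is only illustrative arithmetic: the function name and the default rate are assumptions taken from that comment, not from an official price list.

```python
def estimate_ocr_cost(pages: int, usd_per_1000_pages: float = 1.0) -> float:
    """Estimate OCR cost in USD for a given page count."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    # Cost scales linearly with the number of pages.
    return pages * usd_per_1000_pages / 1000


# Example: a 250,000-page archive at the quoted rate.
print(estimate_ocr_cost(250_000))  # 250.0
```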

I wonder if Mistral OCR will return coordinates like Azure Document Intelligence does. This is important for highlighting the original text during human review in some applications.

wangbei

I just got Mistral OCR working, and... it has fallen over on my very first attempt. I used the cover of the Local Hero soundtrack CD as a simple test image (large Times New Roman on a plain background), and I got back "LOCAL HIERO" in response. Oh dear, first impressions are not good. I've been doing lots of OCR in the last fortnight and found Claude 3.5 and Claude 3.7 to be very good. Claude 3.7 in particular was 100% accurate on documents I gave it and could generate up to 12-page long Markdown documents in one go. Both Claude models have even been able to incorporate handwritten annotations on a typed document. Just mentioning if it helps others.

KohanIkin

Can you get accurate bounding boxes showing where each piece of text is found in the image? All the models I've tried so far struggle with that, but it is a required feature for e.g. screen recognition and agents that are supposed to operate UIs for you.

clray

Do you think this model will do well with handwritten data extraction?

okwudex

A comparison with the open AllenAI OlmOCR model and Gemini 2.0 Flash would be interesting. There are competing claims, and OlmOCR requires a bit more tooling locally.

jonchun

The Colab link in the description points to the "YT - Phi-4 Multimodal" notebook, not the Mistral OCR notebook.

MichealAngeloArts

Out of 13 typed lone Thai words, it seems to miss small character differences like ร vs ว and บ vs ม. I wonder about longer text to give it more context.

pawinpawin

Does it give x, y coordinate-style details in the output along with the text, like Azure OCR?

shriradhe

Arabic has been the biggest challenge for OCR software over the past years. I am not tech savvy; could you tell us how to use it?

zidane

Unless it's open source, I'm not really interested. Gemini 2.0 has never failed me for this

equious