NEW GPT-4o Vision API: Best Way to Copy Text from Image (OCR in Python)

preview_player
Показать описание
OpenAI has released a new model, GPT-4o with Vision capabilities built right into its API. It is advertised as more accurate, faster and half the cost of the vision capabilities in the previous model. In this video we put that to the test, and try out using a python script to extract text off invoices (even handwritten ones). Also, I will show some tricks to get consistent output from the API for different types of images.

GitHub Link to Starter code:
Рекомендации по теме
Комментарии
Автор

Totally agree with the Llama comment at the end: every company is going to want to build their own model (trained on basic open source libraries data and with their own data on top of it). I still struggle with understanding how that new world will look like... A bunch of "Jarvis" everywhere? Can you make a video of what you think interacting with that new internet ecosystem may look like?
Thanks!

pjgilcunha
Автор

thank you for the video- GPT4o is my default model at the moment - but I test other LMMs as well -)

micbab-vgmu
Автор

Hey, I was doing the same thing before i found your video but adding response_format was a great help. Thanks!
Now i am finding a way where i send multiple images to gpt4o and get an indicator if image is rotated(it does not work on rotated images) now when it comes to multiple images i need an identifier of them to rotate required image only, Do you have anything in mind?

rajmandaviya
Автор

Thank you so much!
I still encounter some issues like I'm uploading an Invoice and every time it gives different vendor name(upper case, lower case) and how to mention date format in JSON Schema? it always return different format. how can I prompt this?

devamsanghavi
Автор

Hi. Do you know the limit of tokens i can use? Im trying to transcribe an image with a lot of text, but it it stops in the middle. it seems the maximum of tokens i can use is around 1000.. How can i set more tokens per request?

danielalbano
Автор

Hello!

Thank you for the insightful video. I am currently working on a side project using GPT-4 to extract handwritten text from paper. However, since handwritten text varies greatly and some handwriting can be very difficult to read, there are occasional extraction errors that could affect the product's credibility.

I am considering implementing a method where (1) the confidence level of each extraction is assessed, and (2) if the confidence level falls below a certain threshold, (3) the result is marked as N/A or skipped. However, I am in first step using GPT to make product, as I am a product manager, not a developer. Do you have any advice on how to handle this issue?

Thank you once again for your helpful video.

Best regards,
From Korea

JAYJang-mezh
Автор

Hi,
Thanks for this video on using the GPT-4o Vision API. I'm using the code shown to detect text in images, and it's working very well. However, when I request the pixel coordinates for sections of the invoice (general information, product details, and payments), the accuracy is not very good.

Could you provide some advice or demonstrate how to improve the accuracy of the pixel coordinates for each section in the image? I need to locate specific areas like the invoice number, client information, tax ID (CIF or NIF), product details, and payment information such as the total amount and VAT.

Thanks in advance for any help!

Eric
Автор

I tried to use the vision functionality, but unfortunately sometimes it invents the numbers and even if I force it in the prompt it doesn't do it :(

xmagcx