Extract Text from Image with Tesseract OCR

Optical character recognition (OCR) is the electronic conversion of images of typed, handwritten, or printed text into machine-encoded text, whether from a scanned document, a photo of a document, or subtitle text superimposed on an image (e.g., from a television broadcast). OCR is a field of research in pattern recognition, artificial intelligence, and computer vision. It is widely used for data entry from printed paper records such as passport documents, invoices, bank statements, computerized receipts, business cards, mail, and printouts of static data. It is a common method of digitizing printed texts so that they can be electronically edited, searched, stored more compactly, displayed online, and used in machine processes.

Tesseract is an OCR engine for various operating systems. It is free software, released under the Apache License. Originally developed by Hewlett-Packard as proprietary software in the 1980s, it was released as open source in 2005, and its development has been sponsored by Google since 2006. In this tutorial, we will use Tesseract OCR to convert the printed text in a JPG image of an old newspaper article into computer-encoded text.
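As a rough illustration of the conversion described above, here is a minimal Python sketch that calls the Tesseract command-line tool and returns the recognized text. It is not the code from the video. The function names and the file name `article.jpg` are hypothetical, and the sketch assumes Tesseract 4 or later is installed and on the PATH (older 3.x releases spell the page segmentation flag `-psm`).

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, lang="eng", psm=3):
    """Build the argument list for a Tesseract CLI call (hypothetical helper)."""
    return [
        "tesseract",
        str(image_path),
        "stdout",            # special output base: print recognized text to stdout
        "-l", lang,          # language of the traineddata to use
        "--psm", str(psm),   # page segmentation mode; 3 = fully automatic, no OSD
    ]


def extract_text(image_path, lang="eng", psm=3):
    """Run Tesseract on an image file and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    if not Path(image_path).is_file():
        raise FileNotFoundError(str(image_path))
    result = subprocess.run(
        build_tesseract_command(image_path, lang, psm),
        capture_output=True,
        text=True,
        check=True,          # raise CalledProcessError if Tesseract fails
    )
    return result.stdout.strip()


# Example usage (assumes a file named article.jpg exists):
# print(extract_text("article.jpg"))
```

Recognition quality on old newspaper scans usually depends heavily on image resolution and contrast, so results from a sketch like this will vary from image to image.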

Code used in this video can be downloaded from GitHub:

Hashtags: #tesseract #ocr #opticalcharacterrecognition #tutorial #tutorials #artificialintelligence #machinelearning #deeplearning #python #pythonprogramming #pythontutorial #aitutorial #coding
Comments

Hi. I am trying to use Tesseract to extract some numbers from images, but I have not succeeded. Is there a way to train it or improve recognition? I only need to recognise numbers.
Thanks

skoomaaddict