Using Tesseract-OCR to extract text from images

In this video we use tesseract-ocr to extract text from images in English and Korean. Optical character recognition is useful in cases of data hiding, or when a PDF simply embeds its text as images. For OCR with Tesseract, we must first convert PDF documents to high-resolution images.
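
The PDF-to-image step mentioned above could be sketched in Python roughly as follows. This is only an illustration under assumptions: it assumes the Poppler tool pdftoppm is installed (the video may well use a different converter), and the helper name, the 300 DPI default, and the file names are all made up for the example. The function only builds the command; it does not run it.

```python
import shlex


def pdf_to_images_command(pdf_path, out_prefix, dpi=300):
    """Build (but do not run) a pdftoppm command that renders PDF pages as TIFF images.

    Hypothetical helper for illustration; pdftoppm ships with Poppler.
    """
    if dpi <= 0:
        raise ValueError("dpi must be a positive number")
    # -r sets the rendering resolution in DPI; -tiff selects TIFF output.
    return ["pdftoppm", "-r", str(dpi), "-tiff", str(pdf_path), str(out_prefix)]


if __name__ == "__main__":
    cmd = pdf_to_images_command("scan.pdf", "page")
    # Show the command as it would be typed in a shell.
    print(shlex.join(cmd))
    # To actually run it (requires pdftoppm on PATH):
    #   import subprocess; subprocess.run(cmd, check=True)
```

A higher DPI generally gives the OCR engine more pixels per character at the cost of larger files; 300 is a common starting point, not a value taken from the video.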

Tutorial found here:

010001000100011001010011011000110110100101100101011011100110001101100101
Get more Digital Forensic Science

010100110111010101100010011100110110001101110010011010010110001001100101

Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Please link back to the original video. If you want to use this video for commercial purposes, please contact us first. We would love to see what you are doing.
Comments
Author

Excellent videos! As a second-language speaker, I appreciate your accurate spoken English a lot. Thanks!

Shaalimar
Author

Thank you so much, it helps me dive into OCR really quickly.

fengxie
Author

Thanks!! Hard to come across a tutorial as well explained as this one.

axelmarruenda
Author

For more than one language, you can use the + sign to concatenate the 3-character ISO 639-2 language codes (see the man page),
e.g.
tesseract out.tiff -l eng+kor multi.txt

Mike.Freeman
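
The multi-language tip in the comment above could be wrapped in a small Python helper along these lines. This is a sketch under assumptions: the function name and file names are hypothetical, the traineddata for every listed language must already be installed, and the command is only built, not executed. Note that Tesseract treats the output argument as a base name and appends .txt itself, so the example passes "multi" rather than "multi.txt".

```python
import shlex


def tesseract_command(image_path, output_base, languages):
    """Build (but do not run) a tesseract command for one or more languages.

    Hypothetical helper for illustration. `languages` is a list of Tesseract
    language codes such as ["eng", "kor"]; they are joined with "+".
    """
    if isinstance(languages, str):
        # Accept a single code passed as a plain string.
        languages = [languages]
    if not languages:
        raise ValueError("at least one language code is required")
    lang_arg = "+".join(languages)
    return ["tesseract", str(image_path), str(output_base), "-l", lang_arg]


if __name__ == "__main__":
    cmd = tesseract_command("out.tiff", "multi", ["eng", "kor"])
    # Show the command as it would be typed in a shell.
    print(shlex.join(cmd))
    # To actually run it (requires tesseract on PATH):
    #   import subprocess; subprocess.run(cmd, check=True)
```
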
Author

Thank you so much! This is the simplest tutorial I could think of that explains Tesseract in depth.

jonathanvillatorocordoba
Author

Thanks so much, it's very clear for a non-native English speaker too.

stefanodeboni
Author

Clear and concise! Can't help but subscribe. Thanks, buddy!

randomtoons
Author

Finally, a tutorial on this from a native English speaker. Thank you very much.

Teck_
Author

I liked all your videos, which are very informative. You should produce videos more often. Thanks.

ahsan-lish
Author

How can we apply the ImagetoString function to a live cv2 feed (frames)?

havoclyyours
Author

Joshua, is there a way to know whether a PDF contains graphical data (tables, charts, graphs, etc.)?

hayatt
Author

What if Tesseract is unable to recognize the English font "Ford's folly italic and ladylike BB font"? How do we embed the font into Tesseract so that it recognises the characters in the PDF?

zenoshirani
Author

Not sure this was possible when this video came out, but a quick Google search just showed me that it seems to be possible to hand over several languages as parameters (using "+") at the same time.

ilianos
Author

Did you ever find a way to combine the text from 2 languages? I have a 270-page PDF in Simplified Chinese with around a third in English... such a nightmare to translate.

UpcycleElectronics
Author

So would I be able to recognize numbers and do math problems with them?

accentor
Author

Need your opinion. I'm researching how to take a JPEG photograph of a receipt and run a Java app to get the text from it. Would Tesseract be the best solution?

lilazeonboa
Author

Thank you. Excellent video! How do I install textract on Windows 7 x64?

eloiulrichguebayi
Author

Very detailed tutorial, can you show how to use PaddleOCR next time? It includes more languages.

mengtaoan
Author

Hi. Why were the "Key words :" NOT extracted from the document? See at 6:43.

arunaslipnickas
Author

2x playback speed really improves the pacing.

KilgoreTroutAsf