How to make OCR PDFs on Windows using Tesseract


It's free, it's easy, it's Tesseract: an Optical Character Recognition (OCR) engine that detects text in images and overlays that text onto PDFs. Here's how to do it in as short a tutorial as possible. A medium amount of technical knowledge is helpful.

0:00 Introduction
0:48 Tesseract
1:32 PATH variable
2:38 ImageMagick
3:21 Python
3:49 GhostScript
5:08 How to run the script

Here are the links for the video:

Here's what each piece of software is doing:
Tesseract: It's what is actually doing the OCR, and putting the text onto images in PDFs. The problem is that Tesseract only takes images as input, so...
ImageMagick: It converts PDFs into a series of PNG images. The problem is it actually needs...
GhostScript: Which provides the tools and libraries that ImageMagick uses. Then there's...
Python: Basic scripting language that is used to run the script I wrote. And that's...
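
As a rough illustration of how the four pieces fit together, the sketch below builds the two kinds of commands the workflow relies on. It is not the script from the video: the function names, the page-%03d.png naming, and the 150 DPI default are assumptions chosen for the example, and it uses the `magick` executable name, which may differ on your install.

```python
from pathlib import Path


def convert_command(pdf, out_dir, dpi=150):
    """Build the ImageMagick call that turns each PDF page into a PNG.

    ImageMagick hands the PDF to GhostScript behind the scenes.
    The 150 DPI default is an assumption for this sketch.
    """
    pattern = Path(out_dir) / "page-%03d.png"  # hypothetical naming scheme
    return ["magick", "-density", str(dpi), str(pdf), str(pattern)]


def ocr_command(png):
    """Build the Tesseract call that writes <name>-ocr.pdf next to the image."""
    png = Path(png)
    out_base = png.with_name(png.stem + "-ocr")  # Tesseract appends ".pdf"
    return ["tesseract", str(png), str(out_base), "pdf"]


print(convert_command("scan.pdf", "pages"))
print(ocr_command("pages/page-000.png"))
```

Each list can be passed to subprocess.run once the tools are on PATH. The per-page PDFs would then be merged with a library such as PyPDF2.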


Comments

If the script gives you an error, the likely cause is that you have a newer version of PyPDF2, where PdfFileMerger has been renamed to PdfMerger. To fix it, change the import to "from PyPDF2 import PdfFileReader, PdfFileWriter, PdfMerger", then press Ctrl+F to find "merger = PdfFileMerger()" and change it to "merger = PdfMerger()". That should fix the problem :). I hope the OP sees this and posts an updated comment/Python file. This video was incredibly useful and well done. Many thanks.

alekhinesgun
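
The rename described in the comment above can also be handled in code, so the same script runs on both older and newer PyPDF2 releases. This is a minimal sketch that assumes only the two class names mentioned in the comment. pick_merger_class is a hypothetical helper, not part of PyPDF2 or of the video's script.

```python
def pick_merger_class(pdf_module):
    """Return the merger class from a PyPDF2-like module.

    The newer name is checked first: a release may still define the
    old name while refusing to let you use it.
    """
    for name in ("PdfMerger", "PdfFileMerger"):
        cls = getattr(pdf_module, name, None)
        if cls is not None:
            return cls
    raise ImportError("no PdfMerger or PdfFileMerger class found")


# Usage (requires PyPDF2 to be installed):
#   import PyPDF2
#   merger = pick_merger_class(PyPDF2)()
```

With this in place, the line "merger = PdfFileMerger()" no longer needs to be edited by hand when the library is upgraded.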

Thank you so much, it worked! For those having trouble: when you open cmd, make sure you are in the folder that contains the PDF file (e.g., C:\Users\James> cd). That is when you run pip install PyPDF2.

anaranjo

Thanks! Best tutorial ever. Short, concise, and to the point! Simply perfect!

Daniel-unii

Thank you so much! I really appreciate how much effort you put into the video, especially with the captions too!

David-wwsg

I ran into many errors and struggled with them for almost half an hour. I found a few changes the script needs and some tips, so I'd like to share them to help others.

1. As many people have mentioned, PdfFileMerger has been renamed to PdfMerger, so we have to replace it.

2. I also got a parameter error. It seemed to occur in the conversion phase, which uses ImageMagick. I searched and found that the ImageMagick CLI command on Windows is 'magick', so I changed it and it finally worked. The 'convert' command worked in the video, so I only recommend this change if 'convert' does not work for you.

3. If you want to OCR a specific language other than English, find this line:

tesseract = 'tesseract "' + combined_pic + '" "' + combined_pic + '-ocr" PDF'

and insert a language option:

-l LANGUAGE_WHAT_YOU_WANT between '-ocr" and PDF', so the result is:

tesseract = 'tesseract "' + combined_pic + '" "' + combined_pic + '-ocr" -l LANGUAGE_WHAT_YOU_WANT PDF'

If you want to OCR multiple languages, put + between the languages:

-l LANG01+LANG02

I hope this helps others who are having trouble.

elegantcat
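
The language tip above can be sketched as a small function. This is an illustration under assumptions, not the video's script: tesseract_command is a hypothetical name, and it builds an argument list instead of one quoted string, which sidesteps the quoting around file names. It keeps the same output naming ('-ocr' appended to the image name) and the same -l placement as the line quoted in the comment.

```python
def tesseract_command(image, languages=()):
    """Build a Tesseract call that writes a searchable PDF.

    languages is a sequence of Tesseract language codes such as
    "eng", "spa" or "deu"; multiple codes are joined with "+".
    """
    cmd = ["tesseract", image, image + "-ocr"]
    if languages:
        cmd += ["-l", "+".join(languages)]
    cmd.append("PDF")  # same config name the quoted script line uses
    return cmd


print(tesseract_command("scan.png", ["spa", "eng"]))
```

The matching language data has to be installed for Tesseract to accept a code. Running "tesseract --list-langs" shows which ones are available.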

Thank you for providing all the download links. It worked for me.
👍

vaishalimahajan

Thank you so much! If you get an "invalid parameter -150" error, close the files and then run it again. It might also take some time depending on the file size.

kwizerafrank
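
Another possible cause of the "invalid parameter" error is that a bare 'convert' on Windows can resolve to the built-in disk conversion utility instead of ImageMagick. A minimal sketch of choosing the executable name follows. pick_imagemagick is a hypothetical helper, and the lookup function is passed in so the logic can be tried without ImageMagick installed.

```python
import shutil


def pick_imagemagick(which=shutil.which):
    """Prefer 'magick' when it is on PATH, otherwise fall back to 'convert'."""
    # ImageMagick 7 installs a 'magick' executable; older setups may only
    # have 'convert', which can clash with the Windows system tool.
    return "magick" if which("magick") else "convert"


print(pick_imagemagick())
```

The returned name would replace the hard-coded 'convert' at the start of the conversion command.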

OMG! It works. Thanks for doing this video and for all of the software.

PhatNguyen-oqbd

An absolute G, thanks for saving lives.

andriikorniienko

@2:23 - After years of going from Win3.1 to win95 to win98 to winXP-Vista-7-8- and now Win10 - I can say this about PATH Variables: In order to keep everything separate and looking nice and working in the easiest way possible - you should always make a NEW path variable named "XXX-path" (like Tesseract-path) and put the path into THAT variable - AND THEN - edit the PATH variable and just add "%Tesseract-path%" to that variable. In this way, you can easily change the "Tesseract-path" variable and not muck up the PATH variable. Now - YES - it does make a NEW variable BUT - put it in the TOP area and not the bottom area so it is only invoked when you open a CLI (Command Line Interface or DOS window). So - the thing to think about is - what if they change where they put a program (or what if they change the name every single time they come out with a new version [like "myprog v1", then "myprog v2", then "myprog this is where it goes v3"]?). With this method all you need to do is to do the pathway selection, go to Environment Variables, find your "Tesseract-path" variable - and change the path there. It would then be automatically changed in the PATH variable. Or what if you wanted TWO versions of Tesseract? Why that's easy! You just put the new version in "Tesseract v2.x-path" and add that in to the PATH variable. Anyway - this is how I do it. It makes life simple (or simpler) if you always do it the same way. And now - back to the video. :-) Which is excellent by the way. :-)

markmanning
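
The indirection described in the comment above can be simulated in a few lines. This is only a sketch of how a %NAME% reference gets substituted: expand_percent_vars is a hypothetical helper, the install folder is a made-up example value, and real Windows expansion has more rules than this.

```python
import re


def expand_percent_vars(value, env):
    """Replace each %NAME% with env[NAME]; unknown names are left as-is."""
    return re.sub(r"%([^%]+)%", lambda m: env.get(m.group(1), m.group(0)), value)


# Hypothetical values for illustration only.
env = {"Tesseract-path": r"C:\Program Files\Tesseract-OCR"}
path_value = r"C:\Windows\System32;%Tesseract-path%"

expanded = expand_percent_vars(path_value, env)
print(expanded)

# Moving the program only means changing the one variable.
env["Tesseract-path"] = r"D:\Tools\Tesseract"
print(expand_percent_vars(path_value, env))
```

The PATH entry itself never changes. Only the value it points to does.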

If you're having trouble with the pip install part, you need to add that script to your PATH.

hashasbashbash
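
Before running the script, it can help to check which of the required programs are actually reachable from PATH. A minimal sketch follows. missing_tools is a hypothetical helper, and the executable names in the example (including gswin64c for 64-bit GhostScript) are assumptions that may differ on your install.

```python
import shutil


def missing_tools(names, which=shutil.which):
    """Return the names that cannot be found on PATH."""
    return [name for name in names if which(name) is None]


# Assumed executable names; adjust them to match your installation.
required = ["tesseract", "magick", "gswin64c", "python", "pip"]
print("Not on PATH:", missing_tools(required))
```

An empty list means every tool was found. Anything listed needs its folder added to PATH, followed by opening a new command prompt.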

I get the following error when I try to open the ocr-combined file: "There was an error opening this document. This file cannot be opened because it has no pages"

StanleyDenman

This video is great. I tried it. However, I got stuck on the procedure you describe at 4:35, where I needed to save convert.py to a folder called 'ocr-pdf'. I could not find such a folder on my desktop (I wonder how you came to have it). So where am I supposed to save convert.py? Nevertheless, I simply saved it on my desktop and followed the rest of the instructions. Fortunately it worked, but only once, and I am mystified. It never worked again when I tried converting other scanned PDF files. I suspect it's because I did not save it in an 'ocr-pdf' folder on my desktop. How do I get such a folder? Is there any other workaround to make convert.py work consistently? Thanks in advance.😊

Queruwk

What an amazing script and video! Thank you, very helpful. May Allah bless you more.
😊

mahmudrahman

Hi. Your script ran and just created an empty subfolder for every page of my PDF. The PDF itself is untouched. Could you advise what happened here?

stefansch

Does this only work with English? My output pdf is empty and the CLI shows invalid argument for each image.

kalabhairava

I installed pip with Python, but it says 'pip' is not a recognized command.

techgalaxy

I am having serious trouble here. I don't want to screw up my laptop trying to get this to work. I have followed the instructions to a tee. There is a blip at 4:57. It appears to jump over a step. Either way, I followed it and this is what it said: "...Desktop\OCR-PDF>pip install PyPDF2
'pip' is not recognized as an internal or external command, operable program or batch file." Oh yay! So I tried it another way... "Desktop\OCR-PDF>convert.py
Traceback (most recent call last):
File "...Desktop\OCR-PDF\convert.py", line 4, in <module>
from PyPDF2 import PdfFileMerger
ModuleNotFoundError: No module named 'PyPDF2'"
So I read TWoboS's steps and that didn't work. I read in Oliver's thread below that some got it to work after rebooting, but not for me. I read that you have to add the PATH to the Desktop folder, but that didn't work either. My computer is completely up to date. Is there another way to do this? Did something get left out?

cindylloyd
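
When 'pip' is not recognized but Python itself runs, pip can usually be reached through the interpreter as a module instead of as a separate command. A minimal sketch follows. pip_install_command is a hypothetical helper, and it assumes pip was installed together with Python.

```python
import sys


def pip_install_command(package, python=sys.executable):
    """Build a 'python -m pip install <package>' call.

    Using the interpreter's own path means pip does not need to be
    on PATH, and the package lands in the Python that runs the script.
    """
    return [python, "-m", "pip", "install", package]


print(pip_install_command("PyPDF2"))

# To actually install (needs network access):
#   import subprocess
#   subprocess.run(pip_install_command("PyPDF2"), check=True)
```

The same idea works directly at the command prompt as "python -m pip install PyPDF2", or "py -m pip install PyPDF2" if the Python launcher is installed.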

My document is in Spanish. How can I choose the language to use?

ManiSalcedo

Can I use this script to convert to another language?

katietran
