Extract Text from PDF with Python

In this video we learn how to extract text from a PDF file with Python using PyPDF2, and how to convert a PDF to a text file. We start with a simple example that extracts text from a single page, then extract the text from all the pages in the PDF. Next we pull text only from pages that meet a certain condition (here, pages containing the word Waldo), which shows how to extract text from a specified set of PDF pages. We then write those extracted pages to a new PDF document. Finally, we extract only the sentences that contain Waldo, along with the pages those sentences were located on.

This is based on a real project I did for work where I had to extract pertinent information about specific people from thousands of PDFs that contained many pages each.
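The page-level steps described above can be sketched roughly as follows. This is a minimal sketch, not the code from the video: the helper names (`read_page_texts`, `pages_containing`, `save_as_text`, `write_pages_to_pdf`) are hypothetical, and it assumes the PyPDF2 3.x names `PdfReader`, `PdfWriter`, `extract_text` and `add_page`, which may differ from the older names used on screen.

```python
from pathlib import Path


def read_page_texts(pdf_path):
    """Return a list with the extracted text of every page, in page order."""
    # Imported here so the pure-Python helpers below work without PyPDF2.
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    # Fall back to an empty string if a page yields no text (e.g. a scan).
    return [page.extract_text() or "" for page in reader.pages]


def pages_containing(page_texts, keyword):
    """Return the zero-based indexes of pages whose text contains keyword."""
    return [i for i, text in enumerate(page_texts) if keyword in text]


def save_as_text(page_texts, txt_path):
    """Write all page texts to one UTF-8 file, with a blank line between pages."""
    Path(txt_path).write_text("\n\n".join(page_texts), encoding="utf-8")


def write_pages_to_pdf(pdf_path, page_indexes, out_path):
    """Copy the selected pages of pdf_path into a new PDF at out_path."""
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(pdf_path)
    writer = PdfWriter()
    for i in page_indexes:
        writer.add_page(reader.pages[i])
    with open(out_path, "wb") as fh:
        writer.write(fh)


# In-memory stand-in for the output of read_page_texts("some.pdf").
sample = ["Page one has no match.", "Here is Waldo.", "Nothing here."]
print(pages_containing(sample, "Waldo"))  # → [1]
```

With a real file, the indexes returned by `pages_containing(read_page_texts(path), "Waldo")` could be passed straight to `write_pages_to_pdf`. The match is case-sensitive, and pages that are scanned images will typically produce no text at all.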

0:00 Intro - Where's Waldo
0:36 pip install
0:59 Extract Text
1:20 Step 1
2:09 Step 2
2:58 All Pages to txt
4:20 Where's Waldo Pages
5:51 Write to PDF
6:21 Get Text from Specific Pages
8:15 Waldo Sentences
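
The final chapter, finding the sentences that mention Waldo and the pages they are on, might look something like the sketch below. It is an illustration under stated assumptions rather than the video's code: the function name `sentences_with_keyword` is hypothetical, the input is a list of already extracted page texts, and the sentence splitter is deliberately naive (it will mis-split abbreviations such as "Dr."). Line breaks are collapsed first, and every matching sentence on a page is kept.

```python
import re


def sentences_with_keyword(page_texts, keyword):
    """Return (page_number, sentence) pairs for every sentence containing keyword.

    page_number is 1-based. All matching sentences on a page are returned,
    not only the first one.
    """
    results = []
    for page_number, text in enumerate(page_texts, start=1):
        # Collapse line breaks and repeated spaces left over from extraction.
        flat = " ".join(text.split())
        # Naive split: a sentence ends at '.', '!' or '?' followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", flat):
            if keyword in sentence:
                results.append((page_number, sentence))
    return results


pages = [
    "Nobody is here.",
    "Waldo hid behind\nthe tree. The dog barked. Then Waldo ran away!",
]
for page_number, sentence in sentences_with_keyword(pages, "Waldo"):
    print(page_number, sentence)
# 2 Waldo hid behind the tree.
# 2 Then Waldo ran away!
```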
Comments
Author

We love the work you do. You probably save someone's day every time you upload a new video. Thank you 😊 The way you include all the viewers in those appreciations after completing a task is awesome, ohhhoooo... we did great 👍

MURALIKRISHNAhai
Author

Amazing. Clear explanation of what's being done.

Subscribed.

TheBtrivedi
Author

Very nicely explained. I would like to know whether we can detect if a page in a PDF has a header or footer, and extract the page numbers of the pages that have one. Can this be scripted using PyPDF2? Please advise.

shilpakale
Author

Hello, I found a small bug in the code. If 'Waldo' exists in two places on the same page (in different sentences), the second 'Waldo' is not found. Can you provide a fix? Thanks!

johnkhan
Author

Amazing tutorial. I noticed there is a \n at the end of each line. Is there anything we can do to detect the whole paragraph?

cbao
Author

And if I only want to extract some keywords across multiple pages in 100 PDFs, what should I do? I don't want all the text, only a few words.

KyroAtelerix
Author

Hi, very nice video ^^... but will this procedure work if I need to extract certain text strings from PDFs generated from AutoCAD drawings? Thanks ^^

vrbaac
Author

Hey, I'm new to this and using Thonny to edit and run code. When I get to extracting the text, a Notepad file is opened but the text from the PDF is not written there. Any clue why this would happen?

adamrassi
Author

Does the code work if there are multiple keywords in the same sentence?

yizzi
Author

For page extraction, say we want to extract page 2 and page 5. Do we use getobj.pageS(2, 5)?

ajsunofficial
Author

Sir, can you guide me on how to extract text at specific coordinates from a PDF file?

saurabhverma
Author

Hi buddy, is it possible to read line by line? If yes, how can I do it?

lasnroo
Author

How about using a for loop to extract a text title to another for multiple PDFs?

JM-frbc
Author

How can I avoid the line break problem for some paragraphs in your result?

cars_worldcw
Author

How can I extract only the highlighted text in a PDF?

kingfunny
Author

I can't get my code to find the PDF I'm trying to use. Does it need to be saved somewhere in particular?

TheKylesauce
Author

How did you install pypdf? When I typed the command, it said not found.

Atharv-wmvr
Author

New to Python coding. Sorry for the stupid questions: I ran the following command:
pip install PyPDF2
Collecting PyPDF2
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
232.6/232.6 kB 4.8 MB/s eta 0:00:00
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1

IDLE throws this error:
from PyPDF2 import PdfFileReader
ModuleNotFoundError: No module named 'PyPDF2'

What am I missing???

bryanl
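
One common cause of a `ModuleNotFoundError` right after a successful `pip install` is that pip installed the package into a different Python installation than the one IDLE is running. Whether that is the case here cannot be told from the comment alone; the sketch below only shows one way to check. The helper name `is_importable` is hypothetical.

```python
import importlib.util
import sys


def is_importable(module_name):
    """Return True if this interpreter can find the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


# The interpreter running this code; compare it with the one pip belongs to.
print(sys.executable)
# False here means PyPDF2 is not installed for this particular interpreter.
print(is_importable("PyPDF2"))
```

If the second line prints `False`, running `-m pip install PyPDF2` with the exact interpreter path printed on the first line installs the package where IDLE can see it. Separately, in PyPDF2 3.x the class `PdfFileReader` was replaced by `PdfReader`, so code written against the older name will likely need updating once the import itself succeeds.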
Author

I guess it doesn't matter if "the" is typed as "teh" as the presenter did!

dianamarahenry