Extract Text from any PDF File in Python 3.10 Tutorial

Today we will be learning how we can extract the text from PDF files in Python 3.10, so that we can later process that text in any way we please.

Comments

In some of the latest updates to PyPDF2 the class "PdfFileReader" got replaced with "PdfReader". Code still works fine with "PdfReader". :)

tobiwie

Awesome, so helpful! That's much simpler and more ready-to-use than all the other approaches found online. Is there a way to export the extracted text to a CSV or XLSX file?

frapsg
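
On the CSV question above: a minimal sketch using only the standard library's csv module. It assumes the text has already been extracted into a list of strings, one per page, as the tutorial's approach produces. The function name write_pages_to_csv and the column names are hypothetical, not from the video.

```python
import csv
import io
from typing import Iterable, TextIO


def write_pages_to_csv(pages: Iterable[str], out: TextIO) -> int:
    """Write one CSV row per page (page number, text). Returns rows written."""
    writer = csv.writer(out)
    writer.writerow(["page", "text"])  # hypothetical column names
    count = 0
    for number, text in enumerate(pages, start=1):
        # Guard against pages with no extractable text
        writer.writerow([number, text or ""])
        count += 1
    return count


if __name__ == "__main__":
    # Stand-in strings; in practice these would come from page.extract_text()
    sample_pages = ["First page text", "Second page,\nwith a comma and a newline"]
    buffer = io.StringIO()
    write_pages_to_csv(sample_pages, buffer)
    print(buffer.getvalue())
```

To write a real file, the same function could be passed a handle from open("output.csv", "w", newline="", encoding="utf-8"). An XLSX export would need a third-party library and is not shown here.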

Just amazing explanation, short and sweet!

vitaliibaglaiev

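The code did not work for me on a Windows 11 PC. I kept having ChatGPT analyze the code and error messages, and after many tries it fixed it:

import PyPDF2


# Note: the function name below is a placeholder; the original name was lost from this comment.
def extract_text_from_pdf(pdf_file: str) -> list[str]:
    # Open the PDF file of your choice in binary mode
    with open(pdf_file, 'rb') as pdf:
        reader = PyPDF2.PdfReader(pdf)
        pdf_text = []

        # Collect the text of every page, in order
        for page in reader.pages:
            content = page.extract_text()
            pdf_text.append(content)

        return pdf_text


def main():
    # 'sample.pdf' is a placeholder path; the original argument was lost from this comment
    extracted_text = extract_text_from_pdf('sample.pdf')
    for text in extracted_text:
        print(text)


if __name__ == '__main__':
    main()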

davet

How can I extract data from more than one PDF file and put it in a table?

albeeshi

Do you have a solution for PDFs with characters? When I apply this solution to those PDFs, it prints gibberish characters.

gulfamhussain

Hey, I have some 600 files with a large volume of data, and text extraction using PyPDF2 is taking a lot of time. Is there any other way to do this?

rishikeshchava

Thanks for the awesome tutorial. Please do a video on two-sided PDFs; I couldn't find one on YouTube 🙃

vishnumuralidhar

I found that by opening a PDF file with Mozilla Firefox and inspecting it with the developer tools, you can collect its text (with the help of JavaScript) after the web browser has converted it to HTML, and maybe save it for further processing with some programming language.

gvenagas

Thank you for the awesome tutorial. I have a question about extracting articles, and I hope you can help me. When extracting articles and reports, there are many references, table legends, and titles that are not required. Would it be possible to remove all those references and table contents, including legends and titles, when extracting from the PDF file?

Miyazaki

Hi sir, does it work on local languages like Telugu?

Sathishedutech

Nice tutorial. How can I get the coordinates of the text in my PDF file?

kevinmakumbe

I am pretty sure there are over a thousand instances of the word "coffee" in the PDF. However, this seems to have counted only the number of pages on which the word appeared.

jvwee
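
The difference described above, pages containing a word versus total occurrences of it, can be sketched with the standard re module. This is not the video's code: it assumes the page texts are already in a list of strings, and both function names are hypothetical.

```python
import re


def count_pages_with_word(pages: list[str], word: str) -> int:
    """Number of pages on which the word appears at least once."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    return sum(1 for text in pages if pattern.search(text))


def count_total_occurrences(pages: list[str], word: str) -> int:
    """Total number of whole-word matches across all pages."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    return sum(len(pattern.findall(text)) for text in pages)


if __name__ == "__main__":
    # Stand-in page texts; real ones would come from page.extract_text()
    pages = ["Coffee, coffee and more coffee.", "No tea here.", "One coffee."]
    print(count_pages_with_word(pages, "coffee"))    # 2 pages contain the word
    print(count_total_occurrences(pages, "coffee"))  # 4 occurrences in total
```

Searching with search() stops at the first hit on a page, which gives a per-page count; findall() collects every match, which gives the total.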

I keep getting "SyntaxError: unmatched ')'" on line 4. I'm running Python 3.9; could that be the cause?

zainsaqib

Will it work on the Arabic language, and will it be able to extract text from handwritten manuscripts?

MedoHamdani

I wrote the code line by line, word for word, but it keeps giving me "File not found". How is that possible?
P.S. I managed to extract the text; the only problem is the layout of the output: I get a string that is miles long.

gianlucagiannetto
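
For the "miles long" string mentioned above, one option is to re-wrap the extracted text with the standard textwrap module. A minimal sketch, assuming paragraphs are separated by blank lines in the extracted text (which may not hold for every PDF); the function name wrap_extracted_text is hypothetical.

```python
import textwrap


def wrap_extracted_text(text: str, width: int = 80) -> str:
    """Re-wrap text to the given width, keeping blank-line paragraph breaks."""
    paragraphs = text.split("\n\n")
    # Collapse runs of whitespace inside each paragraph, then wrap it
    wrapped = [textwrap.fill(" ".join(p.split()), width=width) for p in paragraphs]
    return "\n\n".join(wrapped)


if __name__ == "__main__":
    # Stand-in for one very long line of extracted text
    long_line = "word " * 40
    print(wrap_extracted_text(long_line, width=40))
```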

What if we want to extract the text of one particular page?

atharkhalid
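
On the single-page question above: in recent PyPDF2 versions, reader.pages can be indexed like a list (zero-based), so reader.pages[2].extract_text() would return the text of the third page. The sketch below wraps that idea with 1-based page numbers and a bounds check. The helper name extract_page_text is hypothetical, and FakePage is only a stand-in so the example runs without a PDF file.

```python
from typing import Protocol, Sequence


class PageLike(Protocol):
    """Anything with an extract_text() method, such as a PyPDF2 page."""
    def extract_text(self) -> str: ...


def extract_page_text(pages: Sequence[PageLike], page_number: int) -> str:
    """Return the text of one page, using 1-based page numbers."""
    if not 1 <= page_number <= len(pages):
        raise ValueError(f"page_number must be between 1 and {len(pages)}")
    return pages[page_number - 1].extract_text()


class FakePage:
    """Stand-in for a PDF page object, for demonstration only."""
    def __init__(self, text: str) -> None:
        self._text = text

    def extract_text(self) -> str:
        return self._text


if __name__ == "__main__":
    pages = [FakePage("page one"), FakePage("page two")]
    print(extract_page_text(pages, 2))
```

With a real file, reader.pages from a PyPDF2.PdfReader would be passed in place of the list of FakePage objects.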

How do you add the PDF file to the project?

louis

Please, the resolution of your screen is not clear.

raniarasmy

No idea how this is set up, kinda pointless. Where is pypdf? Do I get it from inside my bum bum? And what is this program?

Baka_Oppai