Extracting data from PDF files using Python

I introduce the PyPDF2 package, which we need to install.

Installation on Anaconda:
conda install -c conda-forge pypdf2

Installation using the pip installer:
pip install PyPDF2

I show you how to create and activate a virtual environment (which is optional – but useful to do). Then we develop the code step-by-step. This will enable you to learn how to modify the code to suit your specific requirements. Please leave a comment if you have any questions.

Finally, we will refactor the code. We define a function that takes a search term and filename and returns a tuple containing the total number of occurrences and the number of pages that contain the search term at least once.
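A minimal sketch of what such a refactored function could look like, assuming PyPDF2 3.x (`PdfReader`, `reader.pages`, `page.extract_text()`). The names `count_in_pages` and `search_pdf` are illustrative and not taken from the video, and the case-insensitive matching is an assumption.

```python
def count_in_pages(search_term, page_texts):
    """Return (total occurrences, number of pages with at least one hit)."""
    term = search_term.lower()
    if not term:
        # An empty search term would match everywhere, so treat it as no match.
        return 0, 0
    total = 0
    pages_with_hit = 0
    for text in page_texts:
        hits = text.lower().count(term)  # non-overlapping substring matches
        total += hits
        if hits > 0:
            pages_with_hit += 1
    return total, pages_with_hit


def search_pdf(search_term, filename):
    """Read a PDF with PyPDF2 3.x and count the search term across its pages."""
    # Imported here so count_in_pages stays usable without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(filename)
    # Fall back to an empty string if a page yields no extractable text.
    page_texts = [page.extract_text() or "" for page in reader.pages]
    return count_in_pages(search_term, page_texts)
```

Calling `search_pdf("Director", "example.pdf")` (a hypothetical filename) would return the tuple described above. Keeping the counting in its own function makes it easy to try out on plain strings first.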

*Chapters*
0:00 Welcome
0:15 Return all occurrences & page numbers
0:44 Example PDF
2:23 Python setup
3:55 Virtual environment
6:16 Coding fun
28:05 Refactoring

*The channel*
YUNIKARN focuses on publishing educational content in applied statistics, mathematics, and data science. In these fields, programming skills have become essential. Hence, we cover various programming languages including Python, Stata, and C++ to tackle problems and for fun.

*Hashtags*
#datascience #python #PDF
Comments

Excellent video, congratulations.

Is it possible to search for several words on the same line?

Example:

From: Paulo Feitosa
Sent: quinta-feira, 1 de dezembro de 2022 17:48

I have a PDF with many occurrences of the words From and Sent. I want to search for them and also return the line of the PDF document in which they appear.

SuperPaulofeitosa

Fantastic tutorial, thanks. I wonder how we could search for multiple search terms and, at the end, make a table (CSV) out of the results. Thanks.

agustincsn
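One possible approach to the question above, as a rough sketch: count each term over the page texts and write the rows with the standard `csv` module. The function names and column headers are hypothetical. `page_texts` is assumed to be a list of strings, one per page, for example from `page.extract_text()` in PyPDF2 3.x.

```python
import csv


def count_terms(terms, page_texts):
    """Return one (term, total occurrences, pages with a hit) row per term."""
    rows = []
    for term in terms:
        needle = term.lower()
        per_page = [text.lower().count(needle) for text in page_texts]
        pages_with_hit = sum(1 for hits in per_page if hits > 0)
        rows.append((term, sum(per_page), pages_with_hit))
    return rows


def write_csv(rows, csv_path):
    """Write the rows to a CSV file with a header line."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["term", "occurrences", "pages"])
        writer.writerows(rows)
```

The resulting file opens directly in a spreadsheet program.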

Hi there, thanks for the great video. Is there any way we can pick out the words or terms that occur the most? Instead of searching for a given word, could we ask Python to show us, say, the 10 or 20 most frequent words?

mirof
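A sketch of one way to do this with `collections.Counter` from the standard library. The function name `top_words` and the `min_length` parameter are illustrative. The tokenizer is deliberately simple (runs of ASCII letters only), which is an assumption that will not suit every document.

```python
import re
from collections import Counter


def top_words(page_texts, n=10, min_length=1):
    """Return the n most frequent words as (word, count) pairs."""
    counts = Counter()
    for text in page_texts:
        # Lowercase first so "The" and "the" are counted together.
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(word for word in words if len(word) >= min_length)
    return counts.most_common(n)
```

Raising `min_length` (for example to 4) is a crude way to skip short filler words such as "the" or "and".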

Hi. I want to extract only the paragraphs and titles, without any tables or figures, from multiple PDF files. How can I solve this?

kibtiachowdhury

Hi y'all! Thank you very much for this video. I tried for hours to write a script that does exactly what you explain here. I had almost given up, but then the YouTube algorithm brought me here, to the most comprehensive pypdf string-search tutorial I've seen so far. However, I keep running into this freaking "TypeError: a bytes-like object is required, not 'dict'", which seems to be a thing with PyPDF2 and Python 3. I've researched this for quite a while and just couldn't solve it. Since this video is relatively new, maybe there's hope that you or somebody else here knows what to do? Thanks anyway, great tutorials!

michaelobrist

Hi. Thanks for your video. Is it possible to search for multiple terms in multiple PDF files at the same time?

saeedewu

Can we identify a table in the PDF and represent the same in a tabular format?

hariprasad-chqc

Greetings, great video tutorial. I have a question: I was able to search for a string of words using this code without any modifications. What I would like to do is return something based on the search words. For example, if I'm searching for the date something occurred, there is typically a preceding string: "Date of Service" should have a date following it, for example "Date of Service" 01/05/2019, and I want to return the date, 01/05/2019. Two things would need to change: how do we return the date, given that it is not the term being searched for, and, since it is not a string, would we need to change the str anywhere in the code?

catesconsultinggroupllc
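A regular expression is one way to capture the value that follows a label. This sketch assumes the date is written as digits separated by slashes (such as 01/05/2019) and follows the label with only punctuation or whitespace in between. The function name `find_dates_after` is hypothetical. Extracted PDF text is a string, so the date comes back as a string too and the `str` handling does not need to change.

```python
import re


def find_dates_after(label, text):
    """Return every slash-separated date that directly follows the label."""
    pattern = re.compile(
        # re.escape keeps any special characters in the label literal.
        re.escape(label) + r"\W*(\d{1,2}/\d{1,2}/\d{4})",
        re.IGNORECASE,
    )
    # With a single capture group, findall returns the captured dates only.
    return pattern.findall(text)
```

If a real date object is needed, the captured string can be parsed afterwards with `datetime.strptime`, using a format that matches the day and month order of the document.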

Then how do you put that Director 31 times into an output table? I am trying to extract specific data from PDFs, for example, it would extract all rent expenses from a Financial Statement and tabulate the numbers into an output table. Any ideas?

michaelmraz

Hey! Thank you so much for such a wonderful video. I have a question: what if we have different purchase orders in different formats? How can we get the specific information out of them using Python? I am doing a college year project and am unable to proceed.

alvin

Is it possible to extract only the text that is set in a red font from a PDF, using the font information?

academysolution

How can I convert the data from different tables in a scanned-image PDF into an Excel or CSV file?

umamaheswararaom

Excellent class. But how could I find words and select the entire sentence that contains them? Walter from Brazil

rivaltersilva
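A rough sketch for returning whole sentences: collapse the line breaks in the extracted text, split it after sentence-ending punctuation, and keep the sentences that contain the term. The splitting rule is a simplification (abbreviations such as "e.g." will break it), and the name `sentences_containing` is hypothetical.

```python
import re


def sentences_containing(term, text):
    """Return the sentences of text that contain term (case-insensitive)."""
    # PDF text extraction often breaks lines mid-sentence, so collapse whitespace.
    normalized = " ".join(text.split())
    # Split after '.', '!' or '?' when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", normalized)
    needle = term.lower()
    return [sentence for sentence in sentences if needle in sentence.lower()]
```

This is a plain substring match, so searching for "rent" would also return sentences that contain "parent".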

Hi. Thanks for a very helpful tutorial.
Would it be possible to search for several strings at the same time and get an output something along these lines:
Word A was found X times on pages x, y, z
Word B was found X times on pages x, y, z
And so on?

Also, on top of that, could one run this script on several PDF files at the same time to get an output along these lines:
Word A was found X times on pages x, y, z in document1
Word A was found X times on pages x, y, z in document2
Word B was found X times on pages x, y, z in document10

I'm a Python newbie, so apologies in advance if my questions are stupid.

mrreian
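A sketch of how the output format asked for above could be produced. It assumes the page texts of each document are already available as a list of strings, for example from `page.extract_text()` in PyPDF2 3.x. The function names are illustrative.

```python
def pages_per_term(terms, page_texts):
    """Map each term to (total occurrences, list of 1-based page numbers)."""
    report = {}
    for term in terms:
        needle = term.lower()
        total = 0
        pages = []
        for number, text in enumerate(page_texts, start=1):
            hits = text.lower().count(needle)
            if hits:
                total += hits
                pages.append(number)
        report[term] = (total, pages)
    return report


def format_report(report, document):
    """Turn a report into one human-readable line per term."""
    lines = []
    for term, (total, pages) in report.items():
        if not pages:
            lines.append(f"{term} was not found in {document}")
            continue
        page_list = ", ".join(str(page) for page in pages)
        lines.append(
            f"{term} was found {total} times on pages {page_list} in {document}"
        )
    return lines
```

To cover several files, the same two calls can be repeated in a loop over the PDF paths, passing each file name as `document`.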

Hi, may I know which Python version you are using in this video? I am using version 3.8, and, although I am not sure why, the extractText() function seems to be obsolete.

yck

Sir, thanks for the great service. Can you help me? I want to extract the data for each word from a PDF into Excel.

tedmac

Hi, this video is super helpful for understanding the process, thank you! However, when I run the code, I keep getting this exception: "PdfFileReader is deprecated and was removed in PyPDF2 3.0.0. Use PdfReader instead." So I changed PdfFileReader to PdfReader in the code, and then it said: "reader.getNumPages is deprecated and was removed in PyPDF2 3.0.0. Use len(reader.pages) instead." I'm a little confused about how to change the code from here, or what exactly to change to len(reader.pages), because substituting it into the existing code didn't work. Do you have any suggestions? Did PyPDF2 change?

feliciak
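PyPDF2 3.0.0 removed the old camelCase names, as the error messages quoted above say. The sketch below shows how a page-reading loop might look with the new names. The function `extract_page_texts` is hypothetical and not the code from the video, and each comment names the old call it replaces.

```python
def extract_page_texts(filename):
    """Return the extracted text of every page, using PyPDF2 3.x names."""
    from PyPDF2 import PdfReader  # replaces PdfFileReader

    reader = PdfReader(filename)
    num_pages = len(reader.pages)  # replaces reader.getNumPages()
    texts = []
    for index in range(num_pages):
        page = reader.pages[index]  # replaces reader.getPage(index)
        # replaces page.extractText(); fall back to "" if nothing is extracted
        texts.append(page.extract_text() or "")
    return texts
```

Note that `len(reader.pages)` is an expression without call parentheses of its own, so a line such as `reader.getNumPages()` becomes `len(reader.pages)` rather than `len(reader.pages)()`.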

Superb content Michael! Could you please remove the ")" from github-repo link?

juhaszat

How can I extract tables from PDF files into Excel?

walkwithus

How do I install pip for a virtual environment?

harishbollineni