Extract and Visualize Data from PDF Tables with PDFplumber in Python

By using PDFplumber, I was able to create a graph which shows the trend at the center of my article. I hope some of you can take something away from this walkthrough that will help you supplement your own reporting, especially if you're interested in data journalism.

I'm by no means an expert coder, very much a beginner, so if there are things I could have done better let me know. That being said, I hope this walkthrough proves that any journalist can use programming to enhance their work, so you should try it if you haven't already!

#python #walkthrough #journalism
Comments

This is amazing stuff. God bless you. Keep up the good work

virajmoghe

I'm watching your video from Madagascar. Great job, thank you!

ramarisonandry

Great video! Do you know if the extract tables functionality needs the tables to be ruled?

bxroberts
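
On the ruled-table question above: as far as I know, pdfplumber's table finder relies on drawn lines by default, but its table settings can be switched to infer cell edges from how the text lines up. A minimal sketch, assuming hypothetical helper names (`table_settings_for`, `extract_page_tables`) that are not from the video:

```python
def table_settings_for(ruled):
    """Return pdfplumber table settings for ruled or unruled tables."""
    if ruled:
        # Empty settings keep pdfplumber's defaults, which rely on drawn lines.
        return {}
    # "text" infers cell boundaries from how words align on the page.
    return {"vertical_strategy": "text", "horizontal_strategy": "text"}


def extract_page_tables(pdf_path, page_index=0, ruled=True):
    """Extract every table on one page as lists of rows of cell values."""
    import pdfplumber  # third-party: pip install pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        return page.extract_tables(table_settings=table_settings_for(ruled))
```

Calling `extract_page_tables("report.pdf", ruled=False)` (the file name is a placeholder) would try the text-based strategy. Results on unruled tables usually need checking by eye.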

If you are interested in PDF table extraction, give the "camelot" library a try. I found it superior to PDFplumber in terms of automatic table identification. It could detect bank statement tables that have no explicit lines and contain empty cells. Also, each resulting table already exposes a pandas DataFrame, so you can select and clean the data in the usual pandas way.

kw
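
The camelot suggestion above might look roughly like this. It is a sketch under assumptions: the helper names are made up for illustration, `flavor="stream"` is camelot's mode for tables separated by whitespace rather than lines, and `flavor="lattice"` is its mode for ruled tables.

```python
def flavor_for(ruled):
    """Pick the camelot parsing flavor for ruled or unruled tables."""
    return "lattice" if ruled else "stream"


def first_table_as_dataframe(pdf_path, page="1", ruled=False):
    """Return the first table found on the page as a pandas DataFrame, or None."""
    import camelot  # third-party: pip install camelot-py

    tables = camelot.read_pdf(pdf_path, pages=page, flavor=flavor_for(ruled))
    if len(tables) == 0:
        return None
    # Each camelot table carries its cells as a DataFrame in the .df attribute.
    return tables[0].df
```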

I'm not sure how to choose among the many Python packages for extracting data from a PDF: PyMuPDF, PyPDF2, PDFplumber, tabula-py, etc.
For example, what if the PDF is a scan of a paper document, i.e. it's crooked and the quality is bad? Is there one that does it best? Or maybe I should use AI (ChatGPT + GPT4Vision/Ai PDF) to do the OCR, then have it extract the data?

Also, any suggestions on how to get the values from specific columns in a text file? For example, I have a text file with data like this:

#Time (HHH:MM:SS): 002:34:02
# T(ms) BUS CMD1 CMD2 FROM SA TO SA WC TXST RXST ERROR DT00 DT01 DT02 DT03 DT04 DT05 DT06 DT07
# === ==== ==== ==== == ==== == == ==== ==== ==== ==== ==== ==== ==== ==== ==== ====
816 B0 D84E BC RT27 2 14 D800 2100 0316 0000 0000 0000 0000 CCCD 0000
817 A0 DC50 RT27 2 BC 16 D800 2120 0000 4080 3000 0000 3000 0000 0000

#Time (HHH:MM:SS): 002:34:03
# T(ms) BUS CMD1 CMD2 FROM SA TO SA WC TXST RXST ERROR DT00 DT01 DT02 DT03 DT04 DT05 DT06 DT07
# === ==== ==== ==== == ==== == == ==== ==== ==== ==== ==== ==== ==== ==== ==== ====
056 B0 D84E BC RT27 2 14 D800 2100 0316 0000 0000 0000 0000 CCCD 0000
057 A0 DC50 RT27 2 BC 16 D800 2120 0000 4080 3000 0000 3000 0000 0000


How can I get just the data from DT00 through DT07 into an array, without doing lots of preprocessing to scrub out the repeating #Time headers that appear throughout the file?

bennguyen
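
One possible approach to the text-file question above, sketched in plain Python: skip every line that starts with `#` (which covers the repeating #Time and column headers) and keep the last eight whitespace-separated fields of each remaining row. This assumes every data row carries all eight data words, as the rows shown do, and that the values are hexadecimal. The function name `read_data_words` is made up for illustration.

```python
def read_data_words(lines, n_words=8):
    """Collect the last n_words fields of every non-comment row as integers."""
    rows = []
    for line in lines:
        line = line.strip()
        # Blank lines and '#' lines (time stamps, column headers) carry no data.
        if not line or line.startswith("#"):
            continue
        tokens = line.split()
        if len(tokens) < n_words:
            continue  # too short to hold all data words
        # Assumption: the data words are hexadecimal (e.g. CCCD).
        rows.append([int(tok, 16) for tok in tokens[-n_words:]])
    return rows


sample = """\
#Time (HHH:MM:SS): 002:34:02
# T(ms) BUS CMD1 CMD2 FROM SA TO SA WC TXST RXST ERROR DT00 DT01 DT02 DT03 DT04 DT05 DT06 DT07
816 B0 D84E BC RT27 2 14 D800 2100 0316 0000 0000 0000 0000 CCCD 0000
817 A0 DC50 RT27 2 BC 16 D800 2120 0000 4080 3000 0000 3000 0000 0000

#Time (HHH:MM:SS): 002:34:03
056 B0 D84E BC RT27 2 14 D800 2100 0316 0000 0000 0000 0000 CCCD 0000
057 A0 DC50 RT27 2 BC 16 D800 2120 0000 4080 3000 0000 3000 0000 0000
"""

rows = read_data_words(sample.splitlines())
```

For a real file, passing the open file object works the same way, e.g. `read_data_words(open("bus_log.txt"))` with a placeholder file name. If some rows can carry fewer than eight data words, counting from the end of the row is no longer safe, and slicing by fixed column positions taken from the header line would be the more robust choice.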

I found that if you open a PDF file in Mozilla Firefox and inspect it with the developer tools, you can collect its text (with the help of JavaScript) after the browser has converted it to HTML, and maybe save it for further processing with some programming language.

gvenagas