Extract and Visualize Data from PDF Tables with PDFplumber in Python


By using PDFplumber, I was able to create a graph which shows the trend at the center of my article. I hope some of you can take something away from this walkthrough that will help you supplement your own reporting, especially if you're interested in data journalism.

I'm by no means an expert coder (very much a beginner), so if there are things I could have done better, let me know. That being said, I hope this walkthrough proves that any journalist can use programming to enhance their work, so you should try it if you haven't already!

#python #walkthrough #journalism


Comments

This is amazing stuff. God bless you. Keep up the good work

virajmoghe

I'm watching your video from Madagascar. Great job, thank you!

ramarisonandry

Great video! Do you know if the extract tables functionality needs the tables to be ruled?

bxroberts
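On the ruled-tables question: as far as I know, pdfplumber's table finder defaults to the "lines" strategy, which relies on drawn rules, but it also accepts a "text" strategy that infers cell edges from how the words line up. Here is a minimal sketch, not the video author's code. The names `TEXT_BASED_SETTINGS`, `drop_empty_rows` and `extract_unruled_tables` are made up for illustration, and the results on unruled tables can be rough and usually need tuning.

```python
# Settings that ask pdfplumber to infer cell boundaries from text alignment
# instead of drawn lines (the default strategy is "lines").
TEXT_BASED_SETTINGS = {
    "vertical_strategy": "text",
    "horizontal_strategy": "text",
}


def drop_empty_rows(table):
    """Remove rows whose cells are all None or blank (hypothetical helper)."""
    return [row for row in table if any(cell and cell.strip() for cell in row)]


def extract_unruled_tables(pdf_path, page_index=0):
    """Return the tables found on one page, each as a list of rows.

    pdf_path and page_index are placeholders; point them at your own file.
    """
    import pdfplumber  # third-party package, imported lazily

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        tables = page.extract_tables(table_settings=TEXT_BASED_SETTINGS)
    return [drop_empty_rows(table) for table in tables]
```

The text strategy tends to produce extra blank rows and columns, which is why the sketch filters out empty rows before returning.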

If you are interested in PDF table extraction, give the "camelot" library a try. I found it superior to PDFplumber in terms of automatic table identification. It could detect bank statement tables without explicit lines and with empty cells. Also, the resulting object is already a pandas DataFrame, so you can select and clean the data in the usual pandas way.

kw

Not sure how to choose from the many Python packages for extracting data from a PDF: PyMuPDF, PyPDF2, PDFplumber, tabula-py, etc.
For example, what if the PDF is a scan of a paper document, i.e. it's crooked and the quality is bad? Is there one that handles that best? Or maybe I should use AI (ChatGPT + GPT4Vision/Ai PDF) to do OCR and then have it extract the data?

Also, any suggestions on how to get the values from specific columns in a text file? For example, I have a text file with data like this:

#Time (HHH:MM:SS): 002:34:02
# T(ms) BUS CMD1 CMD2 FROM SA TO SA WC TXST RXST ERROR DT00 DT01 DT02 DT03 DT04 DT05 DT06 DT07
# === ==== ==== ==== == ==== == == ==== ==== ==== ==== ==== ==== ==== ==== ==== ====
816 B0 D84E BC RT27 2 14 D800 2100 0316 0000 0000 0000 0000 CCCD 0000
817 A0 DC50 RT27 2 BC 16 D800 2120 0000 4080 3000 0000 3000 0000 0000

#Time (HHH:MM:SS): 002:34:03
# T(ms) BUS CMD1 CMD2 FROM SA TO SA WC TXST RXST ERROR DT00 DT01 DT02 DT03 DT04 DT05 DT06 DT07
# === ==== ==== ==== == ==== == == ==== ==== ==== ==== ==== ==== ==== ==== ==== ====
056 B0 D84E BC RT27 2 14 D800 2100 0316 0000 0000 0000 0000 CCCD 0000
057 A0 DC50 RT27 2 BC 16 D800 2120 0000 4080 3000 0000 3000 0000 0000

How can I get just the data from DT00 through DT07 into an array, without doing lots of preprocessing to scrub out the repeating #Time headers that appear throughout the file?

bennguyen
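For the text-file question above, one possible approach is to skip every line that starts with "#" and keep the last eight fields of each remaining line. This is only a sketch under two assumptions that the sample suggests but does not confirm: DT00 through DT07 are always the final eight whitespace-separated fields, and the words are hexadecimal. The function name `extract_data_words` is made up for illustration.

```python
def extract_data_words(lines, n_words=8):
    """Collect the trailing data words from each non-header line.

    Assumes DT00..DT07 are the last `n_words` whitespace-separated fields
    and that each word is hexadecimal (both are guesses from the sample).
    """
    rows = []
    for line in lines:
        stripped = line.strip()
        # Skip blank lines and the repeating "#Time" / column header lines.
        if not stripped or stripped.startswith("#"):
            continue
        fields = stripped.split()
        if len(fields) < n_words:
            continue  # too short to hold a full set of data words
        rows.append([int(word, 16) for word in fields[-n_words:]])
    return rows


sample = """\
#Time (HHH:MM:SS): 002:34:02
# T(ms) BUS CMD1 CMD2 FROM SA TO SA WC TXST RXST ERROR DT00 DT01 DT02 DT03 DT04 DT05 DT06 DT07
# === ==== ==== ==== == ==== == == ==== ==== ==== ==== ==== ==== ==== ==== ==== ====
816 B0 D84E BC RT27 2 14 D800 2100 0316 0000 0000 0000 0000 CCCD 0000
817 A0 DC50 RT27 2 BC 16 D800 2120 0000 4080 3000 0000 3000 0000 0000

#Time (HHH:MM:SS): 002:34:03
056 B0 D84E BC RT27 2 14 D800 2100 0316 0000 0000 0000 0000 CCCD 0000
057 A0 DC50 RT27 2 BC 16 D800 2120 0000 4080 3000 0000 3000 0000 0000
"""

rows = extract_data_words(sample.splitlines())
```

An open file object can be passed in place of `sample.splitlines()`, since the function only iterates over lines. If the real file is fixed-width with blank columns, slicing by character position would be more reliable than splitting on whitespace.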

I found that by opening a PDF file in Mozilla Firefox and inspecting it with the developer tools, you can collect its text (with the help of JavaScript) after the browser has converted it to HTML, and then save it for further processing in some programming language.

gvenagas
