Convert Trapped Tables within PDFs to Pandas DataFrames

Comments

You said: "it's trial and error, until you get it right"
I think that's why "camelot" is better. You can get visual output (with matplotlib) so you don't need to guess iteratively.

ilianos

Do you think Tabula works for all text-based (generated) PDFs?

kompheakmom

Not sure how to choose among the many Python packages for extracting data from a PDF: PyMuPDF, PyPDF2, pdfplumber, tabula-py, etc.
For example, what if the PDF is a scan of a paper document, i.e. it's crooked and the quality is bad? Is there one that handles that best? Or maybe I should use AI (ChatGPT + GPT4Vision/Ai PDF) to do the OCR, then have it extract the data?

Also, any suggestions on how to get the values from specific columns in a text file? For example, I have text files with data like this:

#Time (HHH:MM:SS): 002:34:02
# T(ms) BUS CMD1 CMD2 FROM SA TO SA WC TXST RXST ERROR DT00 DT01 DT02 DT03 DT04 DT05 DT06 DT07
# === ==== ==== ==== == ==== == == ==== ==== ==== ==== ==== ==== ==== ==== ==== ====
816 B0 D84E BC RT27 2 14 D800 2100 0316 0000 0000 0000 0000 CCCD 0000
817 A0 DC50 RT27 2 BC 16 D800 2120 0000 4080 3000 0000 3000 0000 0000

#Time (HHH:MM:SS): 002:34:03
# T(ms) BUS CMD1 CMD2 FROM SA TO SA WC TXST RXST ERROR DT00 DT01 DT02 DT03 DT04 DT05 DT06 DT07
# === ==== ==== ==== == ==== == == ==== ==== ==== ==== ==== ==== ==== ==== ==== ====
056 B0 D84E BC RT27 2 14 D800 2100 0316 0000 0000 0000 0000 CCCD 0000
057 A0 DC50 RT27 2 BC 16 D800 2120 0000 4080 3000 0000 3000 0000 0000

How can I get just the data from DT00 through DT07 into an array, without doing lots of preprocessing to scrub out the repeating #Time headers that appear throughout the file?

bennguyen
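
One possible approach to the text-file question above, as a minimal sketch rather than a definitive answer: skip every line that starts with `#` (the #Time, column-name, and ruler lines) and keep the last eight whitespace-separated fields of each remaining row. It assumes that DT00 through DT07 are always the final eight fields of a data row, as in the pasted sample, and that the values are hexadecimal words; both are guesses from the sample, and `extract_data_words` is a hypothetical helper name.

```python
def extract_data_words(lines, n_words=8):
    """Return the last n_words fields of every data row as integers.

    Assumptions (not confirmed by the original comment):
    - lines starting with '#' are headers and can be ignored;
    - DT00..DT07 are the last n_words fields of each data row;
    - the fields are hexadecimal words such as 'CCCD'.
    """
    rows = []
    for line in lines:
        stripped = line.strip()
        # Skip blank lines and the repeating '#' header lines.
        if not stripped or stripped.startswith("#"):
            continue
        fields = stripped.split()
        # Ignore rows that are too short to hold all the data words.
        if len(fields) < n_words:
            continue
        rows.append([int(field, 16) for field in fields[-n_words:]])
    return rows


# Two rows copied from the sample in the comment above.
SAMPLE = """\
#Time (HHH:MM:SS): 002:34:02
816 B0 D84E BC RT27 2 14 D800 2100 0316 0000 0000 0000 0000 CCCD 0000
817 A0 DC50 RT27 2 BC 16 D800 2120 0000 4080 3000 0000 3000 0000 0000
"""

words = extract_data_words(SAMPLE.splitlines())
```

For a real file, the same function can be given an open file object, since it only iterates over lines. If some rows carry fewer than eight data words, the "last eight fields" assumption breaks and a fixed-width parse based on the header ruler would be safer.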

1) 0:49 CMD (as Admin): pip install tabula-py (Java installed previously)
2)

romniyepez

I get this error every time: AttributeError: module 'tabula' has no attribute 'read_pdf'

aarishqureshi
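
A common cause of this error, offered as a likely explanation rather than a confirmed diagnosis, is that `import tabula` picks up something other than tabula-py: either the unrelated `tabula` package from PyPI or a local script named `tabula.py`. The sketch below only reports which module was imported and whether it has `read_pdf`; `diagnose_tabula` is a hypothetical helper name.

```python
import importlib


def diagnose_tabula():
    """Report where 'tabula' is imported from and whether read_pdf exists."""
    try:
        module = importlib.import_module("tabula")
    except ImportError:
        return "tabula is not installed; try: pip install tabula-py"
    location = getattr(module, "__file__", None)
    if hasattr(module, "read_pdf"):
        return f"OK: read_pdf found in {location}"
    # No read_pdf: probably a different package or a local file shadowing it.
    return (
        f"Imported {location}, which has no read_pdf. "
        "Check for a local tabula.py, or run: "
        "pip uninstall tabula, then pip install tabula-py"
    )


report = diagnose_tabula()
print(report)
```

If the reported path points into your own project folder, renaming that file is usually enough.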
