Dimiter Naydenov - Extracting Tabular Data from PDFs with Camelot and Excalibur

Extracting Tabular Data from PDFs with Camelot and Excalibur
[EuroPython 2019 - Talk - 2019-07-10 - Osaka / Samarkand - PyData track]
[Basel, CH]

By Dimiter Naydenov

Portable Document Format (PDF) is commonly used to produce, publish, exchange, and
archive business and academic documents alike. Often in such PDFs there are tables
with data that you want to extract and process in some automated fashion. Unlike HTML
or other formats, PDF has no concept of tables as rows and columns with related data.

Tables in PDFs are rendered to visually resemble a table (when printed) using low-level
instructions to place the text of each table cell where it should be, while the original
tabular structure is lost.
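To make the point concrete, here is a toy sketch of what "lost structure" means. It assumes a page is nothing more than a list of text fragments with coordinates (the `fragments` data and the `rebuild_rows` helper are hypothetical, and this is not how Camelot itself works internally). Rows have to be guessed by grouping fragments with similar vertical positions.

```python
# Hypothetical positioned text fragments, as a PDF page might place them:
# (x, y, text). There is no row or column information, only coordinates,
# and the fragments need not appear in reading order.
fragments = [
    (200.0, 700.0, "Price"),
    (72.0, 700.0, "Item"),
    (72.0, 680.0, "Apple"),
    (200.0, 680.0, "1.20"),
    (200.0, 660.0, "0.50"),
    (72.0, 660.0, "Pear"),
]


def rebuild_rows(fragments, y_tolerance=2.0):
    """Group fragments into rows by similar y, then order each row by x."""
    rows = []  # list of (row_y, [(x, text), ...])
    # PDF coordinates grow upwards, so a larger y is nearer the top of the page.
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append((y, [(x, text)]))
    return [[text for _, text in sorted(cells)] for _, cells in rows]


table = rebuild_rows(fragments)
for row in table:
    print(row)
```

Real documents are far messier (merged cells, wrapped text, uneven spacing), which is why dedicated tools are needed.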

While there are existing solutions to extract structured data from PDFs, most of them
are expensive proprietary products or hosted online services. They are not Python-based
or open-source, and they give you little control over the process or over how your
sensitive PDF documents are handled.

In this talk I'll present two open-source Python tools for PDF table extraction: the
CLI tool Camelot and its web-based frontend UI, Excalibur. I'll show you how to
install both locally, and how to use them to extract tabular data from PDFs with ease.

Extraction under your control: 1) define rules with areas on the PDF page containing
the table you want to extract; 2) save and reuse the rules to automate / batch-process
similar PDFs; 3) export the extracted tables as CSV, Excel, JSON, HTML, or use directly
as pandas DataFrames.
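The three steps above can be sketched with Camelot's Python API. This is a minimal sketch, not the workflow shown in the talk: the `RULE` dictionary, its coordinates, and the helper names `extract_tables` and `batch_extract` are hypothetical, and the snippet assumes `camelot-py` is installed and that every PDF in the folder shares the same layout. The `pages` value limits reading to the listed pages, and `table_areas` takes `"x1,y1,x2,y2"` strings (top-left and bottom-right corners in PDF coordinates).

```python
from pathlib import Path

# Hypothetical rule: which page to read, which parser flavor to use, and
# the page area holding the table. The coordinates are made up.
RULE = {
    "pages": "1",
    "flavor": "stream",
    "table_areas": ["50,750,550,400"],
}


def extract_tables(pdf_path, rule=RULE):
    """Apply one saved rule to one PDF and return Camelot's table list."""
    import camelot  # imported lazily, so it is only needed when a PDF is read

    return camelot.read_pdf(str(pdf_path), **rule)


def batch_extract(pdf_dir, out_dir, rule=RULE):
    """Reuse the same rule for every PDF in pdf_dir and export CSV files."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    frames = {}
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        tables = extract_tables(pdf_path, rule)
        # Camelot writes one CSV per table, deriving names from this path.
        tables.export(str(out_dir / f"{pdf_path.stem}.csv"), f="csv")
        # Each extracted table is also available as a pandas DataFrame.
        frames[pdf_path.name] = [table.df for table in tables]
    return frames
```

Calling `batch_extract("invoices", "csv_out")` on a hypothetical `invoices` folder would process each PDF in turn. Passing `f="json"`, `f="excel"`, or `f="html"` to `export` selects the other output formats listed above.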

If you find Camelot and Excalibur useful, please consider supporting those projects,
or even getting involved as a contributor!

Comments

How can I automate this? That is, how can I extract the tables from multiple PDF files in a single program?

veenahb

Tabula currently works better. This thing pretty much crashes when given a large file... it just goes off and never comes back.

massivefins

What if I need to extract tables from a DOC file instead of a PDF using this? Please help with this.

venkateswaraotella

Hi,
I have a large PDF that Camelot takes a long time to read.
Is it possible to read one page at a time?
Thanks

hayathbasha

I cannot install the Camelot package with Anaconda.

nipunika