Dimiter Naydenov - Extracting Tabular Data from PDFs with Camelot and Excalibur

Extracting Tabular Data from PDFs with Camelot and Excalibur
[EuroPython 2019 - Talk - 2019-07-10 - Osaka / Samarkand [PyData track]]
[Basel, CH]

By Dimiter Naydenov

Portable Document Format (PDF) is commonly used to produce, publish, exchange, and
archive business and academic documents alike. Often in such PDFs there are tables
with data that you want to extract and process in some automated fashion. Unlike HTML
or other formats, PDF has no concept of tables as rows and columns with related data.

Tables in PDFs are rendered to visually resemble a table (when printed) using low-level
instructions to place the text of each table cell where it should be, while the original
tabular structure is lost.
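
To make the problem concrete, here is a minimal sketch of what recovering a table from positioned text involves. It is not how Camelot works internally; the fragment list, the `rebuild_rows` helper, and the `y_tolerance` parameter are all hypothetical, and the coordinates are made up for illustration.

```python
from collections import defaultdict

# Hypothetical positioned text fragments: (x, y, text), with y growing
# upward as in PDF coordinate space. Note there are no rows or columns
# here, only loose pieces of text with positions.
fragments = [
    (200, 680, "3"), (50, 700, "Name"), (50, 660, "Pears"),
    (200, 700, "Qty"), (50, 680, "Apples"), (200, 660, "7"),
]


def rebuild_rows(fragments, y_tolerance=2):
    """Group fragments into rows by similar y, then order each row by x."""
    rows = defaultdict(list)
    for x, y, text in fragments:
        # Snap y to a coarse bucket so slightly misaligned cells share a row.
        # This is naive: values near a bucket boundary can still be split.
        key = round(y / y_tolerance)
        rows[key].append((x, text))
    # A higher y is nearer the top of the page, so sort buckets descending.
    return [
        [text for _, text in sorted(cells)]
        for _, cells in sorted(rows.items(), reverse=True)
    ]


table = rebuild_rows(fragments)
```

Even this toy version has to guess a tolerance for what counts as "the same row"; real documents add merged cells, multi-line cells, and missing ruling lines, which is why dedicated tools are useful.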

While there are existing solutions to extract structured data from PDFs, most of them
are expensive proprietary products or hosted online services. They are neither
Python-based nor open-source, and they give you little control over the process or
over how your sensitive PDF documents are handled.

In this talk I'll present two open-source Python tools for PDF table extraction: the
CLI tool Camelot and its web-based frontend UI, Excalibur. I'll show you how to
install both locally, and how to use them to extract tabular data from PDFs with ease.

Extraction under your control: 1) define rules with areas on the PDF page containing
the table you want to extract; 2) save and reuse the rules to automate or batch-process
similar PDFs; 3) export the extracted tables as CSV, Excel, JSON, HTML, or use them
directly as pandas DataFrames.
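
The three steps above can be sketched with Camelot's Python API. This is a rough illustration under stated assumptions, not the workflow shown in the talk: the `rule` dictionary layout and the helpers `area_to_string`, `rule_to_kwargs`, and `extract_with_rule` are hypothetical, as are the coordinates. It does not reproduce Excalibur's own rule format. The calls `camelot.read_pdf` (with the `pages`, `flavor`, and `table_areas` arguments), `tables.export`, and the `.df` attribute are part of Camelot.

```python
import json


def area_to_string(x1, y1, x2, y2):
    """Format an area as the "x1,y1,x2,y2" string Camelot expects.

    (x1, y1) is the top-left and (x2, y2) the bottom-right corner,
    in PDF coordinate space.
    """
    return f"{x1},{y1},{x2},{y2}"


# 1) A hypothetical rule: which page, which parser, which page area.
rule = {"pages": "1", "flavor": "stream", "areas": [(50, 720, 550, 400)]}

# 2) Rules are plain data, so they can be stored as JSON and reused later.
saved_rule = json.dumps(rule)
reloaded_rule = json.loads(saved_rule)


def rule_to_kwargs(rule):
    """Translate a hypothetical rule into keyword arguments for read_pdf."""
    return {
        "pages": rule["pages"],
        "flavor": rule["flavor"],
        "table_areas": [area_to_string(*area) for area in rule["areas"]],
    }


def extract_with_rule(pdf_path, rule):
    """Apply one rule to one PDF and return Camelot's list of tables."""
    # Imported here so the helpers above work without Camelot installed.
    import camelot

    return camelot.read_pdf(pdf_path, **rule_to_kwargs(rule))
```

With Camelot installed and a suitable file at hand (the name `report.pdf` is a placeholder), `tables = extract_with_rule("report.pdf", reloaded_rule)` returns the extracted tables; 3) `tables.export("report.csv", f="csv")` writes them out, and `tables[0].df` gives the first one as a pandas DataFrame. Looping over several similar PDFs with the same rule covers the batch-processing case.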

If you find Camelot and Excalibur useful, please consider supporting those projects,
or even get involved as a contributor!