Corpus Conversion Service: A machine learning platform to ingest documents at scale

Authors:
Peter W J Staar (IBM); Michele Dolfi (IBM); Christoph Auer (IBM); Costas Bekas (IBM)

Abstract:
PDF is by far the most prevalent document format today. There are roughly 2.5 trillion PDFs in circulation [1], such as scientific publications, manuals, reports, contracts and more. However, content encoded in PDF is by its nature reduced to streams of printing instructions purposed to faithfully present a visual layout. The task of automatic content reconstruction and conversion of PDF documents into structured data files has been an outstanding problem for over three decades [2, 3]. Here, we present a solution to the problem of document conversion, which at its core uses trainable, machine learning algorithms. The central idea is that we avoid heuristic or rule-based (RB) conversion algorithms, using instead generic machine learning (ML) algorithms, which produce models based on gathered ground-truth data. In this way, we eliminate the continuous tweaking of conversion rules and let the solution simply learn how to correctly convert documents by providing enough ground truth. This approach is in stark contrast to current state-of-the-art conversion systems (both open-source and proprietary), which are all RB.
While a machine learning approach might appear very natural in the current era of AI, it has serious consequences with regard to the design of such a solution. First, one should think at the level of a document collection (or a corpus of documents) as opposed to individual documents, since an ML model for a single document is not very useful. An ML model for a certain type of document (e.g. scientific articles, regulations, contracts, etc.) obviously is. Secondly, one needs efficient tools to gather ground truth via human annotation. These annotations can then be used to train the ML models. It is clear then that leveraging ML adds an extra level of complexity: one has to provide the ability to store a collection of documents, annotate these documents, store the annotations, train models and ultimately apply these models to unseen documents. For the authors of this paper, this implied that our solution cannot be a monolithic application. Rather, it was built as a cloud-based platform, which consists of micro-services that execute the previously mentioned tasks in an efficient and scalable way. We call this platform Corpus Conversion Service (CCS).
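
The corpus-level workflow listed in the abstract (store documents, annotate them, store the annotations, train a model, apply it to unseen documents) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the `Corpus` class, the `train` and `apply_model` functions, the `font_size` feature and the majority-vote "model" are hypothetical and are not the CCS API or the models the authors use.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field


@dataclass
class Corpus:
    """Hypothetical in-memory stand-in for a stored document collection."""
    documents: dict = field(default_factory=dict)    # doc_id -> list of cells
    annotations: dict = field(default_factory=dict)  # doc_id -> list of labels

    def store(self, doc_id, cells):
        """Store the parsed cells of one document."""
        self.documents[doc_id] = cells

    def annotate(self, doc_id, labels):
        """Store human ground truth: exactly one label per cell."""
        if len(labels) != len(self.documents[doc_id]):
            raise ValueError("one label per cell is required")
        self.annotations[doc_id] = labels


def train(corpus):
    """Toy 'model': for each font size, the label annotators chose most often."""
    counts = defaultdict(Counter)
    for doc_id, labels in corpus.annotations.items():
        for cell, label in zip(corpus.documents[doc_id], labels):
            counts[cell["font_size"]][label] += 1
    return {size: c.most_common(1)[0][0] for size, c in counts.items()}


def apply_model(model, cells, default="text"):
    """Label the cells of an unseen document; unknown font sizes get `default`."""
    return [model.get(cell["font_size"], default) for cell in cells]


corpus = Corpus()
corpus.store("doc-1", [
    {"text": "A Title", "font_size": 18},
    {"text": "Some body text.", "font_size": 10},
])
corpus.annotate("doc-1", ["title", "text"])

model = train(corpus)
unseen = [
    {"text": "Another heading", "font_size": 18},
    {"text": "A footnote", "font_size": 7},
]
predicted = apply_model(model, unseen)
print(predicted)
```

The point of the sketch is that the model is trained from the annotations of a whole collection rather than from hand-written rules, so improving conversion means adding ground truth instead of editing code. In a platform such as the one described, each of these steps would be a separate service rather than a function in one process.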

Comments

Great work! Looking forward to using this service that delivers the best of machine learning in challenging real-world applications, like knowledge discovery...

DionysiosDiamantopoulos
