Corpus Conversion Service: A machine learning platform to ingest documents at scale

Authors:
Peter W J Staar (IBM); Michele Dolfi (IBM); Christoph Auer (IBM); Costas Bekas (IBM)

Abstract:
PDF is by far the most prevalent document format today. There are roughly 2.5 trillion PDFs in circulation [1], such as scientific publications, manuals, reports, contracts and more. However, content encoded in PDF is by its nature reduced to streams of printing instructions designed to faithfully present a visual layout. The task of automatic content reconstruction and conversion of PDF documents into structured data files has been an outstanding problem for over three decades [2, 3]. Here, we present a solution to the problem of document conversion, which at its core uses trainable machine learning algorithms. The central idea is that we avoid heuristic or rule-based (RB) conversion algorithms, using instead generic machine learning (ML) algorithms, which produce models based on gathered ground-truth data. In this way, we eliminate the continuous tweaking of conversion rules and let the solution simply learn how to correctly convert documents by providing enough ground truth. This approach is in stark contrast to current state-of-the-art conversion systems (both open-source and proprietary), which are all RB.

While a machine learning approach might appear very natural in the current era of AI, it has serious consequences with regard to the design of such a solution. First, one should think at the level of a document collection (or a corpus of documents) as opposed to individual documents, since an ML model for a single document is not very useful. An ML model for a certain type of document (e.g. scientific articles, regulations, contracts, etc.) obviously is. Secondly, one needs efficient tools to gather ground truth via human annotation. These annotations can then be used to train the ML models. It is clear then that leveraging ML adds an extra level of complexity: one has to provide the ability to store a collection of documents, annotate these documents, store the annotations, train models and ultimately apply these models to unseen documents. For the authors of this paper, this implied that our solution cannot be a monolithic application. Rather, it was built as a cloud-based platform, which consists of micro-services that execute the previously mentioned tasks in an efficient and scalable way. We call this platform Corpus Conversion Service (CCS).
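The workflow listed in the abstract (store documents, annotate them, store the annotations, train a model, apply it to unseen documents) can be illustrated with a small in-memory sketch. This is not the CCS API: the `Corpus` class, the `features`, `train` and `predict` functions, and the toy majority-vote "model" are all hypothetical stand-ins chosen for brevity, and the real platform distributes these steps across micro-services and uses proper ML models.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field


@dataclass
class Corpus:
    """Hypothetical in-memory stand-in for the document and annotation stores."""
    documents: dict = field(default_factory=dict)    # doc_id -> list of text cells
    annotations: dict = field(default_factory=dict)  # (doc_id, cell_index) -> label

    def store(self, doc_id, cells):
        self.documents[doc_id] = list(cells)

    def annotate(self, doc_id, cell_index, label):
        self.annotations[(doc_id, cell_index)] = label


def features(cell):
    """Toy feature: a coarse length bucket and whether the cell is all upper-case."""
    return (min(len(cell) // 20, 3), cell.isupper())


def train(corpus):
    """Learn the most frequent label per feature value from the ground truth."""
    counts = defaultdict(Counter)
    for (doc_id, idx), label in corpus.annotations.items():
        counts[features(corpus.documents[doc_id][idx])][label] += 1
    model = {feat: c.most_common(1)[0][0] for feat, c in counts.items()}
    # Fallback label for feature values never seen during training.
    fallback = Counter(corpus.annotations.values()).most_common(1)[0][0]
    return model, fallback


def predict(trained, cells):
    """Apply the trained model to the cells of an unseen document."""
    model, fallback = trained
    return [model.get(features(cell), fallback) for cell in cells]


corpus = Corpus()
corpus.store("doc-1", ["INTRODUCTION",
                       "PDF is by far the most prevalent document format today."])
corpus.annotate("doc-1", 0, "heading")
corpus.annotate("doc-1", 1, "paragraph")

trained = train(corpus)
labels = predict(trained, ["METHODS",
                           "We avoid rule-based conversion algorithms here."])
print(labels)
```

The point of the sketch is the shape of the loop rather than the model: labels come only from annotated ground truth, so improving the output means adding annotations and retraining, not editing conversion rules.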

Comments

Great work! Looking forward to using this service that delivers the best of machine learning in challenging real-world applications, like knowledge discovery...

DionysiosDiamantopoulos