Tim Allison – What's new in Apache Tika 2.0 -- we mean it this time!

Apache Tika is used in big data document processing pipelines to extract text and metadata from numerous file formats. Text extraction is a critical component for search systems. While work on 2.0 has been ongoing for years, the Tika team released 2.0.0-ALPHA in January and will release 2.0.0 before Buzzwords 2021. In addition to dramatically increased modularization, there are new components to improve scaling, integration, and robustness. This talk will offer an overview of the changes in Tika 2.0, with a deep dive on the new tika-pipes module, which enables synchronous and asynchronous fetching from numerous data sources (JDBC, fileshare, S3), parsing, and then emitting to other endpoints (fileshare, S3, Solr, Elasticsearch, etc.).

Speaker: Tim Allison

Comments

Absolutely ruined the Tika Server; sending URLs isn't working anymore.
Can't use it anymore. Documentation is a mess.
Too complicated for a non-Java user to get involved.

greendsnow