DocEng 2011: An Efficient Language-Independent Method to Extract Content from News Webpages

The 11th ACM Symposium on Document Engineering
Mountain View, California, USA
September 19-22, 2011

An Efficient Language-Independent Method to Extract Content from News Webpages
Eduardo Teixeira Cardoso, Iam Jabour, Eduardo Laber, Rogério Ferreira Rodrigues, Pedro Lazéra Cardoso
Presented by Eduardo Teixeira Cardoso.

ABSTRACT

We tackle the task of news webpage segmentation, specifically identifying the news title, publication date and story body. While there are very good results in the literature, most of them rely on rendering the webpage, which is a very time-consuming step. We focus on scenarios with a high volume of documents, where processing speed is essential. The chosen approach extends our previous work in the area, combining structural properties with hints of visual presentation styles, computed by a method faster than regular rendering, and machine learning algorithms. In our experiments, we paid special attention to some aspects that are often overlooked in the literature, such as processing time and the generalization of the extraction results to unseen domains. Our approach proved to be about an order of magnitude faster than an equivalent full-rendering alternative while retaining good extraction quality.
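
To make the idea of "structural properties with hints of visual presentation styles" concrete, here is a minimal sketch of how such features might be gathered from raw HTML in a single parsing pass, without rendering. It is not the authors' implementation: the class `BlockFeatureParser`, the function `extract_blocks`, the feature names and the tag sets are all assumptions chosen for illustration, and the machine learning step that would consume these features is left out.

```python
from html.parser import HTMLParser

# Assumption: tags whose text is never part of the visible story.
SKIP_TAGS = {"script", "style"}
# Void elements have no closing tag, so they are never pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}
# Assumption: tags treated as cheap stand-ins for visual presentation.
HEADING_TAGS = {"h1", "h2", "h3"}
EMPHASIS_TAGS = {"b", "strong", "em", "i"}


class BlockFeatureParser(HTMLParser):
    """Collects one feature record per non-empty text node, without rendering."""

    def __init__(self):
        super().__init__()
        self.stack = []   # open tags from the root down to the current node
        self.blocks = []  # one feature dict per text node

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags in between.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text or any(t in SKIP_TAGS for t in self.stack):
            return
        self.blocks.append({
            "text": text,
            # Structural properties (hypothetical feature names).
            "depth": len(self.stack),
            "words": len(text.split()),
            "in_link": "a" in self.stack,
            # Presentation hints inferred from tags alone, not from layout.
            "in_heading": any(t in HEADING_TAGS for t in self.stack),
            "emphasised": any(t in EMPHASIS_TAGS for t in self.stack),
        })


def extract_blocks(html):
    """Parse an HTML string and return the list of per-block feature dicts."""
    parser = BlockFeatureParser()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

Each record could then be labeled as title, date, body or other and passed to any standard classifier. The sketch only illustrates why a single pass over the markup is cheaper than full rendering, since no layout, stylesheet resolution or script execution is involved.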