Rousillon: Scraping Distributed Hierarchical Web Data

Sarah E. Chasins, Maria Mueller, Rastislav Bodik

UIST '18: ACM User Interface Software and Technology Symposium
Session: Web

Abstract
Programming by Demonstration (PBD) promises to enable data scientists to collect web data. However, in formative interviews with social scientists, we learned that current PBD tools are insufficient for many real-world web scraping tasks. The missing piece is the capability to collect hierarchically structured data from across many different webpages. We present Rousillon, a programming system for writing complex web automation scripts by demonstration. Users demonstrate how to collect the first row of a 'universal table' view of a hierarchical dataset to teach Rousillon how to collect all rows. To offer this new demonstration model, we developed novel relation selection and generalization algorithms. In a within-subject user study with 15 computer scientists, users wrote hierarchical web scrapers 8 times more quickly with Rousillon than with traditional programming.
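To make the 'universal table' idea concrete, here is a minimal Python sketch of how a hierarchical dataset can be flattened into one row per leaf item, with the parent's fields repeated on each row. It is only an illustration of the term, not Rousillon's implementation: the sample data, the field names, and the `universal_table` function are all assumptions invented for this example.

```python
from typing import Iterator

# Hypothetical hierarchical dataset (not from the paper): each author
# has a nested list of papers, as if scraped from linked webpages.
authors = [
    {"author": "A. Author", "papers": [
        {"title": "Paper 1", "year": 2016},
        {"title": "Paper 2", "year": 2018},
    ]},
    {"author": "B. Author", "papers": [
        {"title": "Paper 3", "year": 2017},
    ]},
]


def universal_table(data: list[dict]) -> Iterator[tuple[str, str, int]]:
    """Yield one flat row per (author, paper) pair.

    Each nested loop corresponds to one level of the hierarchy; the
    outer level's fields are repeated on every row of the inner level.
    """
    for author in data:
        for paper in author["papers"]:
            yield (author["author"], paper["title"], paper["year"])


rows = list(universal_table(authors))
for row in rows:
    print(row)
```

In this view, the first row is the single example a user would demonstrate; the nested loops stand in for the generalization step that produces all remaining rows.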
