Text parsing and matching with HPC resources

This talk provides a brief introduction to some of the core concepts of analyzing text using computational tools. We demonstrate how standard calculations can be scaled to work on very large data sets through simple parallelization strategies that are easy to deploy in an HPC environment using job arrays.
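As a rough sketch of the job-array pattern (not code from the talk itself), the example below assumes a SLURM scheduler, which exposes each task's index through the SLURM_ARRAY_TASK_ID environment variable; the chunked input and output file names are illustrative.

```python
import os
import pandas as pd

# Hypothetical job-array worker: each array task reads the chunk of records
# assigned to it, runs the same cleaning/matching pipeline, and writes its own
# output file. SLURM_ARRAY_TASK_ID is set by the scheduler; the file names
# (chunk_000.csv, result_000.csv, ...) are illustrative.
task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", 0))

records = pd.read_csv(f"chunk_{task_id:03d}.csv")

# ... apply the text-cleaning and matching steps to this chunk only ...

records.to_csv(f"result_{task_id:03d}.csv", index=False)
```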

These ideas are illustrated by a concrete example implemented in Python using the pandas, re, and nltk libraries. The example comes from social science research, where multiple data sets refer to the same individuals and must be merged while accounting for variations in how individuals are named or described. To illustrate a typical solution, we demonstrate three key steps (each sketched in the code examples below):
- text parsing and cleaning with data frames and regular expressions
- a parallelization strategy based on blocking keys
- approximate text matching with string similarity measures, reduced to a well-defined machine learning problem
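A minimal cleaning sketch for the first step, using a hypothetical "name" column: lowercase the text, strip punctuation, and collapse whitespace so that spelling variants normalize to the same string.

```python
import re
import pandas as pd

# Hypothetical input: name strings that refer to the same person in slightly
# different forms. Normalization makes "Smith,  John A." and "SMITH John A"
# compare equal after cleaning.
df = pd.DataFrame({"name": ["Smith,  John A.", "SMITH John A", "Doe, Jane"]})

def normalize(text: str) -> str:
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)   # replace punctuation with spaces
    text = re.sub(r"\s+", " ", text)       # collapse repeated whitespace
    return text.strip()

df["name_clean"] = df["name"].apply(normalize)
print(df)
```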
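For the second step, a blocking-key sketch on hypothetical, already-cleaned data: a cheap key partitions both data sets so that only records sharing a key are ever compared, and each key's block can be assigned to its own job-array task. The first-character key used here is purely illustrative.

```python
import pandas as pd

# Hypothetical cleaned data sets to be linked.
df_a = pd.DataFrame({"name_clean": ["smith john a", "doe jane", "smythe jon"]})
df_b = pd.DataFrame({"name_clean": ["smith jon a", "dow jane"]})

def blocking_key(name: str) -> str:
    return name[:1]  # illustrative key; phonetic codes such as Soundex are common in practice

# Group the second data set by blocking key once, then walk the first set's blocks.
blocks_b = dict(tuple(df_b.groupby(df_b["name_clean"].map(blocking_key))))

for key, block_a in df_a.groupby(df_a["name_clean"].map(blocking_key)):
    block_b = blocks_b.get(key)
    if block_b is None:
        continue  # no candidate records share this key
    print(f"block {key!r}: {len(block_a)} x {len(block_b)} candidate pairs")
```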
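For the third step, a matching sketch on hypothetical candidate pairs: score each pair with a string-similarity measure (a length-normalized edit distance here, via nltk), and treat the scores as features for a binary classifier that decides whether the pair refers to the same individual, which reduces the linkage task to supervised machine learning.

```python
from nltk.metrics.distance import edit_distance

# Length-normalized edit distance: 1.0 means identical strings, 0.0 means
# completely different strings of equal length.
def similarity(a: str, b: str) -> float:
    return 1.0 - edit_distance(a, b) / max(len(a), len(b), 1)

# Hypothetical candidate pairs drawn from the same blocking key.
pairs = [
    ("smith john a", "smith jon a"),
    ("doe jane", "dow jane"),
    ("doe jane", "smythe jon"),
]

for a, b in pairs:
    print(f"{a!r} vs {b!r}: similarity = {similarity(a, b):.2f}")
```

In practice several similarity measures per pair can be combined into a feature vector and fed to a classifier trained on labelled match/non-match examples.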

This problem and its solution are representative of a very large class of data analysis problems that involve text comparison. We close by pointing out extensions of the presented solution that apply the same overall strategy to more complex text analysis problems.

To view / download the slides from this presentation, visit:

For information on other WestGrid events, visit:

Other ways to connect with WestGrid:
Twitter - @WestGrid