Using Bayesian Generative Models with Apache Spark to Solve Entity Resolution Problems at Scale

preview_player
Показать описание
As the size of data generated grows exponentially in different industries such as Healthcare, Insurance, Financial Services, etc. A common challenging problem faced across this industry verticals is how to effectively or intelligently identify duplicate or similar entity profiles that may belong to the same entity in real life, but represented in the organization’s datastore as different unique profiles. This could happen due to many reasons, from companies getting acquired or merging, to users creating multiple profiles or streaming data coming in from different marketing campaign channels. Organizations often wish to identify and deduplicate such entries or match up two records present in their datastore that are nearly identical (i.e. records that are fuzzy matches). This task presents an interesting challenge from the standpoint of computational complexity – with a very large dataset (greater than ~10 million) doing a brute force element-wise comparison will result in a quadratic complexity and is clearly not feasible from a resource and time perspective in most cases. As such, different approaches have been developed over the years including those that utilize (among others) regressions, machine learning, and statistical sampling. In this talk, we will discuss how we have used the Bayesian statistical sampling approach at scale to match records using a combination of KD-tree partitioning for efficient distribution of datasets across nodes in the Spark cluster, attribute similarity functions, and distributed computing on Spark.

About:
Databricks provides a unified data analytics platform, powered by Apache Spark™, that accelerates innovation by unifying data science, engineering and business.
Connect with us:
Рекомендации по теме
Комментарии
Автор

Hi @Databricks team, is it possible to get access to code or notebooks? I would like to test it in my research.

Sebaslv