Using Bayesian Generative Models with Apache Spark to Solve Entity Resolution Problems at Scale


As the volume of generated data grows exponentially in industries such as Healthcare, Insurance, and Financial Services, a common and challenging problem across these verticals is how to effectively and intelligently identify duplicate or similar profiles that belong to the same real-world entity but are represented in the organization’s datastore as distinct profiles. This can happen for many reasons, from companies being acquired or merging, to users creating multiple profiles, to streaming data arriving from different marketing campaign channels. Organizations often wish to identify and deduplicate such entries, or to match up two records in their datastore that are nearly identical (i.e. records that are fuzzy matches). This task presents an interesting challenge from the standpoint of computational complexity: with a very large dataset (more than roughly 10 million records), a brute-force pairwise comparison has quadratic complexity and is in most cases not feasible from a resource and time perspective. As such, different approaches have been developed over the years, including those that use (among others) regression, machine learning, and statistical sampling. In this talk, we will discuss how we have used the Bayesian statistical sampling approach at scale to match records, using a combination of KD-tree partitioning for efficient distribution of datasets across nodes in the Spark cluster, attribute similarity functions, and distributed computing on Spark.
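To make the complexity argument concrete, here is a minimal single-machine sketch in plain Python. It is not the speakers' Spark implementation, which the abstract does not show. It assumes hypothetical records with `id`, `name`, and numeric `birth_year` fields, uses a simple median-split partitioning in the spirit of a KD-tree, and uses `difflib.SequenceMatcher` as a stand-in attribute similarity function. All function names, field names, and the 0.85 threshold are illustrative assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations


def brute_force_pairs(n):
    """Number of unordered record pairs an all-pairs comparison must visit."""
    return n * (n - 1) // 2


def kd_partition(records, keys, leaf_size, depth=0):
    """Split records at the median of one numeric key per level (KD-tree style).

    Returns a list of leaf partitions, each holding at most leaf_size records.
    """
    if leaf_size < 1:
        raise ValueError("leaf_size must be at least 1")
    if len(records) <= leaf_size:
        return [records]
    key = keys[depth % len(keys)]  # cycle through the splitting keys
    ordered = sorted(records, key=lambda r: r[key])
    mid = len(ordered) // 2
    return (kd_partition(ordered[:mid], keys, leaf_size, depth + 1)
            + kd_partition(ordered[mid:], keys, leaf_size, depth + 1))


def name_similarity(a, b):
    """Stand-in attribute similarity: a score in [0, 1] from difflib."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def candidate_matches(partitions, threshold):
    """Compare records only within each partition, not across partitions."""
    matches = []
    for part in partitions:
        for left, right in combinations(part, 2):
            if name_similarity(left["name"], right["name"]) >= threshold:
                matches.append((left["id"], right["id"]))
    return matches


# Hypothetical toy data: records 1 and 2 are a fuzzy duplicate.
records = [
    {"id": 1, "name": "Jonathan Smith", "birth_year": 1980},
    {"id": 2, "name": "Jonathon Smith", "birth_year": 1980},
    {"id": 3, "name": "Maria Garcia", "birth_year": 1992},
    {"id": 4, "name": "Wei Chen", "birth_year": 1992},
]

partitions = kd_partition(records, keys=["birth_year"], leaf_size=2)
matches = candidate_matches(partitions, threshold=0.85)
compared = sum(brute_force_pairs(len(p)) for p in partitions)

# At 10 million records, an all-pairs scan needs 49,999,995,000,000 comparisons.
pairs_at_scale = brute_force_pairs(10_000_000)
```

On the toy data, partitioning cuts the work from 6 pairwise comparisons to 2. The trade-off is that true duplicates falling on opposite sides of a split are never compared, so the choice of splitting keys and leaf size matters. In a distributed setting each partition could be processed on a separate worker, which is presumably the role the abstract assigns to KD-tree partitioning on the Spark cluster.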

About:
Databricks provides a unified data analytics platform, powered by Apache Spark™, that accelerates innovation by unifying data science, engineering and business.


Comments

Hi @Databricks team, is it possible to get access to code or notebooks? I would like to test it in my research.

Sebaslv
