PII Detection at Scale on the Lakehouse

SEEK is Australia’s largest online employment marketplace and a market leader spanning ten countries across Asia Pacific and Latin America. SEEK provides employment opportunities for roughly 16 million monthly active users and processes 25 million candidate applications to listings. Processing millions of resumes means handling and managing highly sensitive candidate information, usually entered in a highly unstructured format. With recent high-profile data leaks in Australia, protection of personally identifiable information (PII) has become a major focus area for large digital organizations.

The first step is detection, and SEEK has developed a custom framework built on HuggingFace transformers fine-tuned to capture nuances around employment. For example, “Software Engineer at Databricks” is not PII, but “CEO at Databricks” is PII, since a company typically has only one CEO and the title therefore points to a specific person. After identifying and anonymizing PII in stream and batch data, SEEK uses Unity Catalog’s data lineage to track PII through their reporting, ETL, and other downstream ML use cases and to govern access control, achieving an organization-wide data management capability driven by deep learning and enforced using Databricks.
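As a rough illustration of the anonymization step that follows detection, here is a minimal sketch in Python. It is not SEEK's framework: the detector is replaced by hand-built spans, the helper names `span` and `redact` are hypothetical, and the labels `NAME`, `JOB_TITLE`, and `EMAIL` are invented for the example. The span dictionaries only mirror the general shape of aggregated HuggingFace token-classification output (`entity_group`, `start`, `end`), and the spans are assumed not to overlap.

```python
from typing import Dict, List


def span(text: str, substring: str, label: str) -> Dict:
    """Stand-in for one model prediction: locate `substring` and label it."""
    start = text.index(substring)
    return {"entity_group": label, "start": start, "end": start + len(substring)}


def redact(text: str, entities: List[Dict]) -> str:
    """Replace each detected span with a [LABEL] placeholder."""
    # Go right to left so earlier offsets stay valid after each replacement.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[: ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"]:]
    return text


sample = "Jane Doe, CEO at Databricks, jane@example.com"

# Hypothetical detections; a fine-tuned model would produce these in practice.
detected = [
    span(sample, "Jane Doe", "NAME"),
    span(sample, "CEO at Databricks", "JOB_TITLE"),
    span(sample, "jane@example.com", "EMAIL"),
]

print(redact(sample, detected))
```

Replacing spans from the end of the string backwards avoids having to recompute character offsets after each substitution. Text with no detected spans, such as “Software Engineer at Databricks” with an empty detection list, passes through unchanged.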

Talk by: Ajmal Aziz and Rachael Straiton

Comments

Great presentation, I reference this for Data Governance conversations.

Better title for slide at 18:34 "Cost and performance scale *better* than linearly on Databricks and spark"

douglasmoore