Using Apache Spark and Differential Privacy for Protecting the Privacy of the 2020 Census Respondents

The goal of the 2020 Census is to count every person in the US, once, and in the correct place. The data created by the census will be used to apportion the US House of Representatives, to draw legislative districts, and to distribute more than $675 billion in federal funds. One of the data challenges of the 2020 Census is making high-quality data available for these purposes while protecting respondent confidentiality. We are doing this with differential privacy, a mathematical approach that allows us to balance the requirements for data accuracy and privacy protection. We use a custom-written application that uses Spark to perform roughly 2 million optimizations involving mixed integer linear programs, running on a cluster that typically has 4,800 CPU cores and 74 TB of RAM. In this talk, we will present the design of our Spark-based differential privacy application and discuss the application monitoring systems that we built in Amazon's GovCloud to monitor the multiple clusters and thousands of application runs that were used to develop the Disclosure Avoidance System for the 2020 Census.
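
The balance between accuracy and privacy described above can be illustrated with a small sketch. It uses the textbook Laplace mechanism on a single count, which is an assumption made for illustration only: the abstract does not say which noise mechanism the Disclosure Avoidance System uses, and the function names, the count of 1,000, and the epsilon values below are all hypothetical. A smaller privacy-loss parameter epsilon means stronger privacy and a larger expected error in the published count.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one sample from Laplace(0, scale).

    The difference of two independent exponential variables with mean
    `scale` follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count: int, epsilon: float, rng: random.Random,
                sensitivity: float = 1.0) -> float:
    """Return a count perturbed by the Laplace mechanism (illustrative only).

    `sensitivity` is how much one respondent can change the count; for a
    simple head count that is 1.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_count + laplace_noise(sensitivity / epsilon, rng)


def mean_abs_error(epsilon: float, trials: int = 20000, seed: int = 0) -> float:
    """Estimate the average absolute error of the noisy count by simulation."""
    rng = random.Random(seed)
    true_count = 1000  # hypothetical population of one geographic unit
    total = 0.0
    for _ in range(trials):
        total += abs(noisy_count(true_count, epsilon, rng) - true_count)
    return total / trials


if __name__ == "__main__":
    # Smaller epsilon: more privacy, less accuracy.
    for eps in (0.1, 1.0, 10.0):
        print(f"epsilon={eps:>4}: mean absolute error ~ {mean_abs_error(eps):.2f}")
```

For this mechanism the expected absolute error equals sensitivity / epsilon, so each tenfold increase in epsilon cuts the error roughly tenfold. The sketch covers noise on one count only and does not attempt to represent the Spark application or the mixed integer linear programs mentioned in the abstract.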
