Column encryption & Data Masking in Parquet - Protecting data at the lowest layer

Pavi Subenderan, Xinli Shang

A presentation from ApacheCon @Home 2020

In a typical big dataset, only a minority of columns are actually sensitive and need to be protected. Columnar file formats like Apache Parquet allow for column-level access control through encryption. This means the small number of sensitive columns in a dataset can be protected through encryption, while the non-sensitive columns remain open for access. Data masking features for encrypted columns bring further convenience and allow users to leverage encrypted columns even without access to them. The combination of column encryption and data masking maximizes accessibility to your data without compromising the security of sensitive data. In the first half, we will go over column encryption design and features in Parquet. We will cover considerations when operating Parquet column encryption in production, such as the Key Management Service, performance tradeoffs, and encryption algorithm choice. In the second half, we will cover the new data masking features in Parquet. We will discuss the motivation behind data masking, the security implications of masking, and its implementation. Finally, we will look at the tradeoffs between the different types of masks and the limitations of each type in terms of compression ratio, table joins, usability, and administration overhead.
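To make the mask tradeoffs mentioned in the abstract more concrete, here is a minimal sketch in plain Python. The three mask functions (`null_mask`, `redact_mask`, `hash_mask`), the demo key, and the sample data are all assumptions made up for illustration; the abstract does not name the mask types, and this is not the Parquet implementation. The sketch only shows why a deterministic hash mask keeps equality joins usable while a null mask does not.

```python
import hashlib
import hmac

# Hypothetical mask functions. Names and behavior are illustrative
# assumptions, not the masks implemented in Apache Parquet.

def null_mask(value):
    """Replace every value with None: nothing leaks, but joins are impossible."""
    return None

def redact_mask(value, keep=2):
    """Keep the first `keep` characters and replace the rest with '*'."""
    return value[:keep] + "*" * max(len(value) - keep, 0)

def hash_mask(value, key=b"demo-key"):
    """Keyed SHA-256 digest: equal inputs give equal outputs, so equality joins still work."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()

def distinct_count(values):
    """Number of distinct values left in a column after masking."""
    return len(set(values))

# Made-up sample column with one repeated value.
emails = ["ann@example.com", "bob@example.com", "ann@example.com"]

nulled = [null_mask(v) for v in emails]
redacted = [redact_mask(v) for v in emails]
hashed = [hash_mask(v) for v in emails]

print("null mask distinct values:  ", distinct_count(nulled))
print("redact mask distinct values:", distinct_count(redacted))
print("hash mask distinct values:  ", distinct_count(hashed))
```

In this sketch the null mask collapses the column to a single value, which compresses very well but carries no information. The hash mask preserves the number of distinct values, so masked columns can still be joined on equality, at the cost of high-entropy output that compresses poorly and a key that has to be managed.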

Pavi Subenderan:
Pavi is a Software Engineer on Uber's Data Infra team. His focus is on data security, privacy and open source big data technologies. He has been working on Parquet column encryption for 1.5 years and more recently on data masking.
Xinli Shang:
Xinli Shang is a tech lead on the Uber Data Infra team and an Apache Parquet committer. He is passionate about big data file formats for efficiency, performance, and security, and about tuning large-scale services for performance, throughput, and reliability. He is an active contributor to Apache Parquet. He also has many years of experience developing large-scale distributed systems like S3 Index and the Windows operating system.
Comments

Follow up: We are nearly a year after the post. Has the data masking been implemented yet?

exnuke

Amazing session guys, I would like to try this, do you have a demo or a tutorial?

leonardogomez