Scale By The Bay 2019: David Winters, GDPR Data Cleaner: Mutating Immutable Data

Remember when data engineers and data scientists used to say things like:

* “Log everything”
* “Never throw away data”
* “All data is important”
* “What is useless data today is tomorrow’s data of gold”

And then that four-letter acronym came into our vernacular... *G-D-P-R*

Now, you hear statements like this:

* “Do we really need this data?”
* “Is this data used at all?”
* “What does the GDPR say about this type of data?”

Another change that came with the GDPR is the right of a user to request the deletion of their personal data. This is a tricky proposition for those dealing with big data, since big data technologies were built on the concept of immutable data. Big data systems, such as Hadoop and Spark, scaled so well because data was never updated, only appended, and it was written out in large blocks that are not conducive to small updates or deletes.

In this talk, we discuss how personal data can be cleansed from existing big data storage systems, such as column-oriented Hive tables and key-value stores, and we introduce a new open source project that implements these ideas.
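To make the difficulty concrete, here is a minimal sketch of why deleting one user's records from append-only storage means rewriting whole blocks. It is an illustration only and not the approach of the open source project mentioned in the talk: the `purge_user` function, the `user_id` field, and the in-memory tuples standing in for write-once blocks are all assumptions made for this example.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]
Block = Tuple[Record, ...]  # stand-in for an immutable, write-once block of records


def purge_user(blocks: List[Block], user_id: str) -> Tuple[List[Block], int]:
    """Return a new block list without any record of `user_id`.

    Blocks that do not contain the user are reused untouched. Blocks that do
    are rewritten in full, because a write-once block cannot be edited in place.
    The second return value counts how many blocks had to be rewritten.
    """
    new_blocks: List[Block] = []
    rewritten = 0
    for block in blocks:
        if any(record["user_id"] == user_id for record in block):
            kept = tuple(r for r in block if r["user_id"] != user_id)
            rewritten += 1
            if kept:  # a block left empty by the purge is dropped entirely
                new_blocks.append(kept)
        else:
            new_blocks.append(block)
    return new_blocks, rewritten


# Hypothetical sample data: three blocks, two of which mention user "u2".
blocks = [
    ({"user_id": "u1", "event": "upload"}, {"user_id": "u2", "event": "view"}),
    ({"user_id": "u3", "event": "view"},),
    ({"user_id": "u2", "event": "share"},),
]
cleaned, rewritten = purge_user(blocks, "u2")
```

Even in this toy version, removing a single user touches every block that holds one of their records, which suggests why deletion requests are expensive when the blocks are large files rather than small tuples.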

David Winters
GoPro
Big Data Architect
San Francisco Bay Area
David is an Architect in the Data Science and Engineering team at GoPro and the creator of their Spark-Kafka streaming data ingestion pipeline. He has been developing scalable data processing pipelines and eCommerce systems for over 20 years in Silicon Valley. David's current big data interests include streaming data as fast as possible from devices to near real-time dashboards and switching his primary programming language to Scala from Java after nearly 20 years. He holds a B.Sc. in Computer Science from The Ohio State University.