Everyday Probabilistic Data Structures for Humans

Processing large amounts of data for analytical or business use cases is a daily occurrence for Apache Spark users. Cost, latency, and accuracy are three sides of a triangle that a product owner has to trade off. When dealing with TBs of data a day and PBs of data overall, even small efficiencies have a major impact on the bottom line. This talk covers the practical application of the following four data structures, which help design an efficient large-scale data pipeline while keeping costs in check (a minimal Spark sketch follows the list):

1. Bloom Filters
2. HyperLogLog
3. Count-Min Sketches
4. T-digests (Bonus)
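
As a minimal, hedged sketch (not the talk's actual code), all four structures have out-of-the-box counterparts in Spark's Scala API; the schema (userId, itemId, price), error parameters, and toy data below are assumptions for illustration only:

// Minimal sketch: Spark's built-in approximate/sketch helpers on a toy DataFrame.
// Column names and parameters are assumptions, not the talk's real pipeline.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.approx_count_distinct

val spark = SparkSession.builder().appName("pds-demo").master("local[*]").getOrCreate()
import spark.implicits._

// Toy stand-in for Rainforest Inc's event data
val events = Seq(
  ("john", "item-a", 19.99),
  ("mary", "item-b", 5.49),
  ("john", "item-c", 42.00)
).toDF("userId", "itemId", "price")

// 1. Bloom filter over users who have seen an ad (set membership, no false negatives)
val adViewers = events.stat.bloomFilter("userId", 1000000L, 0.01)

// 2. HyperLogLog-backed approximate distinct count (2% relative standard deviation)
val uniqueBuyers = events.agg(approx_count_distinct("userId", 0.02).as("buyers"))

// 3. Count-min sketch of item frequencies (eps, confidence, seed)
val itemFrequency = events.stat.countMinSketch("itemId", 0.001, 0.99, 42)

// 4. Approximate quantiles of prices (Spark ships Greenwald-Khanna here rather
//    than t-digest, but both give bounded-error percentiles)
val p90 = events.stat.approxQuantile("price", Array(0.9), 0.01)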

We will take the fictional example of an eCommerce company, Rainforest Inc., and answer the following business questions with our probabilistic data structures and Apache Spark, without writing any SQL (each question maps to one of the structures, as shown after the list):

1. Has user John seen an ad for this product yet?
2. How many unique users bought items A, B and C?
3. Who are the top sellers today?
4. What's the 90th percentile of cart prices? (Bonus)
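
Continuing the illustrative sketch above (same assumed names and toy data), each question then reduces to a single query against the corresponding structure:

// Continuation of the toy example above; names and thresholds are assumptions.
import org.apache.spark.sql.functions.{approx_count_distinct, col}

// 1. Has user John seen an ad yet? Bloom filter: "no" is exact, "yes" may be a false positive.
val johnSawAd: Boolean = adViewers.mightContain("john")

// 2. How many unique users bought items A, B and C? HyperLogLog on the filtered events.
val buyersOfABC = events
  .filter(col("itemId").isin("item-a", "item-b", "item-c"))
  .agg(approx_count_distinct("userId", 0.02).as("uniqueBuyers"))

// 3. Who are the top sellers today? A count-min sketch (over-)estimates the frequency
//    of any candidate; in practice it is paired with a small heap of heavy hitters.
val estimatedItemASales: Long = itemFrequency.estimateCount("item-a")

// 4. What's the 90th percentile of cart prices? (bonus)
println(s"p90 price: ${p90.head}")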

We will dive into how each of these data structures is computed for Rainforest Inc. and see which operations and libraries help us achieve our results. The session will simulate a TB of data in a notebook (streaming) and will include code samples showing effective use of the techniques described to answer the business questions listed above. For the implementation, we will build the functions as Structured Streaming Scala components and serialize the results so they can be queried separately to answer our questions (a rough sketch of this shape follows the paragraph). We will also present the cost and latency efficiencies achieved at the Adobe Experience Platform, running at PB scale, to show that these techniques go beyond theory.
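
As a rough, non-authoritative illustration of that streaming shape (paths, schema, and parameters are assumptions, and the talk's actual components and sink are not reproduced here), one way to maintain and serialize a sketch incrementally with Structured Streaming is:

// Rough sketch only: maintain a Bloom filter over a stream with foreachBatch and
// serialize it so a separate job can query it. Paths and schema are assumptions.
import java.io.FileOutputStream
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.approx_count_distinct
import org.apache.spark.util.sketch.BloomFilter

val spark = SparkSession.builder().appName("pds-stream").master("local[*]").getOrCreate()

// Assumed input: a directory of JSON events with userId / itemId / price fields
val stream = spark.readStream
  .schema("userId STRING, itemId STRING, price DOUBLE")
  .json("/tmp/rainforest/events")

// Running Bloom filter, merged micro-batch by micro-batch on the driver
var adViewers = BloomFilter.create(1000000L, 0.01)

val query = stream.writeStream
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    // Fold this micro-batch's user IDs into the running filter
    adViewers = adViewers.mergeInPlace(batch.stat.bloomFilter("userId", 1000000L, 0.01))

    // Serialize the filter so "Has John seen the ad?" can be answered by another job
    val out = new FileOutputStream(s"/tmp/rainforest/sketches/ad-viewers-$batchId.bin")
    try adViewers.writeTo(out) finally out.close()

    // Approximate distinct buyers seen in this batch (HyperLogLog under the hood)
    batch.agg(approx_count_distinct("userId", 0.02)).show()
  }
  .start()

query.awaitTermination()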

About:
Databricks provides a unified data analytics platform, powered by Apache Spark™, that accelerates innovation by unifying data science, engineering and business.
