How Adobe uses Structured Streaming at Scale

Adobe’s Unified Profile System is the heart of its Experience Platform. It ingests terabytes of data a day and stores petabytes. Along with this massive growth we have faced multiple challenges in our Apache Spark deployment, which is used from ingestion through processing. We want to share some of the lessons we learned the hard way as we reached this scale, specifically with Structured Streaming.

Know thy Lag

When consuming from a Kafka topic that sees sporadic load, it is very important to monitor consumer lag. It also teaches you to respect what a beast backpressure is.
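Consumer lag is the gap between the log-end offset and the committed offset, summed across partitions. A minimal sketch of that calculation (plain Python with made-up offset numbers, not Adobe's monitoring code):

```python
def consumer_lag(end_offsets, committed_offsets):
    """Total consumer lag: log-end offset minus committed offset, per partition."""
    return sum(end_offsets[p] - committed_offsets.get(p, 0) for p in end_offsets)

# Hypothetical snapshot for a 3-partition topic.
end = {0: 1500, 1: 2000, 2: 900}
committed = {0: 1400, 1: 2000, 2: 650}
print(consumer_lag(end, committed))  # 350
```

If this number grows over consecutive micro-batches, the stream is falling behind and throughput settings need revisiting.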
Reading Data In

Fan-Out Pattern Using minPartitions to Use Kafka Efficiently
Overload protection using maxOffsetsPerTrigger
More Apache Spark Settings Used to Optimize Throughput
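The two Kafka source options above combine naturally in one reader: `minPartitions` fans a small number of Kafka partitions out to more Spark tasks, while `maxOffsetsPerTrigger` caps how many records a micro-batch may pull, protecting the job from overload after downtime. A configuration sketch (PySpark, with placeholder broker address, topic name, and tuning values, not Adobe's actual settings):

```python
# Configuration sketch only: broker, topic, and the numbers are illustrative.
df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "events")
    # Fan out: ask Spark for more input partitions than the topic has,
    # so more tasks read in parallel.
    .option("minPartitions", "64")
    # Overload protection: cap records per micro-batch so a backlog
    # drains in bounded batches instead of one giant one.
    .option("maxOffsetsPerTrigger", "100000")
    .load()
)
```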
MicroBatching Best Practices

map() + foreach() vs. mapPartitions() + foreachPartition()
Apache Spark Speculation and Its Effects
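The map()-vs-mapPartitions() distinction matters most when each record touches an external system: per-record processing pays the setup cost (e.g. a connection) once per record, per-partition processing pays it once per partition. A plain-Python sketch of that difference (the `FakeConnection` class and the data are illustrative, not Spark API):

```python
# Illustrates why mapPartitions/foreachPartition beats map/foreach for I/O:
# expensive setup runs once per partition instead of once per record.

class FakeConnection:
    """Stand-in for an expensive resource such as a database connection."""
    opened = 0

    def __init__(self):
        FakeConnection.opened += 1

    def write(self, record):
        pass  # pretend to send the record somewhere

def process_per_record(partitions):
    for partition in partitions:
        for record in partition:
            conn = FakeConnection()  # one connection per record (map/foreach style)
            conn.write(record)

def process_per_partition(partitions):
    for partition in partitions:
        conn = FakeConnection()      # one connection per partition (mapPartitions style)
        for record in partition:
            conn.write(record)

data = [[1, 2, 3], [4, 5], [6]]  # three "partitions", six records

FakeConnection.opened = 0
process_per_record(data)
print(FakeConnection.opened)      # 6

FakeConnection.opened = 0
process_per_partition(data)
print(FakeConnection.opened)      # 3
```

At millions of records per micro-batch, that constant-factor difference dominates sink throughput.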

Calculating Streaming Statistics

Windowing
Importance of the State Store
RocksDB FTW
Broadcast joins
Custom Aggregators
Off-Heap Counters Using Redis
Pipelining
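The core idea behind windowed streaming statistics is bucketing events by event time and aggregating per bucket; in Structured Streaming this is done with the built-in window() function over stateful aggregation (with the state kept in a state store such as RocksDB). A plain-Python concept sketch of a tumbling-window count (timestamps and window size are illustrative):

```python
from collections import defaultdict

def tumbling_window_counts(events, window_size):
    """Bucket (timestamp, value) events into fixed non-overlapping windows
    and count the events in each window."""
    counts = defaultdict(int)
    for ts, _value in events:
        window_start = (ts // window_size) * window_size  # floor to window boundary
        counts[window_start] += 1
    return dict(counts)

# Five events with second-granularity timestamps, 10-second tumbling windows.
events = [(0, "a"), (5, "b"), (12, "c"), (14, "d"), (21, "e")]
print(tumbling_window_counts(events, 10))  # {0: 2, 10: 2, 20: 1}
```

In a real stream the per-window counts are the state that must survive across micro-batches, which is why the choice of state store backend matters so much at scale.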
