Building a Streaming Microservice Architecture: with Apache Spark Structured Streaming and Friends

As we continue to push the boundaries of what is possible with respect to pipeline throughput and data serving tiers, new methodologies and techniques continue to emerge to handle larger and larger workloads – from real-time processing and aggregation of user / behavioral data, and rule-based / conditional distribution of event and metric streams, to almost any data pipeline / lineage problem. These workloads are typical in most modern data platforms and are critical to all operational analytics systems, data storage systems, ML / DL, and beyond. One of the most common problems I’ve seen across companies reduces to general data reliability, driven by the need to scale and migrate processing components as the company expands and teams grow. What was once a handful of systems can quickly fan out into a slew of independent components and serving layers, all of which need to be scaled up, down, or out with zero downtime to meet the demands of a world hungry for data. During this technical deep dive, a new mental model will be built up that aims to reinvent how one should build massive, interconnected services using Kafka, Google Protocol Buffers / gRPC, and Parquet / Delta Lake / Spark Structured Streaming. The material presented is based on lessons learned the hard way while building a massive real-time insights platform at Twilio, where data integrity and stream fault-tolerance are as critical as the services the company provides.
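To make the pipeline shape concrete, below is a minimal PySpark sketch (not code from the talk itself) of the kind of service described above: protobuf-encoded events are read from Kafka, decoded with Spark's protobuf support (Spark 3.4+), and appended to a Delta Lake table with checkpointing for fault tolerance. The broker address, topic, message name, descriptor file, package versions, and paths are hypothetical placeholders.

from pyspark.sql import SparkSession
from pyspark.sql.protobuf.functions import from_protobuf  # requires Spark 3.4+ and the spark-protobuf package

spark = (SparkSession.builder
         .appName("streaming-microservice-sketch")
         # spark-protobuf and delta-spark must be on the classpath; versions here are placeholders
         .config("spark.jars.packages",
                 "org.apache.spark:spark-protobuf_2.12:3.5.0,io.delta:delta-spark_2.12:3.1.0")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

# 1. Read the raw event stream from Kafka; the `value` column holds protobuf-encoded bytes.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")   # hypothetical broker
       .option("subscribe", "user.behavior.events")        # hypothetical topic
       .load())

# 2. Decode the protobuf payload into typed columns using a compiled descriptor set
#    (e.g. produced by `protoc --descriptor_set_out=events.desc`); message name is hypothetical.
events = raw.select(
    from_protobuf("value", "UserEvent", descFilePath="/etc/schemas/events.desc").alias("event")
).select("event.*")

# 3. Sink the typed stream to a Delta Lake table, with checkpointing for exactly-once appends.
query = (events.writeStream
         .format("delta")
         .option("checkpointLocation", "/chk/user_events")  # hypothetical checkpoint path
         .outputMode("append")
         .start("/lake/user_events"))                       # hypothetical table path

query.awaitTermination()

The same shape generalizes to the fan-out problem described above: each downstream consumer is just another Structured Streaming job reading the same protobuf contract, so services can be scaled or migrated independently without breaking the data model.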

About:
Databricks provides a unified data analytics platform, powered by Apache Spark™, that accelerates innovation by unifying data science, engineering and business.

Comments:

Thanks for doing this. Excellent overview of all the pieces involved. We are using a similar architecture with protobuf/Kafka/pyspark to standardize our data engineering pipelines.

PokeRowlet