A Deep Dive into Stateful Stream Processing in Structured Streaming 2018 Part 2 (Tathagata Das)

Tathagata Das is an Apache Spark committer and a member of the PMC. He's the lead developer behind Spark Streaming and currently develops Structured Streaming.

Stateful processing is one of the most challenging aspects of distributed, fault-tolerant stream processing. The DataFrame APIs in Structured Streaming make it very easy for developers to express their stateful logic, either implicitly (streaming aggregations) or explicitly (mapGroupsWithState). However, there are a number of moving parts under the hood that make all the magic possible. In this talk, I am going to dive deeper into how stateful processing works in Structured Streaming.

About: Databricks provides a unified data analytics platform, powered by Apache Spark™, that accelerates innovation by unifying data science, engineering and business.

Comments

How can we deduplicate and keep the last record instead of the first (based on a timestamp field in the DataFrame)? The current implementation of dropDuplicates keeps the first occurrence and ignores all subsequent occurrences for that key. How can we tell Spark to update the state and keep the most recent value based on the timestamp field?

AashishOla
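
One possible direction for the question above is the explicit stateful route the description mentions (mapGroupsWithState): keep one record per key as state and overwrite it only when a newer timestamp arrives. The sketch below is plain Python, not Spark code. It only simulates the per-key update logic that such a state function might contain. The names `Event`, `update_latest`, and `process_micro_batch` are hypothetical, and the dictionary stands in for a state store.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class Event:
    """Hypothetical record shape: a dedup key, an event timestamp, a payload."""
    key: str
    timestamp: int  # e.g. epoch seconds
    value: str


def update_latest(state: Optional[Event], batch: Iterable[Event]) -> Optional[Event]:
    """Return the most recent event among the stored state and the new events.

    On equal timestamps the already-stored (or earlier-seen) event is kept.
    """
    latest = state
    for event in batch:
        if latest is None or event.timestamp > latest.timestamp:
            latest = event
    return latest


def process_micro_batch(store: Dict[str, Event], batch: List[Event]) -> Dict[str, Event]:
    """Group one micro-batch by key and update the per-key state in `store`.

    Returns only the keys whose stored record changed in this batch.
    """
    grouped: Dict[str, List[Event]] = {}
    for event in batch:
        grouped.setdefault(event.key, []).append(event)

    changed: Dict[str, Event] = {}
    for key, events in grouped.items():
        previous = store.get(key)
        latest = update_latest(previous, events)
        if latest is not None and latest != previous:
            store[key] = latest
            changed[key] = latest
    return changed
```

In this sketch a late record with an older timestamp leaves the state untouched, while a newer one replaces it. In a real streaming job the state would also need a timeout or watermark so that it does not grow without bound; that part is not shown here.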