Real-Time Data Pipelines Made Easy with Structured Streaming in Apache Spark | Databricks

ABOUT THE TALK:

Structured Streaming is the next generation of distributed stream processing in Apache Spark. Developers can write a query in their language of choice (Scala/Java/Python/R) using powerful high-level APIs (DataFrames / Datasets / SQL) and apply that same query to both static datasets and streaming data. In the streaming case, Spark automatically creates an incremental execution plan that handles late, out-of-order data and ensures end-to-end exactly-once fault-tolerance guarantees.

In this practical session, I will walk through a concrete streaming ETL example in which – in fewer than 10 lines – you can read raw, unstructured data from Kafka, transform it, and write it out as a structured table ready for batch and ad-hoc queries on up-to-the-last-minute data. I will also give a quick glimpse of advanced features such as event-time-based aggregations, stream-stream joins, and arbitrary stateful operations.
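
As a rough illustration of the kind of streaming ETL described above – a hedged sketch rather than the exact code from the talk – the following Scala snippet uses Spark's public Structured Streaming APIs; the broker address, topic name, JSON schema, and output paths are hypothetical placeholders.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("StreamingETL").getOrCreate()

// Assumed schema for the raw JSON payload (hypothetical fields)
val schema = new StructType()
  .add("device", StringType)
  .add("signal", IntegerType)
  .add("time", TimestampType)

// Read the raw byte stream from Kafka (placeholder broker and topic)
val raw = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("subscribe", "events")
  .load()

// Parse the Kafka value bytes into structured columns
val parsed = raw
  .select(from_json(col("value").cast("string"), schema).as("data"))
  .select("data.*")

// Continuously append to a Parquet table that batch and ad-hoc queries can read
val query = parsed.writeStream
  .format("parquet")
  .option("path", "/tmp/etl/events")
  .option("checkpointLocation", "/tmp/etl/checkpoint")
  .start()

The same transformation applied to a static DataFrame (for example, one read with spark.read) would produce the same result, which is the batch/streaming unification the abstract refers to.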

ABOUT THE SPEAKER:

Tathagata is a committer and PMC member of the Apache Spark project and a Software Engineer at Databricks. He is the lead developer of Spark Streaming and now focuses primarily on Structured Streaming. Previously, he was a graduate student researcher in the AMPLab at UC Berkeley, where he conducted research on data-center frameworks and networks with Scott Shenker and Ion Stoica.

COMMENTS:

28:45 to 29:40 is the best!!! :D Just don't miss that; it sets the context.

tejusization

A very simplified approach to explaining streaming.

mohitmehta

Good explanation of streaming. Thanks!

HridyanshiB.

What would be an open-source equivalent of Databricks Delta?

danielmackie

Appreciated. Thank you for a great knowledge share.

venkat.k

Good presentation. I would like to understand more about how it could integrate and scale with Apache Kafka.

yourstrulyDA

Once data has entered the DataFrame, if that data is later updated or deleted at the source, how can I update or delete it in the DataFrame?

zhengfang

How about integrating this with TensorFlow Serving for an end-to-end analytics paradigm?

thesleepyhead

23:30 A single rogue timestamp that is one hour ahead of the second-highest timestamp would drop all earlier buckets except the one bucket corresponding to this single anomalous value. This is fragile.

JanekBogucki
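
To make the watermark behaviour discussed in the comment above concrete: Spark computes the watermark as the maximum event time seen so far minus a user-specified delay threshold, and drops state for event-time windows that fall entirely below that watermark, so a single far-future timestamp can indeed advance it sharply. A minimal sketch in Scala, assuming a streaming DataFrame named events with hypothetical "time" and "device" columns:

import org.apache.spark.sql.functions._

// events is assumed to be a streaming DataFrame with a "time" timestamp column
val counts = events
  .withWatermark("time", "10 minutes")                       // tolerate up to 10 minutes of lateness
  .groupBy(window(col("time"), "5 minutes"), col("device"))  // 5-minute event-time buckets per device
  .count()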

Want to know about best practices for real-time analytics architecture on big data?

GrayMatterSoftware

This is a big disappointment. You cannot stream pipelines built with DataFrames. Unified processing framework?? Come on!

You have to build new versions of all your algorithms so that they can work with a DStream? What a waste of time.

albertoandreotti