Real-Time Data Pipelines Made Easy with Structured Streaming in Apache Spark | Databricks

ABOUT THE TALK:

Structured Streaming is the next generation of distributed stream processing in Apache Spark. Developers can write a query in their language of choice (Scala/Java/Python/R) using powerful high-level APIs (DataFrames / Datasets / SQL) and apply that same query to both static datasets and streaming data. In the streaming case, Spark automatically creates an incremental execution plan that handles late, out-of-order data and ensures end-to-end exactly-once fault-tolerance guarantees.

In this practical session, I will walk through a concrete streaming ETL example in which – in fewer than 10 lines – you can read raw, unstructured data from Kafka, transform it, and write it out as a structured table ready for batch and ad-hoc queries on up-to-the-last-minute data. I will also give a quick glimpse of advanced features such as event-time-based aggregations, stream-stream joins, and arbitrary stateful operations.
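
As a rough illustration of the kind of streaming ETL described above – a hedged sketch rather than the exact code from the talk – the following Scala snippet uses Spark's public Structured Streaming APIs; the broker address, topic name, JSON schema, and output paths are hypothetical placeholders.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("StreamingETL").getOrCreate()

// Assumed schema for the raw JSON payload (hypothetical fields)
val schema = new StructType()
  .add("device", StringType)
  .add("signal", IntegerType)
  .add("time", TimestampType)

// Read the raw byte stream from Kafka (placeholder broker and topic)
val raw = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("subscribe", "events")
  .load()

// Parse the Kafka value bytes into structured columns
val parsed = raw
  .select(from_json(col("value").cast("string"), schema).as("data"))
  .select("data.*")

// Continuously append to a Parquet table that batch and ad-hoc queries can read
val query = parsed.writeStream
  .format("parquet")
  .option("path", "/tmp/etl/events")
  .option("checkpointLocation", "/tmp/etl/checkpoint")
  .start()

The same transformation applied to a static DataFrame (for example, one read with spark.read) would produce the same result, which is the batch/streaming unification the abstract refers to.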

ABOUT THE SPEAKER:

Tathagata is a committer and PMC member of the Apache Spark project and a Software Engineer at Databricks. He is the lead developer of Spark Streaming and now focuses primarily on Structured Streaming. Previously, he was a graduate student researcher in the AMPLab at UC Berkeley, where he conducted research on data-center frameworks and networks with Scott Shenker and Ion Stoica.

COMMENTS:

28:45 to 29:40 is the best!!! :D Just don't miss that; it sets the context.

tejusization

A very simplified approach to explaining streaming.

mohitmehta

Good explanation of streaming. Thanks!

HridyanshiB.

What would be an open-source equivalent of Databricks Delta?

danielmackie

Appreciated. Thank you for a great knowledge share.

venkat.k

Good presentation. I would like to understand more about how it could integrate and scale with Apache Kafka.

yourstrulyDA

Once data has entered the DataFrame, if that data is later updated or deleted at the source, how can I update or delete it in the DataFrame?

zhengfang

How about integrating this with TensorFlow Serving for an end-to-end analytics paradigm?

thesleepyhead

23:30 A single rogue timestamp that is one hour ahead of the second-highest timestamp would drop all earlier buckets except the one bucket corresponding to this single anomalous value. This is fragile.

JanekBogucki
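
To make the watermark behaviour discussed in the comment above concrete: Spark computes the watermark as the maximum event time seen so far minus a user-specified delay threshold, and drops state for event-time windows that fall entirely below that watermark, so a single far-future timestamp can indeed advance it sharply. A minimal sketch in Scala, assuming a streaming DataFrame named events with hypothetical "time" and "device" columns:

import org.apache.spark.sql.functions._

// events is assumed to be a streaming DataFrame with a "time" timestamp column
val counts = events
  .withWatermark("time", "10 minutes")                       // tolerate up to 10 minutes of lateness
  .groupBy(window(col("time"), "5 minutes"), col("device"))  // 5-minute event-time buckets per device
  .count()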

Want to know about best practices for real-time analytics architecture on big data?

GrayMatterSoftware

This is a big disappointment. You cannot stream pipelines built with DataFrames. Unified processing framework?? Come on!

You have to build new versions of all your algorithms so that they can work with a DStream? What a waste of time.

albertoandreotti