Streaming data from BigQuery to Datastore using Dataflow

🚀 Today, I'm eager to discuss a method for using a Dataflow streaming pipeline to move data from BigQuery to Cloud Datastore. At first glance, this approach might seem unconventional. Why? Because BigQuery isn't typically associated with streaming capabilities. However, I believe this strategy has immense potential.

🔍 Here's the context: a significant portion of our data now resides in BigQuery, structured through tools like DBT or other data transformation frameworks. This shift means much of the information we need isn't just in event-based message queues anymore. And here's an interesting observation: in most real-world scenarios, near real-time data transfer (ranging from 30 seconds to an hour) is often more than sufficient.

💡 This leads me to propose a versatile, reusable solution for moving data from BigQuery to Cloud Datastore, or perhaps to other target databases. The potential value this could add, especially in terms of real-time data processing and analytics, is substantial.
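To give a rough idea of the shape of such a pipeline, here is a minimal sketch in the Apache Beam Python SDK: a PeriodicImpulse fires on a fixed interval, each tick queries BigQuery for rows newer than a checkpoint, and the results are written to Datastore. This is an illustration under my own assumptions, not the exact code from the video; the project ID, table, column names, and the hard-coded checkpoint timestamp are placeholders (the design discussed in the video keeps its checkpoint in a table rather than in code).

```python
# Hedged sketch: stream-ish BigQuery -> Datastore sync driven by a periodic impulse.
# All identifiers below (project, table, kind, columns) are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.periodicsequence import PeriodicImpulse
from apache_beam.io.gcp.datastore.v1new.datastoreio import WriteToDatastore
from apache_beam.io.gcp.datastore.v1new.types import Entity, Key

PROJECT = "my-gcp-project"           # placeholder
SOURCE_TABLE = "my_dataset.users"    # placeholder
POLL_INTERVAL_SEC = 60               # how often to look for new rows


class QueryNewRows(beam.DoFn):
    """On every impulse, query BigQuery for rows newer than the last checkpoint."""

    def process(self, _impulse):
        from google.cloud import bigquery
        client = bigquery.Client(project=PROJECT)
        # In the real design this would be read from a checkpoint table;
        # hard-coded here only to keep the sketch short.
        last_processed_ts = "2023-11-11 23:00:15"
        query = f"""
            SELECT id, name, updated_at
            FROM `{SOURCE_TABLE}`
            WHERE updated_at > TIMESTAMP('{last_processed_ts}')
        """
        for row in client.query(query).result():
            yield dict(row)


def to_entity(row):
    """Convert a BigQuery row dict into a Datastore entity keyed by id."""
    key = Key(["User", str(row["id"])], project=PROJECT)
    entity = Entity(key)
    entity.set_properties({"name": row["name"], "updated_at": row["updated_at"]})
    return entity


def run():
    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as p:
        (
            p
            | "Impulse" >> PeriodicImpulse(fire_interval=POLL_INTERVAL_SEC)
            | "Query BigQuery" >> beam.ParDo(QueryNewRows())
            | "To Entity" >> beam.Map(to_entity)
            | "Write to Datastore" >> WriteToDatastore(PROJECT)
        )


if __name__ == "__main__":
    run()
```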

🤔 I'm eager to hear your thoughts on this. And for the Apache Beam experts out there, I'd particularly value your insights. Are there any blind spots in this approach? Could there be reliability issues under certain conditions? Your expert critique is invaluable!

#Dataflow #BigQuery #CloudDatastore #StreamingData #DataEngineering #InnovationInTech #PracticalGCP #ApacheBeam

00:39 - Solutions we talked about so far
06:16 - How about a batch Dataflow job?
08:08 - How about a streaming Dataflow job?
09:32 - It's a bit complex, but there is a way
10:37 - Key design considerations
14:18 - Detailed Design - The flow
16:32 - Detailed Design - Impulse window & checkpointing
20:32 - Detailed Design - Checkpointing logic
26:18 - Code & Demo
38:47 - Pros & cons plus ideas

Correction: the BETWEEN filter shown in the video has a bug caused by its all-inclusive bounds; I've since replaced it with greater-than / equal-to comparisons to make sure the ranges don't overlap.
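To make that correction concrete, here is one way a non-overlapping checkpoint query could be built: the lower bound is strictly greater than the last checkpoint while the upper bound is inclusive, so the boundary row is never processed twice. This is a hedged illustration with placeholder table and column names, not necessarily the exact filter used in the video.

```python
# Hypothetical non-overlapping window filter for the incremental BigQuery query.
def build_incremental_query(table, last_checkpoint_ts, current_ts):
    return f"""
        SELECT *
        FROM `{table}`
        WHERE updated_at >  TIMESTAMP('{last_checkpoint_ts}')
          AND updated_at <= TIMESTAMP('{current_ts}')
    """
```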
Comments

Nice video! Just one question: at 23:46 you mentioned that if the job was killed at "row number 2", some data between 0001-01-01 00:00:00 and 2023-11-11 23:00:15 was processed and some was not. When the pipeline is resumed, doesn't it start from the beginning of the work process and insert a new row into the checkpoint table (so 2023-11-11 23:00:15 becomes the last processed timestamp)?

andrewwang

Getting the error below while running the pipeline with DirectRunner. Any idea?

Transform node AppliedPTransform(Start Impulse FakePii/GenSequence/ProcessKeyedElements/GroupByKey/GroupByKey, _GroupByKeyOnly) was not replaced as expected.

viralsurani