Streaming data from BigQuery to Datastore using Dataflow

🚀 Today, I'm eager to discuss a method for using a Dataflow streaming pipeline to move data from BigQuery to Cloud Datastore. At first glance, this approach might seem unconventional. Why? Because BigQuery isn't typically associated with streaming capabilities. However, I believe this strategy has immense potential.

🔍 Here's the context: a significant portion of our data now resides in BigQuery, structured through tools like DBT or other data transformation frameworks. This shift means much of the information we need isn't just in event-based message queues anymore. And here's an interesting observation: in most real-world scenarios, near real-time data transfer (ranging from 30 seconds to an hour) is often more than sufficient.

💡 This leads me to propose a versatile, reusable solution for moving data from BigQuery to Cloud Datastore, or perhaps to other target databases. The potential value this could add, especially in terms of real-time data processing and analytics, is substantial.
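To give a rough idea of the shape of such a pipeline, here is a minimal sketch in the Apache Beam Python SDK: a PeriodicImpulse fires on a fixed interval, each tick queries BigQuery for rows newer than a checkpoint, and the results are written to Datastore. This is an illustration under my own assumptions, not the exact code from the video; the project ID, table, column names, and the hard-coded checkpoint timestamp are placeholders (the design discussed in the video keeps its checkpoint in a table rather than in code).

```python
# Hedged sketch: stream-ish BigQuery -> Datastore sync driven by a periodic impulse.
# All identifiers below (project, table, kind, columns) are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.periodicsequence import PeriodicImpulse
from apache_beam.io.gcp.datastore.v1new.datastoreio import WriteToDatastore
from apache_beam.io.gcp.datastore.v1new.types import Entity, Key

PROJECT = "my-gcp-project"           # placeholder
SOURCE_TABLE = "my_dataset.users"    # placeholder
POLL_INTERVAL_SEC = 60               # how often to look for new rows


class QueryNewRows(beam.DoFn):
    """On every impulse, query BigQuery for rows newer than the last checkpoint."""

    def process(self, _impulse):
        from google.cloud import bigquery
        client = bigquery.Client(project=PROJECT)
        # In the real design this would be read from a checkpoint table;
        # hard-coded here only to keep the sketch short.
        last_processed_ts = "2023-11-11 23:00:15"
        query = f"""
            SELECT id, name, updated_at
            FROM `{SOURCE_TABLE}`
            WHERE updated_at > TIMESTAMP('{last_processed_ts}')
        """
        for row in client.query(query).result():
            yield dict(row)


def to_entity(row):
    """Convert a BigQuery row dict into a Datastore entity keyed by id."""
    key = Key(["User", str(row["id"])], project=PROJECT)
    entity = Entity(key)
    entity.set_properties({"name": row["name"], "updated_at": row["updated_at"]})
    return entity


def run():
    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as p:
        (
            p
            | "Impulse" >> PeriodicImpulse(fire_interval=POLL_INTERVAL_SEC)
            | "Query BigQuery" >> beam.ParDo(QueryNewRows())
            | "To Entity" >> beam.Map(to_entity)
            | "Write to Datastore" >> WriteToDatastore(PROJECT)
        )


if __name__ == "__main__":
    run()
```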

🤔 I'm eager to hear your thoughts on this. And for the Apache Beam experts out there, I'd particularly value your insights. Are there any blind spots in this approach? Could there be reliability issues under certain conditions? Your expert critique is invaluable!

#Dataflow #BigQuery #CloudDatastore #StreamingData #DataEngineering #InnovationInTech #PracticalGCP #ApacheBeam

00:39 - Solutions we talked about so far
06:16 - How about a batch Dataflow job?
08:08 - How about a streaming Dataflow job?
09:32 - It's a bit complex, but there is a way
10:37 - Key design considerations
14:18 - Detailed Design - The flow
16:32 - Detailed Design - Impulse window & checkpointing
20:32 - Detailed Design - Checkpointing logic
26:18 - Code & Demo
38:47 - Pros & cons plus ideas

Correction: the BETWEEN filter shown in the video has a bug caused by its all-inclusive bounds; I've since replaced it with greater-than / equal-to comparisons to make sure the ranges don't overlap.
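To make that correction concrete, here is one way a non-overlapping checkpoint query could be built: the lower bound is strictly greater than the last checkpoint while the upper bound is inclusive, so the boundary row is never processed twice. This is a hedged illustration with placeholder table and column names, not necessarily the exact filter used in the video.

```python
# Hypothetical non-overlapping window filter for the incremental BigQuery query.
def build_incremental_query(table, last_checkpoint_ts, current_ts):
    return f"""
        SELECT *
        FROM `{table}`
        WHERE updated_at >  TIMESTAMP('{last_checkpoint_ts}')
          AND updated_at <= TIMESTAMP('{current_ts}')
    """
```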
Comments

Nice video! Just one question: at 23:46 you mentioned that if the job was killed at "row number 2", some data between 0001-01-01 00:00:00 and 2023-11-11 23:00:15 was processed and some was not. When the pipeline is resumed, doesn't it start from the beginning of the work process and insert a new row into the checkpoint table (so 2023-11-11 23:00:15 becomes the last processed timestamp)?

andrewwang

Getting the error below while running the pipeline with DirectRunner. Any idea?

Transform node AppliedPTransform(Start Impulse FakePii/GenSequence/ProcessKeyedElements/GroupByKey/GroupByKey, _GroupByKeyOnly) was not replaced as expected.

viralsurani