Creating a Stream Data Pipeline on Google Cloud Platform using Apache Beam
We built a scalable and flexible stream data pipeline for our microservices on Google Cloud Platform (GCP) with Apache Beam, using Cloud Pub/Sub, Google Cloud Storage, BigQuery, and Cloud Dataflow. The pipeline runs in production at Mercari, one of the largest C2C e-commerce services in Japan. It currently accepts logs from 5+ microservices, and that number will grow soon.
Our microservice architecture is based on the following three concepts:
1. Split the log collection and data processing phases to keep the system simple.
2. Use stream processing in order to achieve low latency.
3. Don’t just accumulate raw data—support structured output that is easier to use.
For each microservice, we provide a Cloud Pub/Sub “Ramp” topic to send logs to. A Cloud Pub/Sub message carries an optional byte array as its payload. A Cloud Dataflow streaming job collects every message ingested into a Ramp and republishes it to the “RawDataHub,” a central Cloud Pub/Sub topic. The PubsubMessage payload is not changed at all; the metadata needed for subsequent processing (destination and schema information, data needed for pipeline metrics, etc.) is carried in the PubsubMessage’s attribute map. This Dataflow job does no per-service or per-topic processing; it treats all messages uniformly.
Two independent Cloud Dataflow streaming jobs then consume the raw data from RawDataHub: one writes it to the “RawDataLake” (backed by Google Cloud Storage, or “GCS”), and the other converts it into structured Avro records and publishes them to the “StructuredDataHub,” another Cloud Pub/Sub topic. Two more independent Cloud Dataflow streaming jobs then consume the structured data from StructuredDataHub: one writes to the “StructuredDataLake” on GCS, and the other loads it into the “Data Warehouse” (Google BigQuery). Minimal sketches of the collector and warehouse jobs appear below.
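To make the collector step concrete, here is a minimal Beam sketch of the idea. The project, subscription, topic, and attribute names (“my-project”, “ramp-service-a”, “raw-data-hub”, “service”, “schema”) are hypothetical placeholders, not Mercari’s actual configuration. The job reads from one Ramp subscription, adds routing metadata to the attribute map, and republishes the untouched payload to RawDataHub.

    // Collector sketch: Ramp -> RawDataHub. All names below are assumptions.
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessageWithAttributesCoder;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.options.StreamingOptions;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.TypeDescriptor;

    public class RampToRawDataHub {
      public static void main(String[] args) {
        StreamingOptions options =
            PipelineOptionsFactory.fromArgs(args).as(StreamingOptions.class);
        options.setStreaming(true);
        Pipeline p = Pipeline.create(options);

        p.apply("ReadRamp",
                PubsubIO.readMessagesWithAttributes()
                    .fromSubscription("projects/my-project/subscriptions/ramp-service-a"))
         .apply("AttachRoutingMetadata",
                MapElements.into(TypeDescriptor.of(PubsubMessage.class))
                    .via((PubsubMessage msg) -> {
                      // The payload stays byte-for-byte identical; only the
                      // attribute map gains routing and schema metadata.
                      Map<String, String> attrs = new HashMap<>(msg.getAttributeMap());
                      attrs.put("service", "service-a");       // assumed keys
                      attrs.put("schema", "service_a_log_v1");
                      return new PubsubMessage(msg.getPayload(), attrs);
                    }))
         .setCoder(PubsubMessageWithAttributesCoder.of())
         .apply("WriteRawDataHub",
                PubsubIO.writeMessages()
                    .to("projects/my-project/topics/raw-data-hub"));

        p.run();
      }
    }

Keeping the payload opaque at this stage is what lets a single Dataflow job treat messages from every service uniformly.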
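Likewise, a minimal sketch of the warehouse step, assuming a hypothetical two-field Avro record and a hypothetical BigQuery table; the real schemas are per-service. It decodes Avro records from StructuredDataHub and streams them into BigQuery row by row.

    // Warehouse sketch: StructuredDataHub -> BigQuery. Schema and table names
    // below are illustrative assumptions.
    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.options.StreamingOptions;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.TypeDescriptor;

    public class StructuredDataHubToBigQuery {
      // Hypothetical layout for a structured log record.
      private static final Schema SCHEMA = new Schema.Parser().parse(
          "{\"type\":\"record\",\"name\":\"Log\",\"fields\":["
              + "{\"name\":\"service\",\"type\":\"string\"},"
              + "{\"name\":\"message\",\"type\":\"string\"}]}");

      public static void main(String[] args) {
        StreamingOptions options =
            PipelineOptionsFactory.fromArgs(args).as(StreamingOptions.class);
        options.setStreaming(true);
        Pipeline p = Pipeline.create(options);

        p.apply("ReadStructuredDataHub",
                PubsubIO.readAvroGenericRecords(SCHEMA)
                    .fromSubscription("projects/my-project/subscriptions/structured-data-hub"))
         .apply("ToTableRow",
                MapElements.into(TypeDescriptor.of(TableRow.class))
                    .via((GenericRecord r) -> new TableRow()
                        .set("service", r.get("service").toString())
                        .set("message", r.get("message").toString())))
         .setCoder(TableRowJsonCoder.of())
         .apply("WriteWarehouse",
                // Unbounded input defaults to streaming inserts, appending
                // each record to the table as it arrives.
                BigQueryIO.writeTableRows()
                    .to("my-project:logs.structured_logs")
                    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
                    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER));

        p.run();
      }
    }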
---
The Beam Summit North America 2019 was a two-day event held in Las Vegas, focused entirely on Apache Beam.