Google Dataflow

Dataflow is a serverless data processing service that allows you to process streaming and batch data.

Dataflow fundamentals

Dataflow is Google's managed service for Apache Beam.

The fundamental components of Apache Beam are:

  • Pipeline :: a user-constructed graph that defines the desired data processing operations
  • PCollection :: a data set (bounded) or data stream (unbounded)
  • PTransform :: a data processing operation, applied as a step in the pipeline

Basics of the Beam model

Dataflow windows

Tumbling Windows

[Figure: Dataflow tumbling windows]
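
Tumbling (fixed) windows split the timeline into equal-sized, non-overlapping intervals, so every element lands in exactly one window. The sketch below is plain Python that illustrates the assignment rule only; it is not the Beam API, and the names `tumbling_window` and `count_per_window` plus the sample events are made up for illustration. In the Beam Python SDK the corresponding windowing function is `window.FixedWindows(size)`.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Window = Tuple[int, int]  # [start, end) in seconds


def tumbling_window(timestamp: int, size: int) -> Window:
    """Return the single [start, end) window of length `size` containing `timestamp`."""
    start = timestamp - (timestamp % size)
    return (start, start + size)


def count_per_window(events: List[Tuple[int, str]], size: int) -> Dict[Window, int]:
    """Count events per tumbling window. Each event is (timestamp, value)."""
    counts: Dict[Window, int] = defaultdict(int)
    for ts, _value in events:
        counts[tumbling_window(ts, size)] += 1
    return dict(counts)


# Hypothetical events: (event time in seconds, payload)
events = [(1, "a"), (59, "b"), (60, "c"), (125, "d")]
counts = count_per_window(events, size=60)
print(counts)
```

With 60-second windows, the events at 1 s and 59 s share the window [0, 60), while the event at exactly 60 s starts the next window because the end bound is exclusive.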

Hopping Windows

[Figure: Dataflow hopping windows]
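
Hopping (sliding) windows have a fixed size but start at a regular period that is shorter than the size, so windows overlap and one element can belong to several of them. The following is a plain-Python sketch of that assignment rule, not Beam code; `hopping_windows` is a hypothetical helper and windows are assumed to be aligned to time 0. In the Beam Python SDK the corresponding windowing function is `window.SlidingWindows(size, period)`.

```python
from typing import List, Tuple

Window = Tuple[int, int]  # [start, end) in seconds


def hopping_windows(timestamp: int, size: int, period: int) -> List[Window]:
    """Return every [start, end) window of length `size`, started every
    `period` seconds, that contains `timestamp`."""
    # Latest window start that is at or before the timestamp.
    start = timestamp - (timestamp % period)
    windows = []
    # Walk backwards while the window still covers the timestamp.
    while start > timestamp - size:
        windows.append((start, start + size))
        start -= period
    return sorted(windows)


# 60-second windows that start every 30 seconds.
print(hopping_windows(70, size=60, period=30))
```

With a size of 60 and a period of 30, each element falls into size / period = 2 windows, which is why hopping windows multiply the amount of data downstream aggregations have to handle.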

Session Windows

[Figure: Dataflow session windows]
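
Session windows have no fixed size: a window covers a burst of activity for one key and closes once no new element arrives within a gap duration. The sketch below is plain Python showing that grouping rule for the events of a single key; it is not the Beam API, and `session_windows` and the sample timestamps are assumptions for illustration. In the Beam Python SDK the corresponding windowing function is `window.Sessions(gap_size)`.

```python
from typing import List


def session_windows(timestamps: List[int], gap: int) -> List[List[int]]:
    """Group one key's event timestamps into sessions.

    An event joins the current session if it arrives less than `gap`
    seconds after the previous event; otherwise it starts a new session.
    """
    sessions: List[List[int]] = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] < gap:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions


# Hypothetical click times (seconds) for a single user.
clicks = [0, 5, 8, 30, 33, 100]
sessions = session_windows(clicks, gap=10)
print(sessions)
```

Because sessions depend on the data itself, their boundaries are only known once the gap has elapsed, unlike tumbling and hopping windows whose boundaries are fixed in advance.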

Dataflow in action

Common use case: pushing data to multiple locations

Qwiklab: Dataflow with Python

Qwiklab: Dataflow templates

Dataflow pipelines can process both batch and streaming data.

Pipeline code for Dataflow can be written in Java or Python using the Apache Beam SDK.

Dataflow provides the execution framework: it acts as the managed runner that executes the Beam pipeline on worker machines.

Use IAM roles to control which users can access Dataflow resources.

A pipeline is a more maintainable and less error-prone way to organize data processing code.

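"GroupByKey" is an example of a Dataflow operation that can be computationally expensive, because all values for a given key must be shuffled to the same worker before they can be processed.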
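
To make the cost concrete, here is a plain-Python sketch (not Beam code) that compares grouping everything by key and then summing against pre-combining on each worker first. The two-worker `partitions` input and both function names are hypothetical; the second value each function returns is the number of records that would have to cross the shuffle. In the Beam Python SDK the pre-combining approach corresponds to `beam.CombinePerKey(sum)`.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

KV = Tuple[str, int]

# Hypothetical input, already split across two workers.
partitions: List[List[KV]] = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("a", 4), ("b", 5), ("b", 6)],
]


def group_by_key_then_sum(parts: List[List[KV]]) -> Tuple[Dict[str, int], int]:
    """Shuffle every record, group by key, then sum each group."""
    shuffled = [kv for part in parts for kv in part]  # every record is shuffled
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in shuffled:
        grouped[key].append(value)
    return {k: sum(vs) for k, vs in grouped.items()}, len(shuffled)


def combine_per_key(parts: List[List[KV]]) -> Tuple[Dict[str, int], int]:
    """Sum locally on each worker, then shuffle only the partial sums."""
    partials: List[KV] = []
    for part in parts:
        local: Dict[str, int] = defaultdict(int)
        for key, value in part:
            local[key] += value
        partials.extend(local.items())  # one record per key per worker
    totals: Dict[str, int] = defaultdict(int)
    for key, value in partials:
        totals[key] += value
    return dict(totals), len(partials)


grouped_totals, grouped_shuffled = group_by_key_then_sum(partitions)
combined_totals, combined_shuffled = combine_per_key(partitions)
print(grouped_totals, grouped_shuffled)
print(combined_totals, combined_shuffled)
```

Both approaches produce the same totals, but the pre-combined version shuffles one record per key per worker instead of one record per input element. This only works when the aggregation is associative and commutative, such as sum, count, min, or max.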

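Understand side inputs: a side input is an additional input that a transform can read in full each time it processes an element of its main input.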
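
As a rough illustration of the idea, the plain-Python sketch below enriches each element of a main input using a small lookup table that is passed whole to every call, which is the role a side input plays. This is not Beam code: `enrich`, `orders`, and `country_names` are made-up names. In the Beam Python SDK a side input is typically passed as an extra argument to a transform such as `beam.Map`, wrapped with a view like `beam.pvalue.AsDict` or `beam.pvalue.AsSingleton`.

```python
from typing import Dict, List, Tuple

Order = Tuple[str, int]  # (country code, amount)


def enrich(order: Order, country_names: Dict[str, str]) -> Tuple[str, int]:
    """Process one main-input element with access to the whole side input."""
    code, amount = order
    return (country_names.get(code, "unknown"), amount)


# Main input: typically large, processed element by element.
orders: List[Order] = [("US", 10), ("DE", 7), ("XX", 1)]

# Side input: typically small, made available in full to every element.
country_names = {"US": "United States", "DE": "Germany"}

enriched = [enrich(order, country_names) for order in orders]
print(enriched)
```

The design point is that the side input must be small enough to be handed to every worker; joining two large collections is usually better expressed as a keyed join instead.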