Google Dataflow
Dataflow is a serverless data processing service for both streaming and batch data.
Dataflow fundamentals
Dataflow is Google's managed service for Apache Beam.
The fundamental components of Apache Beam are:
- Pipeline :: a user-constructed graph that defines the desired data processing operations
- PCollection :: an immutable data set or data stream that the pipeline operates on
- PTransform :: a data processing operation (one step in the pipeline graph)
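A tiny plain-Python analogue (no Beam dependency; the names below are illustrative, not the Beam API) makes the three concepts concrete:

```python
# Plain-Python analogue of the three Beam concepts (illustrative only,
# not the Beam API): a list stands in for a PCollection, functions for
# PTransforms, and their chained application for the Pipeline graph.

def read_words(source):                 # a "PTransform" that produces data
    return list(source)

def count_words(words):                 # a "PTransform" over a collection
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts

words = read_words(["to", "be", "or", "not", "to", "be"])  # a "PCollection"
result = count_words(words)             # applying a transform yields new data
```

In real Beam code, each step is instead chained with the `|` operator onto a `Pipeline` object.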
Dataflow windows
- Tumbling Windows :: fixed-size, non-overlapping windows ("fixed windows" in Beam)
- Hopping Windows :: fixed-size windows that overlap, with a new window starting every period ("sliding windows" in Beam)
- Session Windows :: data-driven windows that close after a gap of inactivity
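A plain-Python sketch of how each window type assigns timestamps (in seconds), following the Beam definitions rather than the Beam API:

```python
def tumbling_window(ts, size):
    """Fixed-size, non-overlapping: each timestamp lands in exactly one window."""
    start = (ts // size) * size
    return (start, start + size)

def hopping_windows(ts, size, period):
    """Fixed-size but overlapping: a new window starts every `period` seconds,
    so one timestamp can land in several windows."""
    windows = []
    start = (ts // period) * period        # latest window start at or before ts
    while start > ts - size:
        windows.append((start, start + size))
        start -= period
    return sorted(windows)

def session_windows(timestamps, gap):
    """Data-driven: a session ends after `gap` seconds with no new elements."""
    ts_sorted = sorted(timestamps)
    sessions, current = [], [ts_sorted[0]]
    for ts in ts_sorted[1:]:
        if ts - current[-1] < gap:
            current.append(ts)
        else:
            sessions.append(current)
            current = [ts]
    sessions.append(current)
    return sessions
```

For example, with 60-second tumbling windows a timestamp of 65 lands in (60, 120), while 4-second hopping windows with a 2-second period place a timestamp of 5 in two windows.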
Dataflow in action
Common use case: pushing data from one source to multiple destinations (link)
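The multi-destination pattern can be sketched in plain Python (hypothetical sink names; in Beam, one PCollection is branched into several PTransforms, each feeding a different sink):

```python
# Plain-Python sketch of fan-out: one input collection is branched to
# several destinations (hypothetical names, not real sinks).
records = [{"id": 1, "valid": True}, {"id": 2, "valid": False}]

warehouse_rows = [r for r in records if r["valid"]]      # e.g. load into BigQuery
error_records  = [r for r in records if not r["valid"]]  # e.g. write to an error table
archive        = list(records)                           # e.g. archive everything to GCS
```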
Qwiklab: Dataflow with Python
Qwiklab: Dataflow templates
Dataflow supports both batch and streaming pipelines.
Dataflow allows pipeline code to be written in Java or Python.
Dataflow provides the execution framework.
Use IAM roles to control users' access to Dataflow resources.
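For example, the predefined Dataflow roles (Viewer, Developer, Admin, Worker) can be granted with gcloud; the project and member below are placeholders:

```shell
# Grant the Dataflow Developer role (placeholder project and user).
gcloud projects add-iam-policy-binding my-project \
    --member="user:alice@example.com" \
    --role="roles/dataflow.developer"
```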
A pipeline is a more maintainable and less error-prone way to organize data processing code.
"GroupByKey" is an example of a dataflow operation that can be computationally expensive.
Understand side inputs!! A side input provides a ParDo with additional data beyond its main input PCollection, e.g. a small lookup table available to every worker.
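A plain-Python sketch of the idea (hypothetical names; in Beam the side input is passed as an extra argument to a ParDo or Map):

```python
# Plain-Python sketch of a side input (illustrative names, not the Beam
# API): `prices` is a small lookup made available alongside every element
# of the main input, the way a Beam side input is passed to a ParDo.

def price_orders(orders, prices):       # `prices` plays the side-input role
    return [(item, qty * prices[item]) for item, qty in orders]

priced = price_orders([("apple", 2), ("pear", 3)], {"apple": 1, "pear": 2})
```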