
Google Dataproc

Cloud Dataproc is Google's fully managed service for running Apache Spark and Apache Hadoop clusters.

Parquet

Parquet is a file format that is commonly used in the Hadoop ecosystem.

Parquet Explainer

Parquet is an open source file format available to any project in the Hadoop ecosystem. Apache Parquet provides an efficient, performant, flat columnar storage format for data, in contrast to row-based formats such as CSV or TSV.

Parquet uses the record shredding and assembly algorithm, which is superior to simply flattening nested structures. Parquet is optimized to work with complex data in bulk and offers several data compression and encoding schemes. This approach is especially effective for queries that need to read only certain columns from a large table: Parquet reads just the needed columns, greatly minimizing I/O.
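
A minimal PySpark sketch of that column pruning in practice; the output path and column names here are hypothetical:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parquet-example").getOrCreate()

    # Write a small DataFrame out as Parquet.
    people = spark.createDataFrame(
        [("alice", 34, "NY"), ("bob", 29, "CA")],
        ["name", "age", "state"],
    )
    people.write.mode("overwrite").parquet("/tmp/people.parquet")

    # Select only the columns the query needs; because Parquet is columnar,
    # the unused "state" column is never read from disk.
    df = spark.read.parquet("/tmp/people.parquet")
    df.select("name", "age").where(df.age > 30).show()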

Migration

Notes on migrating from an on-premises Hadoop cluster to the cloud.

Qwiklab link

Storage

When using Dataproc it is possible to store data externally to the cluster.

For example, data that would normally live in HDFS can be stored in Cloud Storage, and HBase-type data can be stored in Cloud Bigtable.
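
A minimal PySpark sketch of reading from and writing to Cloud Storage instead of cluster-local HDFS, assuming the job runs on a Dataproc cluster where the Cloud Storage connector is available; the bucket and column names are hypothetical:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("gcs-example").getOrCreate()

    # Read input directly from Cloud Storage rather than HDFS.
    logs = spark.read.parquet("gs://my-example-bucket/logs/")

    # Process as usual, then write results back to Cloud Storage so they
    # outlive the cluster.
    (logs.groupBy("status").count()
         .write.mode("overwrite")
         .parquet("gs://my-example-bucket/output/status_counts/"))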

Misc

Dataproc allows a cluster to be spun up or torn down within a minute or two. This means a cluster can be created and tuned specifically for each task, rather than maintaining a general-purpose cluster on premises to serve many needs.
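
A sketch of that ephemeral-cluster pattern using the google-cloud-dataproc Python client; the project, region, and cluster names are hypothetical, and the exact calls should be checked against the current library version:

    from google.cloud import dataproc_v1

    project_id = "my-project"
    region = "us-central1"
    cluster_name = "ephemeral-analysis"

    cluster_client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    # Create a small cluster just for this task.
    cluster = {
        "project_id": project_id,
        "cluster_name": cluster_name,
        "config": {
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
            "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
        },
    }
    cluster_client.create_cluster(
        request={"project_id": project_id, "region": region, "cluster": cluster}
    ).result()

    # ... submit Spark/Hadoop jobs here ...

    # Tear the cluster down again once the task is done.
    cluster_client.delete_cluster(
        request={"project_id": project_id, "region": region, "cluster_name": cluster_name}
    ).result()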

For complicated analyses that require transforms, it is possible to export data from BigQuery and process it in Spark on Dataproc.
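
One way to do this is with the spark-bigquery connector, which handles the export behind the scenes. This sketch assumes the connector is available on the cluster (for example supplied via the job's --jars flag); the table and bucket names are hypothetical:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("bq-to-spark").getOrCreate()

    # Read a BigQuery table into a Spark DataFrame.
    events = (spark.read.format("bigquery")
              .option("table", "my-project.analytics.events")
              .load())

    # Apply a transform that is awkward to express in SQL, then persist the result.
    (events.groupBy("user_id").count()
           .write.mode("overwrite")
           .parquet("gs://my-example-bucket/output/event_counts/"))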

Use initialization actions to install custom software, and cluster properties to configure Hadoop.
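
A sketch of how those two settings appear in the cluster config passed to create_cluster in the earlier snippet; the script path and property values are hypothetical:

    cluster = {
        "project_id": "my-project",
        "cluster_name": "ephemeral-analysis",
        "config": {
            # Initialization actions: scripts run on each node at creation time,
            # typically used to install custom software.
            "initialization_actions": [
                {"executable_file": "gs://my-example-bucket/scripts/install-deps.sh"}
            ],
            # Cluster properties: overrides for Hadoop/Spark config files, keyed
            # by a file prefix such as core:, hdfs:, or spark:.
            "software_config": {
                "properties": {
                    "spark:spark.executor.memory": "4g",
                    "core:hadoop.tmp.dir": "/mnt/tmp",
                }
            },
        },
    }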