Data Engineering Practice Exam Notes

Dataproc is Google's managed service for running Spark and Hadoop jobs.

https://cloud.google.com/dataproc/
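
As a quick refresher, a minimal sketch of submitting a Spark job to an existing Dataproc cluster with the google-cloud-dataproc Python client; the project, region, and cluster names are placeholders, and the jar path assumes the SparkPi example shipped on standard Dataproc images.

```python
# Sketch: submit a Spark job to an assumed, already-running Dataproc cluster.
from google.cloud import dataproc_v1

project_id = "my-project"      # placeholder project ID
region = "us-central1"         # placeholder region
cluster_name = "my-cluster"    # placeholder existing cluster

job_client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

job = {
    "placement": {"cluster_name": cluster_name},
    "spark_job": {
        # SparkPi example bundled with the Spark install on the cluster
        "main_class": "org.apache.spark.examples.SparkPi",
        "jar_file_uris": ["file:///usr/lib/spark/examples/jars/spark-examples.jar"],
        "args": ["1000"],
    },
}

operation = job_client.submit_job_as_operation(
    request={"project_id": project_id, "region": region, "job": job}
)
response = operation.result()  # blocks until the job completes
print(response.status.state)
```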

TO DO

Create semi-detailed notes on the following products:

  • Dataproc (Y)
  • Google Dataflow
  • Pub/Sub
  • Bigtable (particularly its use cases vs BigQuery)
  • Cloud Spanner
  • AI Platform
  • Firestore
  • Dialogflow
  • Datastore
  • TensorFlow
  • Cloud SQL

Understand the difference between Cloud SQL and Cloud Spanner, and when to use each.

Managed services still have some IT overhead.

Understand the array of machine learning options.

In Datastore, your data values are properties, properties are contained in an entity, and each entity belongs to a kind.
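
A minimal sketch of that kind → entity → property hierarchy using the google-cloud-datastore Python client; the "Task" kind and its properties are made-up examples.

```python
from google.cloud import datastore

client = datastore.Client()

key = client.key("Task")                # "Task" is the kind
entity = datastore.Entity(key=key)      # an entity of that kind
entity.update({                         # properties live inside the entity
    "description": "Study the Datastore data model",
    "done": False,
})
client.put(entity)
```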

Arrays vs structs
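
Assuming this refers to BigQuery's ARRAY and STRUCT types, a small sketch that contrasts them using literal values through the google-cloud-bigquery client (no real table needed):

```python
from google.cloud import bigquery

client = bigquery.Client()

query = """
SELECT
  [1, 2, 3] AS an_array,                           -- ARRAY: ordered list of one type
  STRUCT('Ada' AS name, 36 AS age) AS a_struct,    -- STRUCT: named, typed fields
  -- Arrays of structs are the usual pattern for nested/repeated records:
  [STRUCT('a' AS k, 1 AS v), STRUCT('b' AS k, 2 AS v)] AS repeated_record
"""

for row in client.query(query).result():
    print(row.an_array, row.a_struct, row.repeated_record)
```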

ML APIs

https://cloud.google.com/speech-to-text/docs/reference/rest/v1beta1/RecognitionConfig

https://cloud.google.com/speech-to-text/docs/sync-recognize

https://cloud.google.com/speech-to-text/docs/best-practices
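
A minimal sketch of synchronous recognition along the lines of the sync-recognize guide above; the GCS URI is a placeholder, and synchronous requests only handle short audio (roughly a minute).

```python
from google.cloud import speech

client = speech.SpeechClient()

audio = speech.RecognitionAudio(uri="gs://my-bucket/audio.raw")  # placeholder object
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
)

response = client.recognize(config=config, audio=audio)
for result in response.results:
    print(result.alternatives[0].transcript)
```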

What are the 3 modes of the Natural Language API?
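
A quick sketch with the Python client, assuming the three modes refer to sentiment, entity, and syntax analysis (the API also offers entity sentiment and text classification):

```python
from google.cloud import language_v1

client = language_v1.LanguageServiceClient()
document = language_v1.Document(
    content="Cloud Natural Language makes text analysis easy.",
    type_=language_v1.Document.Type.PLAIN_TEXT,
)

sentiment = client.analyze_sentiment(request={"document": document})
entities = client.analyze_entities(request={"document": document})
syntax = client.analyze_syntax(request={"document": document})

print(sentiment.document_sentiment.score)
print([e.name for e in entities.entities])
print([t.text.content for t in syntax.tokens])
```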

Become familiar with each pre-trained model

  • Natural Language
  • Speech-to-Text
  • Text-to-Speech
  • Cloud Translation
  • Video Intelligence
  • Vision

Data Studio: https://cloud.google.com/ai-platform/docs/technical-overview

Questions and answer logic

Your company built a TensorFlow neural-network model with a large number of neurons and layers. The model fits the training data well. However, when tested against new data, it performs poorly. What method can you employ to address this?

  1. Threading
  2. Serialization
  3. Dropout Methods
  4. Dimensionality Reduction

Explanation

Fitting well on training data but poorly on new data is a sign of overfitting. Dropout is a technique specifically designed to reduce overfitting in neural networks (link).

Number 3
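
As an illustration of where dropout fits, a small Keras sketch; the layer sizes and dropout rate are arbitrary choices for the example.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu", input_shape=(100,)),
    tf.keras.layers.Dropout(0.5),   # randomly zeroes 50% of activations during training
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```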


You are building a model to make clothing recommendations. You know a user's fashion preference is likely to change over time, so you build a data pipeline to stream new data back to the model as it becomes available. How should you use this data to train the model?

  1. Continuously retrain the model on just the new data
  2. Continuously retrain the model on a combination of existing data and the new data.
  3. Train on the existing data while using the new data as your test set
  4. Train on the new data while using the existing data as your test set

Explanation

Assuming continued accuracy is the goal: for the model to stay accurate as preferences change, it should be trained on the new data, which rules out options 2 and 3. A test set is still necessary, which rules out option 1, so option 4 is the answer.


You have a petabyte of analytics data and need to design a storage and processing platform for it. You must be able to perform data warehouse-style analytics on the data in Google Cloud and expose the dataset as files for batch analysis tools in other cloud providers. What should you do?

  1. Store and process the entire dataset in BigQuery
  2. Store and process the entire dataset in Cloud Bigtable.
  3. Store the full dataset in BigQuery and store a compressed copy of the data in a Cloud Storage bucket.
  4. Store the warm data as files in Cloud Storage and store the active data in BigQuery. Keep this ratio as 80% warm and 20% active.

Explanation

The data needs to be exposed as files to tools on other platforms. BigQuery and Bigtable do not allow this, so 1 and 2 are excluded. Compressed data isn't always usable for ingestion by other tools, so 3 is excluded, leaving 4 as the answer.