Data Engineering Practice Exam Notes
Dataproc is Google's managed service for running Spark and Hadoop jobs.
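A minimal sketch of submitting a PySpark job with the google-cloud-dataproc Python client; the project, region, cluster name, and GCS script path are hypothetical:

```python
from google.cloud import dataproc_v1

# Hypothetical identifiers for illustration only.
project_id = "my-project"
region = "us-central1"
cluster_name = "my-cluster"

# The job client must point at the regional Dataproc endpoint.
job_client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

job = {
    "placement": {"cluster_name": cluster_name},
    "pyspark_job": {"main_python_file_uri": "gs://my-bucket/wordcount.py"},
}

# Submit the job and block until it finishes.
operation = job_client.submit_job_as_operation(
    request={"project_id": project_id, "region": region, "job": job}
)
response = operation.result()
print(response.status.state)
```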
Create semi-detailed notes on the following products:
- Dataproc (Y)
- Google Dataflow
- Bigtable (particularly its use cases vs BigQuery)
- Cloud Spanner
- AI Platform
Understand the difference between Cloud SQL and Cloud Spanner, and when to use each.
Managed services still have some IT overhead.
Understand the array of machine learning options.
In Datastore, a data value is a property, properties are contained in an entity, and entities are grouped into kinds.
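A small sketch with the google-cloud-datastore client, assuming default credentials; the "Task" kind and its properties are hypothetical:

```python
from google.cloud import datastore

client = datastore.Client()  # assumes default credentials and project

# "Task" is the kind; "sample-task" is the entity's key name.
key = client.key("Task", "sample-task")
entity = datastore.Entity(key=key)

# Each field set here is a property on the entity.
entity.update({"description": "Review exam notes", "done": False, "priority": 2})
client.put(entity)
```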
Arrays vs structs (BigQuery nested and repeated fields; see the sketch below)
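A sketch of the distinction using the BigQuery Python client, assuming default credentials; the inline data is made up. A STRUCT groups named fields, while an ARRAY is a repeated field that has to be UNNESTed before its elements can be selected:

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes default credentials and project

# Each array element is a STRUCT; person.skills is an ARRAY inside it.
sql = """
SELECT person.name, skill
FROM UNNEST([
  STRUCT('alice' AS name, ['sql', 'python'] AS skills),
  STRUCT('bob'   AS name, ['spark'] AS skills)
]) AS person,
UNNEST(person.skills) AS skill
"""

for row in client.query(sql).result():
    print(row.name, row.skill)
```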
What are the 3 modes of the Natural Language API? (See the sketch after the list below.)
Become familiar with each pre-trained model
- Natural Language
- Speech-to-Text
- Text-to-Speech
- Cloud Translation
- Video Intelligence
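For the Natural Language API question above, a sketch using the language_v1 client (assuming default credentials) that exercises the three commonly cited modes: sentiment analysis, entity analysis, and syntax analysis:

```python
from google.cloud import language_v1

client = language_v1.LanguageServiceClient()  # assumes default credentials

document = language_v1.Document(
    content="Google Cloud makes these exam notes easier to organise.",
    type_=language_v1.Document.Type.PLAIN_TEXT,
)

# Sentiment analysis: overall score (-1..1) and magnitude.
sentiment = client.analyze_sentiment(request={"document": document})
print(sentiment.document_sentiment.score)

# Entity analysis: entities mentioned in the text and their salience.
entities = client.analyze_entities(request={"document": document})
print([e.name for e in entities.entities])

# Syntax analysis: tokens and part-of-speech information.
syntax = client.analyze_syntax(request={"document": document})
print([t.text.content for t in syntax.tokens])
```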
AI Platform: https://cloud.google.com/ai-platform/docs/technical-overview
Questions and logic answers
Your company built a TensorFlow neural-network model with a large number of neurons and layers. The model fits the training data well. However, when tested against new data, it performs poorly. What method can you employ to address this?
- Dropout Methods
- Dimensionality Reduction
Fitting well on the training data but poorly on new data is a sign of overfitting. Dropout is a technique specifically used to reduce overfitting in neural networks (link).
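A minimal Keras sketch of dropout; the layer sizes and input shape are hypothetical:

```python
import tensorflow as tf

# Dropout layers randomly zero a fraction of activations during training,
# which discourages co-adaptation of neurons and reduces overfitting.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=10)
```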
You are building a model to make clothing recommendations. You know a user's fashion preference is likely to change over time, so you build a data pipeline to stream new data back to the model as it becomes available. How should you use this data to train the model?
- Continuously retrain the model on just the new data
- Continuously retrain the model on a combination of existing data and the new data.
- Train on the existing data while using the new data as your test set
- Train on the new data while using the existing data as your test set
Assuming continued accuracy is the goal: because preferences change, the model should be trained on only the new data, which excludes 2 and 3. A test set is still necessary, which rules out 1, so 4 is the answer.
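A toy scikit-learn sketch of that answer (train on the new data, hold out the existing data as the test set); all arrays below are synthetic stand-ins:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hypothetical feature matrices: new_X/new_y arrive from the streaming
# pipeline, old_X/old_y are the existing historical data.
rng = np.random.default_rng(0)
new_X, new_y = rng.normal(size=(500, 8)), rng.integers(0, 2, 500)
old_X, old_y = rng.normal(size=(500, 8)), rng.integers(0, 2, 500)

# Train on the new data only, and use the existing data as the test set.
model = LogisticRegression(max_iter=1000).fit(new_X, new_y)
print("accuracy on existing data:", accuracy_score(old_y, model.predict(old_X)))
```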
You have a petabyte of analytics data and need to design a storage and processing platform for it. You must be able to perform data warehouse-style analytics on the data in Google Cloud and expose the dataset as files for batch analysis tools in other cloud providers. What should you do?
- Store and process the entire dataset in BigQuery
- Store and process the entire dataset in Cloud Bigtable.
- Store the full dataset in BigQuery and store a compressed copy of the data in a cloud storage bucket.
- Store the warm data as files in Cloud Storage and store the active data in BigQuery. Keep this ratio at 80% warm and 20% active.
The data needs to be exposed as files to tools on other platforms; BigQuery and Bigtable alone do not allow this, so 1 and 2 are excluded. Compressed data isn't always available for ingestion by other tools, so 3 is excluded, leaving 4.
You are designing storage for 20 TB of text files as part of deploying a data pipeline on Google Cloud. Your input data is in CSV format. You want to minimize the cost of querying aggregate values for multiple users who will query the data in Cloud Storage with multiple engines. Which storage service and schema design should you use?
- Use Cloud Bigtable for storage. Install the HBase shell on a Compute Engine instance to query the Cloud Bigtable data.
- Use Cloud Bigtable for storage. Link as permanent tables in BigQuery for query.
- Use Cloud Storage for storage. Link as permanent tables in BigQuery for query.
- Use Cloud Storage for storage. Link as temporary tables in BigQuery for query.
"users will query the data in Cloud Storage" implies the data should be stored in cloud storage. This excludes 1 and 2. data will be used multiple times, so this rules out 4.
You want to process payment transactions in a point-of-sale application that will run on Google Cloud Platform. Your user base could grow exponentially, but you do not want to manage infrastructure scaling. Which Google database service should you use?
- Cloud SQL
- Cloud Bigtable
- Cloud Datastore
This database acts as a transactional application backend and must scale without infrastructure management. Cloud SQL cannot absorb exponential growth without scaling work, which rules out 1. Bigtable scaling is handled by the user by adding nodes, and it is not suited to a transactional application backend, which rules out 2. Cloud Datastore (3) scales automatically with no infrastructure to manage, so it is the answer.
You want to use a database of information about tissue samples to classify future tissue samples as either normal or mutated. You are evaluating an unsupervised anomaly detection method for classifying tissue samples. Which two characteristics support this method? (Choose two.)
- There are very few occurrences of mutations relative to normal samples.
- There are roughly equal occurrences of both normal and mutated samples in the database.
- You expect future mutations to have different features from the mutated samples in the database.
- You expect future mutations to have similar features to the mutated samples in the database.
- You already have labels for which samples are mutated and which are normal in the database.
If future mutations have different features from the current mutated samples, the model will not be able to recognise them, which excludes 3. Unsupervised learning means labelled samples are not used, which rules out 5. Anomaly detection implies there are relatively few anomalies in the data, which rules out 2, leaving 1 and 4.
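As an illustration only (the question names no specific algorithm), an IsolationForest is one unsupervised anomaly detector that fits this setting; the tissue-sample features below are synthetic:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical feature vectors: most samples are "normal", a handful are
# mutated and look different from the bulk of the data.
rng = np.random.default_rng(42)
normal = rng.normal(loc=0.0, scale=1.0, size=(980, 5))
mutated = rng.normal(loc=6.0, scale=1.0, size=(20, 5))
samples = np.vstack([normal, mutated])

# No labels are used: the model learns what "normal" looks like and flags
# the rare points that deviate from it (predicted label -1 = anomaly).
detector = IsolationForest(contamination=0.02, random_state=0).fit(samples)
predictions = detector.predict(samples)
print("flagged as anomalous:", int((predictions == -1).sum()))
```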