Skip to content

BigQuery

BigQuery is a fully managed serverless data warehouse.

BigQuery product overview page

Users write ansi SQL to extract data from BigQuery.

BigQuery object hierarchy:

  • Project
  • Dataset
  • Table

Within BigQuery storage is separated from compute.

BigQuery costs are a combination of the size of the data processed, and the amount of compute needed to execute the query.

It is possible to do batch uploads and streaming inserts into BigQuery.

Common sources for data loaded into BigQuery include csv, json and avro files stored on Cloud Storage. Cloud dataflow is also a tool commonly used to load processed data into BigQuery.

Data in BigQuery is stored in a columnar format.

BigQuery best practice patterns:

  • Avoid self joins - use window functions instead
  • If your query processes keys that are heavily skewed to a few values, filter your data as early as possible. Each slot (compute) is responsible for a partition or group of partitions, if there is a particular partition that is particularly large it will still be handled by a single slot.
  • Avoid joins that generate more outputs than inputs. when a cross join is required , pre-aggregate your data.
  • Avoid updating or inserting 1 row at a time. Batch updates and inserts where possible.

An authorised view in BigQuery lets you share query results with particular users and groups without giving them access to the underlying tables.

A common access pattern is to grant users and groups access to a dataset. Users who access a dataset have access to all the tables, views and models held within.

It is also possible to grant users access to a specific table.

It is possible to use BigQuery to analyse data stored in Google Cloud Storage. This can be done by either creating permanent or temporary tables based on the data in Google Cloud storage. Link

A temporary table is best used in one-off instances.