Three Ways to Get Data Out of BigQuery
Python | Google Cloud Platform | 2021-03-17
BigQuery is just a data store, hosted by Google, part of the Google Cloud set of services. It's a "Big Data" tool, designed for use with large datasets. The cost structure makes it inefficient for everyday data uses, such as storing user accounts or backing a publishing CMS. Instead, it's best for datasets that can grow from millions to billions of rows - event or statistical data, data that's going to be used with analysis or business intelligence tools.
You can access BigQuery public datasets by using the Cloud Console, by using the bq command-line tool, or by making calls to the BigQuery REST API using client libraries written in Python and other languages.
For learning purposes, use public datasets. You do need to have your own Google Cloud project, but you don't have to add billing information. Google Cloud offers a free tier - you can query up to 1 TB per month, so you have some room to play before the requirement for billing kicks in.
The Console
The browser-based console is the easiest tool for requesting BigQuery data.
Let's start with a really small database - the `stations` table in the `new_york_subway` dataset in the `bigquery-public-data` project:
When you click over to that table, you'll see a few different menus. Look for the three tabs `Schema`, `Details`, and `Preview`:
- Schema shows the column names, types, and modes defined for the table
- Details shows metadata about the table, like the number of rows, the last updated date, partition information, etc.
- Preview lets you browse the data - you can scroll from page to page, viewing 10 rows at a time
To open a query editor, click on the QUERY TABLE button. This will open a window with an incomplete query, autopopulated with the current date as a partition filter, and no columns identified:
SELECT FROM `bigquery-public-data.new_york_subway.stations` WHERE DATE(_PARTITIONTIME) = "2021-03-16" LIMIT 1000
You will get an error message if you try to run this, as it's not valid SQL.
Modify the query to make it valid:
SELECT * FROM `bigquery-public-data.new_york_subway.stations` WHERE DATE(_PARTITIONTIME) = "2021-03-16" LIMIT 1000
Press the RUN button. This second attempt will run but will not return results. The `Details` tab showed us that the table is partitioned, but there is no data in the partition for "2021-03-16". The last time the table was updated was in September 2019. To see what partitions we have available, we can try:
SELECT DISTINCT _PARTITIONTIME as partitiontime FROM `bigquery-public-data.new_york_subway.stations`
On a larger table this query could be costly, but because this table only has 496 rows, we can run it without incurring costs.
Press the RUN button. The result is just one row, so all the data in the table is in the same partition (fwiw this is not a very efficient use of partitions, but then again this is flat data that doesn't have a logical date division - we're just sampling here).
2019-09-18 00:00:00 UTC
Next, run the original query with this partition date value and expect to see all 496 rows in your result:
SELECT * FROM `bigquery-public-data.new_york_subway.stations` WHERE DATE(_PARTITIONTIME) = "2019-09-18"
Other things you can do in the console:
- Click on `MORE > Format query` to make your query more readable in the Editor window
- `MORE > Query settings` allows you to do things like send results to a temp table, or select standard or legacy SQL
What you can do with your results:
- Export to a file (in various formats):
- CSV or JSON with file size limits - on Google Drive up to 1GB, local up to 16k rows
- Google Sheets up to 16k rows
- Copy to clipboard, up to 16k rows
- Save to a new table:
- The only export option with no limits is to save to a new BigQuery table
- If you're working with a very large dataset, a common pattern is to query once to generate a separate lookup table, then query that second table to drill down to the detailed information you need
- Export to GCS
- No query, just follow Export > Export to GCS from the table view and select a target format (CSV, JSON, or Avro)
Command Line Tool
The console is great for exploring and even exporting some small sets of data, but sometimes you just want to be able to perform database operations from the command line, you know?
`bq` is the BigQuery command line tool that's packaged with the Google Cloud SDK, so when you install it you'll also get the `gsutil` and `gcloud` tools. To install the whole bundle, run:
curl https://sdk.cloud.google.com | bash
You'll have to answer some questions and go through an auth process with your project before you can query from the public datasets:
gcloud init
gcloud auth login
gcloud config set project your-project-name
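At this point a quick sanity check is useful - listing the tables in one of the public datasets is a cheap way to confirm that auth worked (note that `bq` arguments use the legacy-style `project:dataset` separator):
bq ls bigquery-public-data:samples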
Once that's all done, you can run queries just by passing raw SQL in single quotes:
bq query --nouse_legacy_sql 'SELECT COUNT(*) FROM `bigquery-public-data.samples.shakespeare`'
The `--nouse_legacy_sql` flag is an awkward way to force BigQuery to use standard SQL. In my opinion, standard SQL should be the default, and maybe that's coming, but in the meantime you have to pass that setting. Alternatively, you can use `--use_legacy_sql=False`. It's weird, I know.
bq query --use_legacy_sql=False 'SELECT COUNT(*) FROM `bigquery-public-data.samples.shakespeare`'
There are a few other flags you can use with the `bq query` command, but the two you'll probably use most commonly are:
- --format=prettyjson (or --format=csv)
- --destination_table TABLE
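For example, something like this should print the results as CSV and also save them to a table in your own project (a sketch rather than a recipe - `my_dataset.shakespeare_counts` is just a placeholder, and the dataset itself has to exist before you run it):
bq query --format=csv --nouse_legacy_sql --destination_table=my_dataset.shakespeare_counts 'SELECT word, word_count FROM `bigquery-public-data.samples.shakespeare` LIMIT 10'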
For more information about how to use the `bq` command line tool, check out the Google docs. There is also a separate command and flag reference. Finally, you can run `bq --help` to get a list of global flags, `bq help` to get a list of commands, and `bq help <command>` to get help for a specific command.
BigQuery Client
The command line can be convenient and easy to use, but sometimes you need to interact with BigQuery data programmatically. For that, you want the google-cloud-bigquery library. Google does have versions of this BigQuery client for Java and a few other languages, but here we're just talking about Python.
This library is dead simple to use. You can install it with pip, then all you have to do is import, instantiate the BigQuery client, and pass a query to the client.
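The package is published on PyPI as `google-cloud-bigquery`:
pip install google-cloud-bigquery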
from google.cloud import bigquery
client = bigquery.Client()
# Perform a query.
QUERY = (
    'SELECT * FROM `bigquery-public-data.new_york_subway.stations` '
    'WHERE DATE(_PARTITIONTIME) = "2019-09-18"'
)
query_job = client.query(QUERY)
rows = query_job.result()
for row in rows:
    print(row.name)
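If you'd rather not loop over `Row` objects, the client can also hand the results to you as a pandas DataFrame. A quick sketch, assuming you've installed the pandas extra (`pip install 'google-cloud-bigquery[pandas]'`) - nothing later in this post depends on it:
# Same query, but materialized as a DataFrame instead of Row objects
df = client.query(QUERY).to_dataframe()
print(df.head())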
One key thing this sample is missing is a reference to credentials. When you're working with your own Google Cloud account, you'll need to generate and download a JSON credentials file to give yourself access. It's a simple process - there's a guide in the Google Cloud docs. Once you have that file saved locally, set an environment variable `GOOGLE_APPLICATION_CREDENTIALS` where the value is the path to that file:
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json
You don't need to reference the ENV var in your script - just having it set in your environment is enough.
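If you'd rather not rely on the environment variable, the client will also accept credentials explicitly. A minimal sketch, assuming the same JSON key file path as above:
from google.cloud import bigquery
from google.oauth2 import service_account

# Build credentials from the downloaded key file and hand them to the client
credentials = service_account.Credentials.from_service_account_file('/path/to/credentials.json')
client = bigquery.Client(credentials=credentials, project=credentials.project_id)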
The query sample above is a really simple example. There we're just using the client to read data, but it's also commonly used to write data into tables, which requires a number of configuration options - you have to define the source format (such as `NEWLINE_DELIMITED_JSON` or `CSV`), partition settings, etc. Here's a slightly redacted example from some of my own code: a load job that replaces the entire contents of a table by loading a CSV file from its location in Google Cloud Storage into an existing target table:
from google.cloud import bigquery
client = bigquery.Client()
# Placeholder names - BigQuery dataset and table IDs use underscores, not hyphens
dataset_ref = client.dataset('my_dataset')
table_name = 'my_table'
job_config = bigquery.LoadJobConfig()
job_config.write_disposition = 'WRITE_TRUNCATE'
job_config.schema_update_options = [
    # If the CSV file has columns that are not in the existing table,
    # allow the new columns to be added during the load job
    bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION
]
job_config.source_format = bigquery.SourceFormat.CSV
job_config.skip_leading_rows = 1
job_config.allow_quoted_newlines = True
# Optional partitioning configuration
job_config.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY)
load_job = client.load_table_from_uri(
    'gs://%s/%s' % ('gcs-bucket-path', 'gcs-file-name'),
    dataset_ref.table(table_name),
    job_config=job_config)  # API request
try:
    print(f'Starting job {load_job.job_id}')
    load_job.result()  # Wait for the load job to complete
    print('Job finished.')
    destination_table = client.get_table(dataset_ref.table(table_name))
    print(f'Loaded {destination_table.num_rows} rows to {destination_table.table_id}')
except Exception as e:
    print(f'Job errors: {load_job.errors}')
    print(f'Error: {e}')
Reference the general BigQuery client documentation and the Python client library documentation for more.
So there you go. Three easy ways to pull data out of BigQuery tables. What you do with it is up to you.