Loading Parquet Data from Google Cloud Storage

This page provides an overview of loading Parquet data from Cloud Storage into BigQuery.

Parquet is an open source column-oriented data format that is widely used in the Apache Hadoop ecosystem.

When you load Parquet data from Cloud Storage, you can load the data into a new table or partition, or you can append to or overwrite an existing table or partition. When your data is loaded into BigQuery, it is converted into columnar format for Capacitor (BigQuery's storage format).

For information about loading Parquet data from a local file, see Loading data into BigQuery from a local data source.

Parquet schemas

When you load Parquet files into BigQuery, the table schema is automatically retrieved from the self-describing source data. When BigQuery retrieves the schema from the source data, the alphabetically last file is used.

For example, suppose you have the following Parquet files in Cloud Storage:

gs://mybucket/00/
  a.parquet
  z.parquet
gs://mybucket/01/
  b.parquet

The following CLI command loads all of the files in a single command (as a comma-separated list), and the schema is derived from mybucket/01/b.parquet:

bq load --source_format=PARQUET [DATASET].[TABLE] "gs://mybucket/00/*.parquet","gs://mybucket/01/*.parquet"

When you load multiple Parquet files that have different schemas, identical columns specified in multiple schemas must have the same mode in each schema definition.

When BigQuery detects the schema, some Parquet data types are converted to BigQuery data types to make them compatible with BigQuery SQL syntax. For more information, see Parquet conversions.

Parquet compression

Compressed Parquet files are not supported, but compressed data blocks are. BigQuery supports Snappy and GZip codecs.
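For example, the following sketch (not part of BigQuery; it assumes the pyarrow library and a local file path) produces a Parquet file whose column data blocks are Snappy-compressed. The file itself is not gzip- or snappy-wrapped, which is the layout BigQuery expects:

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Build a small in-memory table.
    table = pa.table({
        "name": ["alice", "bob"],
        "score": [95, 87],
    })

    # Compress the data blocks (pages) inside the file with Snappy;
    # the file as a whole remains an ordinary, uncompressed .parquet file.
    pq.write_table(table, "mydata.parquet", compression="snappy")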

Loading Parquet data into a new table

To load Parquet data from Google Cloud Storage into a new BigQuery table:

Web UI

  1. Go to the BigQuery web UI.

  2. In the navigation panel, hover over a dataset, click the down arrow icon, and click Create new table. The process for loading data is the same as the process for creating an empty table.

  3. On the Create Table page, in the Source Data section:

    • For Location, select Google Cloud Storage and in the source field, enter the Cloud Storage URI. Note that you cannot include multiple URIs in the BigQuery web UI, but wildcards are supported.
    • For File format, select Parquet.
  4. On the Create Table page, in the Destination Table section:
    • For Table name, choose the appropriate dataset, and in the table name field, enter the name of the table you're creating in BigQuery.
    • Verify that Table type is set to Native table.
  5. In the Schema section, no action is necessary. The schema is self-described in Parquet files.
  6. Click Create Table.
Command-line

Use the bq load command, specify PARQUET as the source_format, and include a Cloud Storage URI. You can include a single URI, a comma-separated list of URIs, or a URI containing a wildcard.

bq load --source_format=[FORMAT] [DATASET].[TABLE] [PATH_TO_SOURCE]

where:

  • [FORMAT] is PARQUET.
  • [DATASET] is an existing dataset.
  • [TABLE] is the name of the table into which you're loading data.
  • [PATH_TO_SOURCE] is a fully-qualified Cloud Storage URI or a comma-separated list of URIs. Wildcards are also supported.

Examples:

  • The following command loads data from gs://mybucket/mydata.parquet into a table named mytable in mydataset:

    bq load --source_format=PARQUET mydataset.mytable gs://mybucket/mydata.parquet
  • The following command loads data from multiple files in gs://mybucket/ into a table named mytable in mydataset. The Cloud Storage URI uses a wildcard:

    bq load --source_format=PARQUET mydataset.mytable gs://mybucket/mydata*.parquet
  • The following command loads data from multiple files in gs://mybucket/ into a table named mytable in mydataset. The command includes a comma-separated list of Cloud Storage URIs with wildcards:

    bq load --source_format=PARQUET mydataset.mytable "gs://mybucket/00/*.parquet","gs://mybucket/01/*.parquet"

API

Set the following properties to load Parquet data using the API.

  1. Create a load job that points to the source data in Google Cloud Storage.

  2. The source URIs must be fully-qualified, in the format gs://[BUCKET]/[OBJECT]. Each URI can contain one '*' wildcard character.

  3. Specify the Parquet data format by setting the configuration.load.sourceFormat property to PARQUET.

  4. To check the job status, call jobs.get([JOB_ID]), where [JOB_ID] is the ID of the job returned by the initial request.

    • If status.state = DONE, the job completed successfully.
    • If the status.errorResult property is present, the request failed, and that object will include information describing what went wrong. When a request fails, no table is created and no data is added.
    • If status.errorResult is absent, the job finished successfully, although there might have been some non-fatal errors, such as problems importing a few rows. Non-fatal errors are listed in the returned job object's status.errors property.

API notes:

  • Load jobs are atomic and consistent; if a load job fails, none of the data is available, and if a load job succeeds, all of the data is available.

  • As a best practice, generate a unique ID and pass it as jobReference.jobId when calling jobs.insert() to create a load job. This approach is more robust to network failure because the client can poll or retry on the known job ID.

  • Calling jobs.insert() on a given job ID is idempotent; in other words, you can retry as many times as you like on the same job ID, and at most one of those operations will succeed.
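For illustration, here is a minimal sketch using the google-cloud-bigquery Python client library (an assumption; the dataset, table, and bucket names are placeholders). It maps the properties above onto the library's LoadJobConfig and passes an explicit job ID so the request can be retried safely:

    import uuid

    from google.cloud import bigquery

    client = bigquery.Client()

    # configuration.load.sourceFormat = PARQUET
    job_config = bigquery.LoadJobConfig()
    job_config.source_format = bigquery.SourceFormat.PARQUET

    # Generate a unique job ID up front so a retry reuses the same job.
    job_id = "load_parquet_{}".format(uuid.uuid4())

    load_job = client.load_table_from_uri(
        "gs://mybucket/mydata.parquet",   # fully-qualified source URI
        "mydataset.mytable",              # destination table
        job_id=job_id,
        job_config=job_config,
    )

    load_job.result()        # polls the job; raises if status.errorResult is set
    print(load_job.state)    # "DONE" on success
    print(load_job.errors)   # non-fatal errors, if any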

Appending to or overwriting a table with Parquet data

You can load additional data into a table either from source files or by appending query results.

In the BigQuery web UI, you use the Write preference option to specify what action to take when you load data from a source file or from a query result. The CLI and API include the following options:

Web UI option | CLI flag | BigQuery API property | Description
Write if empty | None | WRITE_EMPTY | Writes the data only if the table is empty.
Append to table | --noreplace or --replace=false; if --[no]replace is unspecified, the default is append | WRITE_APPEND | (Default) Appends the data to the end of the table.
Overwrite table | --replace or --replace=true | WRITE_TRUNCATE | Erases all existing data in a table before writing the new data.

To load Parquet data from Google Cloud Storage and to append to or overwrite a BigQuery table:

Web UI

  1. Go to the BigQuery web UI.

  2. In the navigation panel, hover over a dataset, click the down arrow icon, and click Create new table. The process for loading data is the same as the process for creating an empty table.

  3. On the Create Table page, in the Source Data section:

    • For Location, select Google Cloud Storage and in the source field, enter the Cloud Storage URI. Note that you cannot include multiple URIs in the UI, but wildcards are supported.
    • For File format, select Parquet.
  4. On the Create Table page, in the Destination Table section:
    • For Table name, choose the appropriate dataset, and in the table name field, enter the name of the table you're appending or overwriting.
    • Verify that Table type is set to Native table.
  5. In the Schema section, no action is necessary. Schema information is self-described in Parquet files.
  6. In the Options section, for Write preference, choose Write if empty, Append to table, or Overwrite table.

  7. Click Create Table.

Command-line

Enter the bq load command with the --replace flag to overwrite the table. Use the --noreplace flag to append data to the table. If no flag is specified, the default is to append data.

bq load --[no]replace --source_format=PARQUET [DATASET].[TABLE] [PATH_TO_SOURCE]

where:

  • [DATASET] is an existing dataset.
  • [TABLE] is the name of the table into which you're loading data.
  • [PATH_TO_SOURCE] is a fully-qualified Cloud Storage URI or a comma-separated list of URIs. Wildcards are also supported.

Examples:

  • The following command loads data from gs://mybucket/mydata.parquet and overwrites a table named mytable in mydataset:

    bq load --replace --source_format=PARQUET mydataset.mytable gs://mybucket/mydata.parquet
  • The following command loads data from gs://mybucket/mydata.parquet and appends data to a table named mytable in mydataset:

    bq load --noreplace --source_format=PARQUET mydataset.mytable gs://mybucket/mydata.parquet

API

Set the following properties to load Parquet data using the API.

  1. Create a load job that points to the source data in Google Cloud Storage.

  2. The source URIs must be fully-qualified, in the format gs://[BUCKET]/[OBJECT]. You can include multiple URIs as a comma-separated list. Note that wildcards are also supported when loading Parquet data from Google Cloud Storage.

  3. Specify the data format by setting the configuration.load.sourceFormat property to PARQUET.

  4. Specify the write preference by setting the configuration.load.writeDisposition property to WRITE_TRUNCATE, WRITE_APPEND, or WRITE_EMPTY.
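As an illustration, the same google-cloud-bigquery Python sketch as above (the library and all names are assumptions, not part of the API steps) can overwrite or append by setting the write disposition:

    from google.cloud import bigquery

    client = bigquery.Client()

    job_config = bigquery.LoadJobConfig()
    job_config.source_format = bigquery.SourceFormat.PARQUET                  # configuration.load.sourceFormat
    job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE   # or WRITE_APPEND / WRITE_EMPTY

    load_job = client.load_table_from_uri(
        "gs://mybucket/mydata.parquet",
        "mydataset.mytable",
        job_config=job_config,
    )
    load_job.result()   # the overwrite (or append) completes here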

Parquet conversions

BigQuery converts Parquet data types to the following BigQuery data types:

Type conversions

Parquet type | Parquet converted type(s) | BigQuery data type
BOOLEAN | NONE | Boolean
INT32 | NONE, UINT_8, UINT_16, UINT_32, INT_8, INT_16, INT_32 | Integer
INT32 | DATE | Date
INT64 | NONE, UINT_64, INT_64 | Integer
INT64 | TIMESTAMP_MILLIS | Timestamp
INT64 | TIMESTAMP_MICROS | Timestamp
INT96 | NONE | Timestamp
FLOAT | NONE | Floating point
DOUBLE | NONE | Floating point
BYTE_ARRAY | NONE | Bytes
BYTE_ARRAY | UTF8 | String
FIXED_LEN_BYTE_ARRAY | NONE | Bytes

Other combinations of Parquet types and converted types are not supported.
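For example, a Parquet file written with the following pyarrow sketch (pyarrow and the column names are assumptions used only for illustration) would load with the BigQuery types noted in the comments, per the table above:

    import datetime

    import pyarrow as pa
    import pyarrow.parquet as pq

    schema = pa.schema([
        ("active", pa.bool_()),           # BOOLEAN                  -> Boolean
        ("count", pa.int32()),            # INT32                    -> Integer
        ("day", pa.date32()),             # INT32 / DATE             -> Date
        ("updated", pa.timestamp("us")),  # INT64 / TIMESTAMP_MICROS -> Timestamp
        ("label", pa.string()),           # BYTE_ARRAY / UTF8        -> String
        ("payload", pa.binary()),         # BYTE_ARRAY               -> Bytes
    ])

    table = pa.table(
        {
            "active": [True],
            "count": [1],
            "day": [datetime.date(2018, 1, 1)],
            "updated": [datetime.datetime(2018, 1, 1, 12, 0, 0)],
            "label": ["a"],
            "payload": [b"\x00"],
        },
        schema=schema,
    )
    pq.write_table(table, "typed.parquet")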

Column name conversions

A column name must contain only letters (a-z, A-Z), numbers (0-9), or underscores (_), and it must start with a letter or underscore. The maximum column name length is 128 characters. A column name cannot use any of the following prefixes:

  • _TABLE_
  • _FILE_
  • _PARTITION

Duplicate column names are not allowed even if the case differs. For example, a column named Column1 is considered identical to a column named column1.

Currently, you cannot load Parquet files that contain columns with a period (.) in the column name.

If a Parquet column name contains other unsupported characters (aside from a period), the characters are replaced with underscores. Trailing underscores may be added to column names to avoid collisions. For example, if a Parquet file contains two columns, Column1 and column1, the columns are loaded as Column1 and column1_, respectively.
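The following hypothetical helper (an illustration, not part of BigQuery) mimics that replacement and collision behavior for a list of column names; it does not handle the period restriction described above:

    import re

    def bigquery_safe_names(names):
        # Replace unsupported characters with underscores, then append
        # trailing underscores to resolve case-insensitive collisions.
        seen = set()
        result = []
        for name in names:
            cleaned = re.sub(r"[^A-Za-z0-9_]", "_", name)
            while cleaned.lower() in seen:
                cleaned += "_"
            seen.add(cleaned.lower())
            result.append(cleaned)
        return result

    print(bigquery_safe_names(["Column1", "column1", "col 2"]))
    # ['Column1', 'column1_', 'col_2']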
