BigQuery
Setting up the BigQuery destination connector involves choosing a data loading method and configuring the connector in the Calabi Connect UI.
This page guides you through setting up the BigQuery destination connector.
Prerequisites
- For Calabi Connect users using the Postgres source connector, upgrade your Calabi Connect platform to version v0.40.0-alpha or newer and upgrade your BigQuery connector to version 1.1.14 or newer.
- A BigQuery dataset to sync data to.
  Note: Queries written in BigQuery can only reference datasets in the same physical location. If you plan to combine the data that Calabi Connect syncs with data from other datasets in your queries, create the datasets in the same location on Google Cloud. For more information, read Introduction to Datasets.
- (Required for cloud; optional for open-source) A Google Cloud Service Account with the BigQuery User and BigQuery Data Editor roles, and the Service Account Key in JSON format.
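Before pasting a Service Account Key into the connector form, it can help to confirm the downloaded file is complete. The sketch below is a hypothetical pre-flight check, not part of Calabi Connect; the field names follow Google's standard service account key format.

```python
import json

# Fields present in every Google-issued service account key file.
REQUIRED_FIELDS = {"type", "project_id", "private_key_id", "private_key", "client_email"}

def validate_service_account_key(raw_json: str) -> list:
    """Return a list of problems; an empty list means the key looks complete."""
    key = json.loads(raw_json)
    problems = sorted(REQUIRED_FIELDS - key.keys())
    if key.get("type") != "service_account":
        problems.append("type must be 'service_account'")
    return problems

# Illustrative key contents (values are placeholders).
sample = json.dumps({
    "type": "service_account",
    "project_id": "my-project",
    "private_key_id": "abc123",
    "private_key": "-----BEGIN PRIVATE KEY-----...",
    "client_email": "sync@my-project.iam.gserviceaccount.com",
})
print(validate_service_account_key(sample))  # []
```

An empty result only confirms the file's shape; the roles listed above must still be granted in IAM.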
Setup guide
Step 1: Set up a data loading method
Using Batched Standard Inserts
You can use the BigQuery driver's built-in conversion to take INSERT statements and convert them into file uploads, which are then loaded into BigQuery in batches. This is the simplest way to load data into BigQuery performantly. These staging files are managed by BigQuery and deleted automatically after the load completes.
Using a Google Cloud Storage bucket
If you want more control of how and where your staging files are stored, you can opt to use a GCS bucket.
To use a Google Cloud Storage bucket:
- Create a Cloud Storage bucket with the Protection Tools set to none or Object versioning. Make sure the bucket does not have a retention policy.
- Create an HMAC key and access ID.
- Grant the Storage Object Admin role to the Google Cloud Service Account. This must be the same service account as the one you configure for BigQuery access in the BigQuery connector setup step.
- Make sure your Cloud Storage bucket is accessible from the machine running Calabi Connect. The easiest way to verify that Calabi Connect can connect to your bucket is via the check connection tool in the UI.
Your bucket must be encrypted using a Google-managed encryption key (this is the default setting
when creating a new bucket). We currently do not support buckets using customer-managed encryption
keys (CMEK). You can view this setting under the "Configuration" tab of your GCS bucket, in the
Encryption type row.
Step 2: Set up the BigQuery connector
- Log into your Calabi Connect account.
- Click Destinations and then click + New destination.
- On the Set up the destination page, select BigQuery from the Destination type dropdown.
- Enter the name for the BigQuery connector.
- For Project ID, enter your Google Cloud project ID.
- For Dataset Location, select the location of your BigQuery dataset.
You cannot change the location later.
- For Default Dataset ID, enter the BigQuery Dataset ID.
- For Loading Method, select Batched Standard Inserts or GCS Staging.
- For Service Account Key JSON (Required for cloud, optional for open-source), enter the Google Cloud Service Account Key in JSON format.
Be sure to copy all contents in the Account Key JSON file including the brackets.
- For Transformation Query Run Type (Optional), select interactive to have BigQuery run interactive query jobs or batch to have BigQuery run batch queries.
Interactive queries are executed as soon as possible and count towards daily concurrent quotas and limits, while batch queries are executed as soon as idle resources are available in the BigQuery shared resource pool. If BigQuery hasn't started the query within 24 hours, BigQuery changes the job priority to interactive. Batch queries don't count towards your concurrent rate limit, making it easier to start many queries at once.
- For Google BigQuery Client Chunk Size (Optional), use the default value of 15 MiB. Later, if you see networking or memory management problems with the sync (specifically on the destination), try decreasing the chunk size. In that case, the sync will be slower but more likely to succeed.
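As a rough illustration of the chunk-size trade-off (a sizing sketch, not the connector's internals): smaller chunks mean more upload requests per staging file, each of which is cheaper to retry on a flaky network.

```python
import math

MIB = 1024 * 1024

def upload_chunks(payload_bytes: int, chunk_mib: int = 15) -> int:
    """Number of chunked upload requests needed for a payload at a given chunk size."""
    return math.ceil(payload_bytes / (chunk_mib * MIB))

# A 300 MiB staging file: fewer, larger requests vs. more, smaller ones.
print(upload_chunks(300 * MIB, chunk_mib=15))  # 20
print(upload_chunks(300 * MIB, chunk_mib=5))   # 60
```

This is why decreasing the chunk size makes syncs slower but more resilient to per-request failures.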
Supported sync modes
The BigQuery destination connector supports the following sync modes:
| Sync mode | Supported? |
|---|---|
| Full Refresh - Overwrite | Yes |
| Full Refresh - Append | Yes |
| Full Refresh - Overwrite + Deduped | Yes |
| Incremental Sync - Append | Yes |
| Incremental Sync - Append + Deduped | Yes |
Output schema
The final table contains these fields, in addition to the columns declared in your stream schema:
- _airbyte_raw_id
- _airbyte_generation_id
- _airbyte_extracted_at
- _airbyte_meta
See [here] for more information about these fields.
The output tables in BigQuery are partitioned by the time-unit column _airbyte_extracted_at at a
daily granularity and clustered by _airbyte_extracted_at and the table's primary keys. Partition
boundaries are based on UTC time. This is useful to limit the number of partitions scanned when
querying these partitioned tables, by using a predicate filter (a WHERE clause). Filters on the
partitioning column are used to prune the partitions and reduce the query cost. (The parameter
Require partition filter is not enabled by Calabi Connect, but you may toggle it by updating the
produced tables.)
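To take advantage of partition pruning, filter on the partitioning column in your queries. The helper below is a hypothetical sketch (the table path is made up); it builds a query whose WHERE clause lets BigQuery scan only the matching daily partitions.

```python
from datetime import date

def pruned_query(table: str, start: date, end: date) -> str:
    """Build a SELECT whose WHERE clause filters on the partitioning column
    _airbyte_extracted_at, so only matching daily partitions are scanned."""
    return (
        f"SELECT * FROM `{table}` "
        f"WHERE _airbyte_extracted_at >= TIMESTAMP('{start.isoformat()}') "
        f"AND _airbyte_extracted_at < TIMESTAMP('{end.isoformat()}')"
    )

# Hypothetical table path: project.dataset.table
sql = pruned_query("my-project.my_dataset.users", date(2024, 1, 1), date(2024, 1, 8))
print(sql)
```

Filters computed at query time (for example, wrapped in another function call) may not prune partitions; literal or parameter-based bounds like these do.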
Legacy Raw Tables schema
If you enable the Legacy raw tables option, the connector will write tables in this format.
Calabi Connect outputs each stream into its own raw table in the airbyte_internal dataset by default (you can
override this via the Airbyte Internal Table Dataset Name option). Contents of the raw table are
NOT deduplicated.
The raw table contains these fields:
- _airbyte_raw_id
- _airbyte_generation_id
- _airbyte_extracted_at
- _airbyte_loaded_at
- _airbyte_meta
- _airbyte_data
_airbyte_data is a JSON blob with the record's data. See [here]
for more information about the other fields.
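An illustrative raw-table row might look like the following. Only the column names come from the schema above; all values here are made up for the example.

```python
import json
import uuid
from datetime import datetime, timezone

# Illustrative shape of one legacy raw-table row (values are placeholders).
raw_row = {
    "_airbyte_raw_id": str(uuid.uuid4()),
    "_airbyte_generation_id": 1,
    "_airbyte_extracted_at": datetime(2024, 1, 1, tzinfo=timezone.utc).isoformat(),
    "_airbyte_loaded_at": None,  # set once the record lands in the final table
    "_airbyte_meta": json.dumps({"changes": []}),
    "_airbyte_data": json.dumps({"id": 42, "email": "user@example.com"}),
}

# The original record payload is recovered by parsing the JSON blob:
print(json.loads(raw_row["_airbyte_data"])["id"])  # 42
```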
BigQuery Naming Conventions
Follow BigQuery Datasets Naming conventions.
Calabi Connect converts any invalid characters into _ characters when writing data. However, since
datasets that begin with _ are hidden on the BigQuery Explorer panel, Calabi Connect prepends the
namespace with n for converted namespaces.
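The conversion described above can be sketched as follows. This is an approximation of the behavior, not the connector's actual code:

```python
import re

def to_bigquery_namespace(namespace: str) -> str:
    """Approximate the namespace conversion: replace invalid characters with
    '_', then prepend 'n' when the result would start with '_', since datasets
    beginning with '_' are hidden in the BigQuery Explorer panel."""
    converted = re.sub(r"[^a-zA-Z0-9_]", "_", namespace)
    if converted.startswith("_"):
        converted = "n" + converted
    return converted

print(to_bigquery_namespace("my-schema"))   # my_schema
print(to_bigquery_namespace("-internal"))   # n_internal
```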
Data type map
| Calabi Connect type | BigQuery type |
|---|---|
| STRING | STRING |
| STRING (BASE64) | STRING |
| STRING (BIG_NUMBER) | STRING |
| STRING (BIG_INTEGER) | STRING |
| NUMBER | NUMERIC |
| INTEGER | INT64 |
| BOOLEAN | BOOL |
| STRING (TIMESTAMP_WITH_TIMEZONE) | TIMESTAMP |
| STRING (TIMESTAMP_WITHOUT_TIMEZONE) | DATETIME |
| STRING (TIME_WITH_TIMEZONE) | STRING |
| STRING (TIME_WITHOUT_TIMEZONE) | TIME |
| DATE | DATE |
| OBJECT | JSON |
| ARRAY | JSON |
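The mapping above can be expressed as a lookup table, which is handy when predicting the destination schema for a stream. This is a convenience sketch mirroring the table, not connector code:

```python
# Calabi Connect type -> BigQuery type, mirroring the data type map above.
TYPE_MAP = {
    "STRING": "STRING",
    "STRING (BASE64)": "STRING",
    "STRING (BIG_NUMBER)": "STRING",
    "STRING (BIG_INTEGER)": "STRING",
    "NUMBER": "NUMERIC",
    "INTEGER": "INT64",
    "BOOLEAN": "BOOL",
    "STRING (TIMESTAMP_WITH_TIMEZONE)": "TIMESTAMP",
    "STRING (TIMESTAMP_WITHOUT_TIMEZONE)": "DATETIME",
    "STRING (TIME_WITH_TIMEZONE)": "STRING",
    "STRING (TIME_WITHOUT_TIMEZONE)": "TIME",
    "DATE": "DATE",
    "OBJECT": "JSON",
    "ARRAY": "JSON",
}

print(TYPE_MAP["NUMBER"])  # NUMERIC
print(TYPE_MAP["OBJECT"])  # JSON
```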
Troubleshooting
Permission errors
The service account does not have the proper permissions.
- Make sure the BigQuery service account has the BigQuery User and BigQuery Data Editor roles, or permissions equivalent to those two roles.
- If the GCS staging mode is selected, ensure the BigQuery service account has the right permissions to the GCS bucket and path, or the Cloud Storage Admin role, which includes a superset of the required permissions.
The HMAC key is wrong.
- Make sure the HMAC key is created for the BigQuery service account, and the service account has permission to access the GCS bucket and path.
HTTP 400 "Request had invalid euc header" during upload
If your sync fails with BigQueryException: 400 Bad Request and the message
Request had invalid euc header:
- This error originates from the Google BigQuery resumable-upload API. It is usually transient.
- Retry the sync. In most cases the error resolves on the next attempt.
- If the error recurs on every sync, the underlying causes can be complex. Google does not publicly document this specific error. The following steps may help narrow the issue:
  - If you self-manage Calabi Connect, check whether a proxy, VPN, or firewall is modifying HTTP headers on requests to bigquery.googleapis.com.
  - Verify the service account key has not been rotated or revoked since the connection was configured.
- Try reducing the Google BigQuery Client Chunk Size from the default 15 MiB to a smaller value (for example, 5 MiB).
- Try reducing concurrent syncs to your BigQuery instance or table. Contention is a possible contributing factor.
Load job timeouts
If your sync fails with Fail to complete a load job in big query:
- BigQuery load jobs have a 30-minute wait timeout. Very large batches or high BigQuery queue contention can exceed this limit.
- Running concurrent syncs that load into the same BigQuery table is not supported and can also trigger this timeout. See Stream uniqueness for details.
- Try reducing the volume per sync by using incremental sync mode or reducing the number of streams per connection.
Tutorials
Now that you have set up the BigQuery destination connector, check out the following BigQuery tutorials:
- Export Google Analytics data to BigQuery
- Load data from Facebook Ads to BigQuery
- Replicate Salesforce data to BigQuery
- Partition and cluster BigQuery tables with Calabi Connect and dbt
Namespace support
This destination supports namespaces. The namespace maps to a BigQuery dataset.