# embulk-output-bigquery

Embulk output plugin to load/insert data into Google BigQuery via GCS (Google Cloud Storage).
## Overview

Loads data into Google BigQuery as batch load jobs via GCS, intended for large amounts of data: https://developers.google.com/bigquery/loading-data-into-bigquery
- Plugin type: output
- Resume supported: no
- Cleanup supported: no
- Dynamic table creation: todo
### Not implemented

- Inserting data via streaming inserts
  - For continuous, real-time insertion, please use another product such as fluent-plugin-bigquery.
  - See https://developers.google.com/bigquery/streaming-data-into-bigquery#usecases
The current version of this plugin supports the Google API with Service Account authentication, but does not support the OAuth flow for installed applications.
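In practice, authentication therefore involves only the two service-account settings shown below (a minimal sketch; both values are placeholders):

```yaml
out:
  type: bigquery
  service_account_email: ABCXYZ123ABCXYZ123.gserviceaccount.com  # placeholder
  p12_keyfile_path: /path/to/p12_keyfile.p12                     # placeholder
```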
## Configuration
- service_account_email: your Google service account email (string, required)
- p12_keyfile_path: full path of the private key file in P12 (PKCS12) format (string, required)
- path_prefix: path prefix of the local output files to upload (string, required)
- sequence_format: printf-style format of the sequence number appended to output file names (string, optional, default is %03d.%02d; see the sketch after this list)
- file_ext: extension of the local output files, e.g. csv.gz (string, required)
- source_format: file type, NEWLINE_DELIMITED_JSON or CSV (string, optional, default is CSV)
- is_file_compressed: whether the files to upload are gzip-compressed (boolean, optional, default is 1)
- bucket: Google Cloud Storage output bucket name (string, required)
- remote_path: folder name in the GCS bucket (string, optional)
- project: project_id (string, required)
- dataset: dataset name (string, required)
- table: table name (string, required)
- application_name: any application name you like (string, optional)
- delete_from_local_when_upload_end: delete each local file once it has been uploaded to GCS (boolean, optional, default is 0)
- delete_from_bucket_when_job_end: delete the uploaded files from the GCS bucket once the load job finishes (boolean, optional, default is 0)
- job_status_max_polling_time: maximum time to poll for load job status, in seconds (int, optional, default is 3600)
- job_status_polling_interval: interval between load job status polls, in seconds (int, optional, default is 10)
- is_skip_job_result_check: skip checking the load job result (boolean, optional, default is 0)
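For instance, the upload and polling behavior can be tuned as in the following sketch. The parameter names are taken from the list above; the values are illustrative placeholders, not recommendations, and the required settings are omitted (see the Example section below):

```yaml
out:
  type: bigquery
  # required auth/destination settings omitted; see the Example section
  sequence_format: '%03d.%02d'           # printf-style numbering of output files
  delete_from_local_when_upload_end: 1   # clean up local files once uploaded to GCS
  delete_from_bucket_when_job_end: 1     # clean up GCS objects once the load job finishes
  job_status_max_polling_time: 1800      # give up polling after 30 minutes
  job_status_polling_interval: 30        # poll the load job status every 30 seconds
```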
## Support for the Google BigQuery quota policy

embulk-output-bigquery supports the following Google BigQuery quota policies.
### Supported

- Maximum size per load job: 1TB across all input files
- Maximum number of files per load job: 10,000
- When the input would exceed these limits, embulk-output-bigquery divides the files among more than one load job, like below:
  - job1: file1 (1GB), file2 (1GB), ..., file10 (1GB)
  - job2: file11 (1GB), file12 (1GB)
### Not supported

- Daily limit: 1,000 load jobs per table per day (including failures)
- Daily limit: 10,000 load jobs per project per day (including failures)
## Example

```yaml
out:
  type: bigquery
  service_account_email: ABCXYZ123ABCXYZ123.gserviceaccount.com
  p12_keyfile_path: /path/to/p12_keyfile.p12
  path_prefix: /path/to/output
  file_ext: csv.gz
  source_format: CSV
  is_file_compressed: 1
  project: your-project-000
  bucket: output_bucket_name
  remote_path: folder_name
  dataset: your_dataset_name
  table: your_table_name
  formatter:
    type: csv
    header_line: false
  encoders:
  - {type: gzip}
```
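If you prefer to upload newline-delimited JSON instead of CSV, a configuration along the following lines should work. This is a sketch only: it assumes a JSON-lines formatter plugin (such as embulk-formatter-jsonl, which is not part of this plugin) is installed:

```yaml
out:
  type: bigquery
  service_account_email: ABCXYZ123ABCXYZ123.gserviceaccount.com
  p12_keyfile_path: /path/to/p12_keyfile.p12
  path_prefix: /path/to/output
  file_ext: json.gz
  source_format: NEWLINE_DELIMITED_JSON
  is_file_compressed: 1
  project: your-project-000
  bucket: output_bucket_name
  dataset: your_dataset_name
  table: your_table_name
  formatter:
    type: jsonl        # assumes a separately installed JSON-lines formatter plugin
  encoders:
  - {type: gzip}
```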
## Build

```
$ ./gradlew gem
```
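The built gem can then be installed into your Embulk setup (for example with `gem install` pointing at the generated gem file) and used with `type: bigquery` in the `out` section of your configuration.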