# Google Cloud BigQuery extract file input plugin for Embulk

Development version.
## Overview
- Plugin type: file input
- Resume supported: no
- Cleanup supported: yes
## Detail
Reads files stored in Google Cloud Storage that were exported from a Google Cloud BigQuery table or query result. This can be a practical way to move very large BigQuery datasets.

If you set the table config without the query config, the plugin simply extracts that table to Google Cloud Storage.

If you set the query config, the query result is first saved to a temp table, and that temp table is then extracted to the Google Cloud Storage URI. See: https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.extract
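For example, a minimal table-only extraction (no query, so no temp table) might look like the sketch below. The project, key file, dataset, table, bucket, and column names are all placeholders:

```yaml
in:
  type: bigquery_extract_files
  project: my-gcp-project                # placeholder GCP project id
  json_keyfile: /path/to/keyfile.json    # placeholder service account key
  dataset: my_dataset                    # placeholder dataset
  table: my_table                        # table mode: extract this table as-is
  gcs_uri: gs://my-bucket/export         # files are exported here, then downloaded
  temp_local_path: /tmp/embulk           # local staging directory
  decoders:
  - {type: gzip}                         # matches the default compression: GZIP
  parser:
    type: csv                            # matches the default file_format: CSV
    columns:                             # hypothetical schema
    - {name: id, type: long}
    - {name: name, type: string}
out:
  type: stdout
```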
## Usage

### Install plugin

```sh
embulk gem install embulk-input-bigquery_extract_files
```

- RubyGems: https://rubygems.org/profiles/jo8937
## Configuration
- project: Google Cloud Platform (GCP) project id (string, required)
- json_keyfile: GCP service account's private key as a JSON file (string, required)
- gcs_uri: GCS URI where the BigQuery result is saved. The bucket and path names are parsed from this URI. (string, required)
- temp_local_path: local directory into which the extracted files are downloaded (string, required)
- dataset: target datasource dataset (string, default: null)
- table: target datasource table. Either query or table is required. (string, default: null)
- query: target datasource query. Either query or table is required. (string, default: null)
- temp_dataset: if you use the query param, the query result is saved in this dataset (string, default: null)
- temp_table: if you use the query param, the query result is saved in this table. If not set, the plugin generates a temp name. (string, default: null)
- use_legacy_sql: if you use the query param, see: https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.query.useLegacySql (string, default: false)
- cache: if you use the query param, see: https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.query.useQueryCache (string, default: true)
- create_disposition: if you use the query param, see: https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.query.createDisposition (string, default: CREATE_IF_NEEDED)
- write_disposition: if you use the query param, see: https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.query.writeDisposition (string, default: WRITE_APPEND)
- file_format: table extract file format. See: https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.extract.destinationFormat (string, default: CSV)
- compression: table extract file compression setting. See: https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.extract.compression (string, default: GZIP)
- temp_schema_file_path: BigQuery result schema file for the parser (optional) (string, default: null)
- temp_schema_file_type: schema file type; the default is Embulk's Schema object (optional) (string, default: null)
- decoders: standard option of Embulk file input plugins. See: http://www.embulk.org/docs/built-in.html#gzip-decoder-plugin
- parser: standard option of Embulk file input plugins. See: http://www.embulk.org/docs/built-in.html#csv-parser-plugin
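When query is set, the plugin runs the query into a temp table before extracting it. The fragment below sketches just the query-related keys (all values are placeholders):

```yaml
query: 'select a, b from my_table'   # hypothetical query
temp_dataset: my_temp_dataset        # the temp table is created in this dataset
temp_table: my_temp_table            # optional; an auto-generated name is used if omitted
cache: true                          # allow BigQuery's query cache
write_disposition: WRITE_APPEND      # how query results are written to the temp table
```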
## Example

```yaml
in:
  type: bigquery_extract_files
  project: googlecloudplatformproject
  json_keyfile: gcp-service-account-private-key.json
  dataset: target_dataset
  #table: target_table
  query: 'select a,b,c from target_table'
  gcs_uri: gs://bucket/subdir
  temp_dataset: temp_dataset
  temp_local_path: C:\Temp
  file_format: 'NEWLINE_DELIMITED_JSON'
  compression: 'GZIP'
  decoders:
  - {type: gzip}
  parser:
    type: json
out:
  type: stdout
```
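Assuming the configuration above is saved as config.yml (the filename is arbitrary), you can check and run it with the standard Embulk commands:

```sh
embulk preview config.yml   # dry-run the input and parser
embulk run config.yml       # execute the full pipeline
```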
## Build

```sh
$ ./gradlew gem # -t to watch change of files and rebuild continuously
```
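To try a locally built gem, you can install it from the build output; the pkg/ path below assumes the usual Gradle Embulk plugin layout, and the version number is a placeholder:

```sh
$ embulk gem install --local pkg/embulk-input-bigquery_extract_files-0.1.0.gem
```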