Google BigQuery plugin for tumugi

tumugi-plugin-bigquery is a plugin that integrates Google BigQuery with tumugi.

Installation

Add this line to your application's Gemfile:

gem 'tumugi-plugin-bigquery'

And then execute bundle install.
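
For example:

bundle install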

Target

Tumugi::Plugin::BigqueryDatasetTarget

Tumugi::Plugin::BigqueryDatasetTarget is a target for a BigQuery dataset.

Parameters

| name | type | required? | default | description |
| --- | --- | --- | --- | --- |
| dataset_id | string | required | | Dataset ID |
| project_id | string | optional | | Project ID |

Examples

task :task1 do
  output target(:bigquery_dataset, dataset_id: "your_dataset_id")
end

task :task1 do
  output target(:bigquery_dataset, project_id: "project_id", dataset_id: "dataset_id")
end

Tumugi::Plugin::BigqueryTableTarget

Tumugi::Plugin::BigqueryTableTarget is a target for a BigQuery table.

Parameters

| name | type | required? | default | description |
| --- | --- | --- | --- | --- |
| table_id | string | required | | Table ID |
| dataset_id | string | required | | Dataset ID |
| project_id | string | optional | | Project ID |

Examples

task :task1 do
  output target(:bigquery_table, table_id: "table_id", dataset_id: "your_dataset_id")
end

Task

Tumugi::Plugin::BigqueryDatasetTask

Tumugi::Plugin::BigqueryDatasetTask is a task to create a BigQuery dataset.

Parameters

| name | type | required? | default | description |
| --- | --- | --- | --- | --- |
| dataset_id | string | required | | Dataset ID |
| project_id | string | optional | | Project ID |

Examples

task :task1, type: :bigquery_dataset do
  dataset_id 'test'
end

Tumugi::Plugin::BigqueryQueryTask

Tumugi::Plugin::BigqueryQueryTask is a task that runs a query and saves the result into the table specified by its parameters.

Parameters

| name | type | required? | default | description |
| --- | --- | --- | --- | --- |
| query | string | required | | query to execute |
| table_id | string | required | | destination table ID |
| dataset_id | string | required | | destination dataset ID |
| project_id | string | optional | | destination project ID |
| mode | string | optional | "truncate" | specifies the action that occurs if the destination table already exists. see mode under Common parameter values |
| flatten_results | bool | optional | true | whether BigQuery automatically flattens table data when you query nested data |
| use_legacy_sql | bool | optional | true | whether to use legacy SQL syntax for BigQuery |
| wait | integer | optional | 60 | wait time (seconds) for query execution |

Examples

truncate mode (default)
task :task1, type: :bigquery_query do
  query      "SELECT COUNT(*) AS cnt FROM [bigquery-public-data:samples.wikipedia]"
  table_id   "dest_table#{Time.now.to_i}"
  dataset_id "test"
end
append mode

If you set mode to "append", the query result is appended to the existing table.

task :task1, type: :bigquery_query do
  query      "SELECT COUNT(*) AS cnt FROM [bigquery-public-data:samples.wikipedia]"
  table_id   "dest_table#{Time.now.to_i}"
  dataset_id "test"
  mode       "append"
end
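empty mode

If you set mode to "empty", the job fails with a 'duplicate' error when the destination table already exists and contains data (see mode under Common parameter values). A minimal sketch:

task :task1, type: :bigquery_query do
  query      "SELECT COUNT(*) AS cnt FROM [bigquery-public-data:samples.wikipedia]"
  table_id   "dest_table#{Time.now.to_i}"
  dataset_id "test"
  mode       "empty"
end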

Tumugi::Plugin::BigqueryCopyTask

Tumugi::Plugin::BigqueryCopyTask is a task to copy the table specified by its parameters.

Parameters

| name | type | required? | default | description |
| --- | --- | --- | --- | --- |
| src_table_id | string | required | | source table ID |
| src_dataset_id | string | required | | source dataset ID |
| src_project_id | string | optional | | source project ID |
| dest_table_id | string | required | | destination table ID |
| dest_dataset_id | string | required | | destination dataset ID |
| dest_project_id | string | optional | | destination project ID |
| force_copy | bool | optional | false | whether to copy even when the destination table already exists |
| wait | integer | optional | 60 | wait time (seconds) for copy execution |

Examples

Copy test.src_table to test.dest_table.

Normal use case
task :task1, type: :bigquery_copy do
  src_table_id    "src_table"
  src_dataset_id  "test"
  dest_table_id   "dest_table"
  dest_dataset_id "test"
end
force_copy

If force_copy is true, the copy operation always executes even if the destination table exists. This means the data in the destination table is deleted, so be careful when enabling this parameter.

task :task1, type: :bigquery_copy do
  src_table_id    "src_table"
  src_dataset_id  "test"
  dest_table_id   "dest_table"
  dest_dataset_id "test"
  force_copy      true
end
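Copy across projects

src_project_id and dest_project_id let you copy a table between projects. A sketch, using hypothetical project IDs src_project and dest_project:

task :task1, type: :bigquery_copy do
  src_project_id  "src_project"
  src_dataset_id  "test"
  src_table_id    "src_table"
  dest_project_id "dest_project"
  dest_dataset_id "test"
  dest_table_id   "dest_table"
end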

Tumugi::Plugin::BigqueryLoadTask

Tumugi::Plugin::BigqueryLoadTask is a task to load structured data from Google Cloud Storage (GCS) into BigQuery.

Parameters

| name | type | required? | default | description |
| --- | --- | --- | --- | --- |
| bucket | string | required | | source GCS bucket name |
| key | string | required | | source path of the file, like "/path/to/file.csv" |
| table_id | string | required | | destination table ID |
| dataset_id | string | required | | destination dataset ID |
| project_id | string | optional | | destination project ID |
| schema | array of object | required when mode is not "append" | | see schema under Common parameter values |
| mode | string | optional | "append" | specifies the action that occurs if the destination table already exists. see mode under Common parameter values |
| source_format | string | optional | "CSV" | source file format. see format under Common parameter values |
| ignore_unknown_values | bool | optional | false | indicates if BigQuery should allow extra values that are not represented in the table schema |
| max_bad_records | integer | optional | 0 | maximum number of bad records that BigQuery can ignore when running the job |
| field_delimiter | string | optional | "," | separator for fields in a CSV file. used only when source_format is "CSV" |
| allow_jagged_rows | bool | optional | false | accept rows that are missing trailing optional columns; the missing values are treated as null. used only when source_format is "CSV" |
| allow_quoted_newlines | bool | optional | false | indicates if BigQuery should allow quoted data sections that contain newline characters in a CSV file. used only when source_format is "CSV" |
| quote | string | optional | "\"" (double quote) | value that is used to quote data sections in a CSV file. used only when source_format is "CSV" |
| skip_leading_rows | integer | optional | 0 | number of rows at the top of a CSV file that BigQuery will skip when loading the data. used only when source_format is "CSV" |
| wait | integer | optional | 60 | wait time (seconds) for load execution |

Example

Load gs://test_bucket/load_data.csv into dest_project:dest_dataset.dest_table

task :task1, type: :bigquery_load do
  bucket     "test_bucket"
  key        "load_data.csv"
  table_id   "dest_table"
  dataset_id "dest_dataset"
  project_id "dest_project"
end
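Load with an explicit schema

When mode is not "append", the schema parameter is required. A sketch of a truncating load of a CSV with a header row; passing schema as an array of Ruby hashes mirroring the documented schema format is an assumption:

task :task1, type: :bigquery_load do
  bucket     "test_bucket"
  key        "load_data.csv"
  table_id   "dest_table"
  dataset_id "dest_dataset"
  mode       "truncate"
  skip_leading_rows 1 # skip the CSV header row
  # assumption: schema accepts the documented format as an array of hashes
  schema [
    { "name" => "column1", "type" => "string" },
    { "name" => "column2", "type" => "integer" }
  ]
end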

Tumugi::Plugin::BigqueryExportTask

Tumugi::Plugin::BigqueryExportTask is a task to export a BigQuery table.

Parameters

| name | type | required? | default | description |
| --- | --- | --- | --- | --- |
| project_id | string | optional | | source project ID |
| job_project_id | string | optional | same as project_id | project ID in which the export job runs |
| dataset_id | string | required | | source dataset ID |
| table_id | string | required | | source table ID |
| compression | string | optional | "NONE" | destination file compression. "NONE": no compression, "GZIP": compression by gzip |
| destination_format | string | optional | "CSV" | destination file format. see format under Common parameter values |
| field_delimiter | string | optional | "," | separator for fields in a CSV file. used only when destination_format is "CSV" |
| print_header | bool | optional | true | print a header row in a CSV file. used only when destination_format is "CSV" |
| page_size | integer | optional | 10000 | number of rows fetched in one request |
| wait | integer | optional | 60 | wait time (seconds) for export execution |

Examples

Export src_dataset.src_table to a local file data.csv

task :task1, type: :bigquery_export do
  table_id   "src_table"
  dataset_id "src_dataset"

  output target(:local_file, "data.csv")
end
Export src_dataset.src_table to Google Cloud Storage

You need the tumugi-plugin-google_cloud_storage gem.

task :task1, type: :bigquery_export do
  table_id   "src_table"
  dataset_id "src_dataset"

  output target(:google_cloud_storage_file, bucket: "bucket", key: "data.csv")
end
Export src_dataset.src_table to Google Drive

You need the tumugi-plugin-google_drive gem.

task :task1, type: :bigquery_export do
  table_id   "src_table"
  dataset_id "src_dataset"

  output target(:google_drive_file, name: "data.csv")
end
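Export src_dataset.src_table as gzipped newline-delimited JSON

destination_format and compression can be combined with any of the outputs above. A sketch writing to a hypothetical local file data.json.gz:

task :task1, type: :bigquery_export do
  table_id           "src_table"
  dataset_id         "src_dataset"
  destination_format "NEWLINE_DELIMITED_JSON"
  compression        "GZIP"

  output target(:local_file, "data.json.gz")
end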

Common parameter values

mode

| value | description |
| --- | --- |
| truncate | If the table already exists, BigQuery overwrites the table data. |
| append | If the table already exists, BigQuery appends the data to the table. |
| empty | If the table already exists and contains data, a 'duplicate' error is returned in the job result. |

format

| value | description |
| --- | --- |
| CSV | CSV |
| NEWLINE_DELIMITED_JSON | each line is a JSON object followed by a newline |
| AVRO | Avro format |

schema

The schema parameter is an array of (optionally nested) objects, like below:

[
  {
    "name": "column1",
    "type": "string"
  },
  {
    "name": "column2",
    "type": "integer",
    "mode": "repeated"
  },
  {
    "name": "record1",
    "type": "record",
    "fields": [
      {
        "name": "key1",
        "type": "integer",
      },
      {
        "name": "key2",
        "type": "integer"
      }
    ]
  }
]
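
If you keep the schema in a file, you can parse it and pass the result to a load task. A sketch, assuming a hypothetical schema.json containing the array above; that the schema parameter accepts the parsed array of hashes is an assumption:

require 'json'

task :task1, type: :bigquery_load do
  bucket     "test_bucket"
  key        "load_data.csv"
  table_id   "dest_table"
  dataset_id "dest_dataset"
  mode       "truncate"
  # assumption: schema accepts the parsed JSON schema (an array of hashes)
  schema JSON.parse(File.read("schema.json"))
end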

Config Section

tumugi-plugin-bigquery provides a config section named "bigquery" in which you can specify BigQuery authentication info.

Authenticate by client_email and private_key

Tumugi.configure do |config|
  config.section("bigquery") do |section|
    section.project_id = "xxx"
    section.client_email = "[email protected]"
    section.private_key = "zzz"
  end
end

Authenticate by JSON key file

Tumugi.configure do |config|
  config.section("bigquery") do |section|
    section.private_key_file = "/path/to/key.json"
  end
end
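
Authenticate by environment variables

Credentials can also come from the environment instead of being hard-coded. A sketch using the same variable names as the Development section below:

Tumugi.configure do |config|
  config.section("bigquery") do |section|
    section.project_id   = ENV['PROJECT_ID']
    section.client_email = ENV['CLIENT_EMAIL']
    section.private_key  = ENV['PRIVATE_KEY']
  end
end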

Development

After checking out the repo, run bin/setup to install dependencies. Then export your Google Cloud Platform service account key as follows:

export PROJECT_ID="xxx"
export CLIENT_EMAIL="[email protected]"
export PRIVATE_KEY="zzz"

Finally, run bundle exec rake test to run the tests.

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/tumugi/tumugi-plugin-bigquery.

License

The gem is available as open source under the terms of the Apache License Version 2.0.