# S3 Parquet output plugin for Embulk

Embulk output plugin to dump records as Apache Parquet files on S3.
## Overview
- Plugin type: output
- Load all or nothing: no
- Resume supported: no
- Cleanup supported: yes
## Configuration
- bucket: S3 bucket name (string, required)
- path_prefix: prefix of target keys (string, optional)
- sequence_format: format of the sequence number of the output files (string, default: "%03d.%02d."). sequence_format formats the task index and the sequence number in a task.
- file_ext: path suffix of the output files (string, default: "parquet")
- compression_codec: compression codec for the Parquet file ("uncompressed", "snappy", "gzip", "lzo", "brotli", "lz4" or "zstd", default: "uncompressed")
- default_timestamp_format: default timestamp format (string, default: "%Y-%m-%d %H:%M:%S.%6N %z")
- default_timezone: default timezone (string, default: "UTC")
- column_options: a map whose keys are column names and whose values are configurations with the following parameters (optional)
  - timezone: timezone if the type of this column is timestamp. If not set, default_timezone is used. (string, optional)
  - format: timestamp format if the type of this column is timestamp. If not set, default_timestamp_format is used. (string, optional)
- canned_acl: grants one of the canned ACLs to created objects (string, default: "private")
- block_size: the block size is the size of a row group being buffered in memory. This limits the memory usage when writing. Larger values improve the I/O when reading but consume more memory when writing. (int, default: 134217728 (128MB))
- page_size: the page size is for compression. When reading, each page can be decompressed independently. A block is composed of pages. The page is the smallest unit that must be read fully to access a single record. If this value is too small, the compression deteriorates. (int, default: 1048576 (1MB))
- max_padding_size: the max size (bytes) to write as padding and the min size of a row group (int, default: 8388608 (8MB))
- enable_dictionary_encoding: enables/disables dictionary encoding (boolean, default: true)
- auth_method: name of the mechanism to authenticate requests ("basic", "env", "instance", "profile", "properties", "anonymous" or "session", default: "default")
  - "basic": uses access_key_id and secret_access_key to authenticate.
  - "env": uses the AWS_ACCESS_KEY_ID (or AWS_ACCESS_KEY) and AWS_SECRET_KEY (or AWS_SECRET_ACCESS_KEY) environment variables.
  - "instance": uses the EC2 instance profile or an attached ECS task role.
  - "profile": uses credentials written in a file. The format of the file is as follows, where [...] is the name of a profile.

    ```
    [default]
    aws_access_key_id=YOUR_ACCESS_KEY_ID
    aws_secret_access_key=YOUR_SECRET_ACCESS_KEY

    [profile2]
    ...
    ```

  - "properties": uses the aws.accessKeyId and aws.secretKey Java system properties.
  - "anonymous": uses anonymous access. This auth method can access only public files.
  - "session": uses temporary-generated access_key_id, secret_access_key and session_token.
  - "assume_role": uses temporary-generated credentials by assuming the role_arn role.
  - "default": uses the AWS SDK's default strategy to look up available credentials from the runtime environment. This method behaves like the combination of the following methods.
    1. "env"
    1. "properties"
    1. "profile"
    1. "instance"
- profile_file: path to a profiles file. This is optionally used when auth_method is "profile". (string, default: given by the AWS_CREDENTIAL_PROFILES_FILE environment variable, or ~/.aws/credentials)
- profile_name: name of a profile. This is optionally used when auth_method is "profile". (string, default: "default")
- access_key_id: AWS access key id. This is required when auth_method is "basic" or "session". (string, optional)
- secret_access_key: AWS secret access key. This is required when auth_method is "basic" or "session". (string, optional)
- session_token: AWS session token. This is required when auth_method is "session". (string, optional)
- role_arn: ARN of the role to assume. This is required when auth_method is "assume_role". (string, optional)
- role_session_name: an identifier for the assumed role session. This is required when auth_method is "assume_role". (string, optional)
- role_external_id: a unique identifier that is used by third parties when assuming roles in their customers' accounts. This is optionally used when auth_method is "assume_role". (string, optional)
- role_session_duration_seconds: duration, in seconds, of the role session. This is optionally used when auth_method is "assume_role". (int, optional)
- scope_down_policy: an IAM policy in JSON format. This is optionally used when auth_method is "assume_role". (string, optional)
- catalog: registers a table if this option is specified (optional)
  - catalog_id: Glue Data Catalog id if you use a catalog different from the account/region default catalog. (string, optional)
  - database: the name of the database (string, required)
  - table: the name of the table (string, required)
  - column_options: key-value pairs where a key is a column name and a value is options for the column. (string to options map, default: {})
    - type: type of a column when this plugin creates new tables (e.g. STRING, BIGINT) (string, default: depends on the input column type: BIGINT if the input column type is long, BOOLEAN if boolean, DOUBLE if double, STRING if string, STRING if timestamp, STRING if json)
  - operation_if_exists: operation if the table already exists. Available operations are "delete" and "skip". (string, default: "delete")
- endpoint: the AWS service endpoint (string, optional)
- region: the AWS region (string, optional)
- http_proxy: settings for accessing AWS via an HTTP proxy (optional)
  - host: proxy host (string, required)
  - port: proxy port (int, optional)
  - protocol: proxy protocol (string, default: "https")
  - user: proxy user (string, optional)
  - password: proxy password (string, optional)
- buffer_dir: buffer directory for Parquet files to be uploaded to S3 (string, default: creates a temporary directory)
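To illustrate how the path-related options fit together, here is a small sketch (not the plugin's actual code) of how an output object key is composed from path_prefix, sequence_format, and file_ext, based on the descriptions above:

```python
# Illustrative sketch of how output object keys are composed from the
# path_prefix, sequence_format and file_ext options. This mirrors the
# documented behavior; it is not the plugin's actual implementation.
path_prefix = "path/to/my-obj."
sequence_format = "%03d.%02d."  # formats the task index, then the sequence number in a task
file_ext = "parquet"

def object_key(task_index: int, sequence: int) -> str:
    """Compose an S3 object key for one output file."""
    return path_prefix + sequence_format % (task_index, sequence) + file_ext

print(object_key(0, 0))   # -> path/to/my-obj.000.00.parquet
print(object_key(12, 3))  # -> path/to/my-obj.012.03.parquet
```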
## Example

```yaml
out:
  type: s3_parquet
  bucket: my-bucket
  path_prefix: path/to/my-obj.
  file_ext: snappy.parquet
  compression_codec: snappy
  default_timezone: Asia/Tokyo
  canned_acl: bucket-owner-full-control
```
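A fuller configuration might also set per-column timestamp options, assume a role, and register the table in the Glue Data Catalog. The following is a sketch only; the column name, role ARN, database, and table names are hypothetical:

```yaml
out:
  type: s3_parquet
  bucket: my-bucket
  path_prefix: path/to/my-obj.
  file_ext: snappy.parquet
  compression_codec: snappy
  default_timezone: Asia/Tokyo
  canned_acl: bucket-owner-full-control
  column_options:
    created_at:            # hypothetical timestamp column
      timezone: UTC
      format: "%Y-%m-%dT%H:%M:%S.%6N%z"
  auth_method: assume_role
  role_arn: arn:aws:iam::123456789012:role/embulk-output  # hypothetical
  role_session_name: embulk-s3-parquet
  catalog:
    database: my_database  # hypothetical
    table: my_table        # hypothetical
    operation_if_exists: skip
```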
## Note

- The current implementation does not support LogicalTypes. I will implement them later as column_options. So, currently, timestamp and json types are stored as UTF-8 strings. Please be careful.
## Development

Run the example:

```shell
$ ./gradlew classpath
$ embulk run example/config.yml -Ilib
```

Run tests:

```shell
$ ./gradlew test
```

Build:

```shell
$ ./gradlew gem  # -t to watch change of files and rebuild continuously
```

Release the gem:

Fix build.gradle, then:

```shell
$ ./gradlew gemPush
```