Dynamodb input plugin for Embulk

Master CI Status Badge Test CI Status Badge

Overview

  • Plugin type: input
  • Load all or nothing: yes
  • Resume supported: no
  • Cleanup supported: no

Configuration

  • auth_method: name of mechanism to authenticate requests ("basic", "env", "instance", "profile", "properties", "anonymous", "session", "web_identity_token", default: "default")

    • "basic": uses access_key_id and secret_access_key to authenticate.
    • "env": uses AWS_ACCESS_KEY_ID (or AWS_ACCESS_KEY) and AWS_SECRET_KEY (or AWS_SECRET_ACCESS_KEY) environment variables.
    • "instance": uses EC2 instance profile or attached ECS task role.
    • "profile": uses credentials written in a file. Format of the file is as following, where [...] is a name of profile. ``` [default] aws_access_key_id=YOUR_ACCESS_KEY_ID aws_secret_access_key=YOUR_SECRET_ACCESS_KEY

    [profile2] ...

    - `"properties"`: uses aws.accessKeyId and aws.secretKey Java system properties.
    - `"anonymous"`: uses anonymous access. This auth method can access only public files.
    - `"session"`: uses temporary-generated **access_key_id**, **secret_access_key** and **session_token**.
    - `"assume_role"`: uses temporary-generated credentials by assuming **role_arn** role.
    - `"web_identity_token"`: uses temporary-generated credentials by assuming **role_arn** role with web identity.
    - `"default"`: uses AWS SDK's default strategy to look up available credentials from runtime environment. This method behaves like the combination of the following methods.
    1. `"env"`
    1. `"properties"`
    1. `"web_identity_token"`
    1. `"profile"`
    1. `"instance"`
    
  • profile_file: path to a profile file. this is optionally used when auth_method is "profile". (string, default: given by AWS_CREDENTIAL_PROFILES_FILE environment variable, or ~/.aws/credentials).

  • profile_name: name of a profile. this is optionally used when auth_method is "profile". (string, default: "default")

  • access_key_id: aws access key id. this is required when auth_method is "basic" or "session". (string, optional)

  • secret_access_key: aws secret access key. this is required when auth_method is "basic" or "session". (string, optional)

  • session_token: aws session token. this is required when auth_method is "session". (string, optional)

  • role_arn: arn of the role to assume. this is required for auth_method is "assume_role" or "web_identity_token". (string, optional)

  • role_session_name: an identifier for the assumed role session. this is required when auth_method is "assume_role" or "web_identity_token". (string, optional)

  • role_external_id: a unique identifier that is used by third parties when assuming roles in their customers' accounts. this is optionally used for auth_method: "assume_role". (string, optional)

  • role_session_duration_seconds: duration, in seconds, of the role session. this is optionally used for auth_method: "assume_role". (int, optional)

  • web_identity_token_file: the absolute path to the web identity token file. this is required when auth_method is "web_identity_token". (string, optional)

  • scope_down_policy: an iam policy in json format. this is optionally used for auth_method: "assume_role". (string, optional)

  • endpoint: The AWS Service endpoint (string, optional)

  • region: The AWS region (string, optional)

  • http_proxy: Indicate whether using when accessing AWS via http proxy. (optional)

    • host proxy host (string, required)
    • port proxy port (int, optional)
    • protocol proxy protocol (string, default: "https")
    • user proxy user (string, optional)
    • password proxy password (string, optional)
  • scan: scan operation configuration. This option cannot be used with query option. (See Operation Configuration Details, optional)

  • query: query operation configuration. This option cannot be used with scan option. (See Operation Configuration Details, optional)

  • table: Table Name (string, required)

  • default_timestamp_format: Format of the timestamp if columns.type is "timestamp". (string, optional, default: "%Y-%m-%d %H:%M:%S.%N %z")

  • default_timezone: Time zone of timestamp columns if the value itself doesn’t include time zone description (eg. Asia/Tokyo). (string, optional, default: "UTC")

  • default_date: Set date part if the format doesn’t include date part. (string, optional, default: "1970-01-01")

  • columns: a key-value pairs where key is a column name and value is options for the column. If you do not specify this option, each dynamodb items are processed as a single json. (array of string-to-string map, optional, default: [])

    • name: Name of the column. (string, required)
    • type: Embulk Type of the column that is converted to from dynamodb attribute value as possible. ("boolean", "long","timestamp", "double", "string" or "json", required)
    • attribute_type: Type of the Dynamodb attribute that name matches name of the column. The types except specified one are stored as null when this option is specified. ("S", "N", "B", "SS", "NS", "BS", "M", "L", "NULL" or "BOOL", optional)
    • format: Format of the timestamp if type is "timestamp". (string, optional, default value is specified by default_timestamp_format)
    • timezone: Timezone of the timestamp if the value itself doesn’t include time zone description (eg. Asia/Tokyo). (string, optional, default value is specified by default_timezone)
    • date: Set date part if the format doesn’t include date part. (string, optional, default value is specified by default_date)
  • json_column_name: Name of the column when each dynamodb items are processed as a single json. (string, optional, default: "record")

Operation Configuration Details

Here is the explanation of the configuration for scan option or query option. The configuration has common options and specific options. Sometimes a type called DynamodbAttributeValue appears, see the end of this section first if you are worried about it.

Common Options

  • consistent_read: Require strongly consistent reads or not. ref. (boolean, optional, default: false)
  • exclusive_start_key: When you want to read the middle of the table, specify the attribute as the start key. (string to DynamodbAttributeValue map, optional)
  • expression_attribute_names: An expression attribute name is a placeholder that you use in an Amazon DynamoDB expression as an alternative to an actual attribute name. An expression attribute name must begin with a pound sign (#), and be followed by one or more alphanumeric characters. (string to string map, optional, default: {})
  • expression_attribute_values: If you need to compare an attribute with a value, define an expression attribute value as a placeholder. Expression attribute values in Amazon DynamoDB are substitutes for the actual values that you want to compare—values that you might not know until runtime. An expression attribute value must begin with a colon (:) and be followed by one or more alphanumeric characters. (string to DynamodbAttributeValue map, optional, default: {})
  • filter_expression: A filter expression is applied after the operation finishes, but before the results are returned. Therefore, the operation consumes the same amount of read capacity, regardless of whether a filter expression is present. (string, optional)
  • index_name: Amazon DynamoDB provides fast access to items in a table by specifying primary key values. However, many applications might benefit from having one or more secondary (or alternate) keys available, to allow efficient access to data with attributes other than the primary key. To address this, you can create one or more secondary indexes on a table and issue query or scan operations against these indexes. (string, optional)
  • batch_size: The limit of items by an operation. The final result contains specified number of items or fewer when filter_expression is specified. (int, optional)
  • limit: The limit of total items by operations. (long, optional)
  • projection_expression: To read data from a table, you use operations. Amazon DynamoDB returns all the item attributes by default. To get only some, rather than all of the attributes, use a projection expression. (string, optional)
  • select: The attributes to be returned in the result. You can retrieve all item attributes, specific item attributes, the count of matching items, or in the case of an index, some or all of the attributes projected into the index. ("ALL_ATTRIBUTES", "ALL_PROJECTED_ATTRIBUTES", "SPECIFIC_ATTRIBUTES" or "COUNT", optional)

Options for scan

  • segment: A segment to be scanned by a particular worker. Each worker should use a different value for segment. If segment is not specified and total_segment is specified, this plugin automatically set segment following the number of embulk workers. If segment and total_segment is specified, this plugin loads only the segment, so you loads other segments in other processes. (int, optional)
  • total_segment: The total number of segments for the parallel scan. If segment is not specified and total_segment is specified, this plugin automatically set segment following the number of embulk workers.

Options for query

  • key_condition_expression: To specify the search criteria, you use a key condition expression—a string that determines the items to be read from the table or index. (string, required)
  • scan_index_forward: By default, the sort order is ascending. To reverse the order, set this option is false. (boolean, optional, default: false)

About DynamodbAttributeValue Type

This type of DynamodbAttributeValue is one that can express Dynamodb AttributeValue as Embulk configuration. This configuration has the below options. Only one of these options can be set.

  • S: string value (string, optional)
  • N: number value. (string, optional)
  • B: binary value. (string, optional)
  • SS: array of string value. (array of string, optional)
  • NS: array of number value. (array of number, optional)
  • BS: array of binary value. (array of binary, optional)
  • M: map value. (string to DynamodbAttributeValue map, optional)
  • L: list value. (array of DynamodbAttributeValue, optional)
  • NULL: null or not. (boolean, optional)
  • BOOL: true or false. (boolean, optional)

Deprecated Configuration

You can use the below options yet for the backward compatibility before v0.3.0. However, these are already deprecated, so please use new options instead.

  • access_key: [Deprecated: Use **access_key* instead]* aws access key id. this is required when auth_method is "basic" or "session". (string, optional)
  • secret_key: [Deprecated: Use **secret_access_key* instead]* aws secret access key. this is required when auth_method is "basic" or "session". (string, optional)
  • end_point: [Deprecated: Use **endpoint* instead]* The AWS Service endpoint (string, optional)
  • operation: [Deprecated: Use **scan* or query option instead]* Operation Type ("scan" or "query", required)
  • filters: [Deprecated: Use **query.filter_expression* option or query.filter_expression instead]* Query Filters. (Required if operation is "query", optional if operation is "scan")
    • name: Column name.
    • type: Column type.
    • condition: Comparison Operator.
    • value(s): Attribute Value(s).
  • limit: [Deprecated: Use **query.batch_size* option or query.batch_size instead]* DynamoDB 1-time Scan/Query Operation size limit (int, optional)
  • scan_limit: [Deprecated: Use **query.batch_size* option or query.batch_size instead]* DynamoDB 1-time Scan Query size limit (int, optional)
  • record_limit: [Deprecated: Use **query.limit* option or query.limit instead]* Max Record Search limit (long, optional)
  • columns: [Deprecated: This **columns* option for the deprecated operation. See the above columns option when using a new operation.]* a key-value pairs where key is a column name and value is options for the column (required)
    • name: Column name. (string, required)
    • type: Column values are converted to this embulk type. ("boolean", "long", "double", "string", "json", required)
    • NOTE: Be careful that storing values is skipped when you specify "timestamp".

Example

  • Scan Operation
in:
  type: dynamodb
  auth_method: env
  region: us-east-1
  scan:
    total_segment: 20
  table: embulk-input-dynamodb_example
  columns:
    - {name: ColumnA, type: long}
    - {name: ColumnB, type: double}
    - {name: ColumnC, type: string, attribute_type: S}
    - {name: ColumnD, type: boolean}
    - {name: ColumnE, type: timestamp}
    - {name: ColumnF, type: json}

out:
  type: stdout
  • Query Operation
in:
  type: dynamodb
  auth_method: env
  region: us-east-1
  query:
    key_condition_expression: "#x = :v"
    expression_attribute_names:
      "#x": primary-key
    expression_attribute_values:
      ":v": {S: key-1}
  table: embulk-input-dynamodb_example

out:
  type: stdout

You can see more examples here.

Development

Run examples

$ ./run_dynamodb_local.sh
$ ./example/prepare_dynamodb_table.sh
$ ./gradlew classpath
$ embulk run example/config-query.yml -Ilib

Run tests

## Run dynamodb-local
$ ./run_dynamodb_local.sh
$ AWS_ACCESS_KEY_ID=${YOUR_AWS_ACCESS_KEY_ID} \
    AWS_SECRET_ACCESS_KEY=${YOUR_AWS_SECRET_ACCESS_KEY} \
    EMBULK_DYNAMODB_TEST_ACCESS_KEY=${YOUR_AWS_ACCESS_KEY_ID} \
    EMBULK_DYNAMODB_TEST_SECRET_KEY=${YOUR_AWS_SECRET_ACCESS_KEY} \
    EMBULK_DYNAMODB_TEST_PROFILE_NAME=${YOUR_AWS_PROFILE} \
    EMBULK_DYNAMODB_TEST_ASSUME_ROLE_ROLE_ARN=${YOUR_ROLE_ARN} \
    ./gradlew test

If you do not have any real aws account, you can skip the tests that use the real aws account.

$ ./run_dynamodb_local.sh
$ RUN_AWS_CREDENTIALS_TEST=false ./gradlew test

Run the formatter

## Just check the format violations
$ ./gradlew spotlessCheck

## Fix the all format violations
$ ./gradlew spotlessApply

Build

$ ./gradlew gem  # -t to watch change of files and rebuild continuously

Release gem:

Fix build.gradle, then

$ ./gradlew gemPush

ChangeLog

CHANGELOG.md

License

MIT LICENSE