Timestamp format filter plugin for Embulk

Build Status

A filter plugin for Embulk to change timestamp format

Configuration

  • columns: columns to retain (array of hash)
    • name: name of column (required)
    • type: type to cast, choose one of string, timestamp, long (unixtimestamp), double (unixtimestamp) (string, default is string)
    • from_format: specify the format of the input string (array of strings, default is default_from_timestamp_format)
    • from_timezone: specify the timezone of the input string (string, default is default_from_timezone)
    • to_format: specify the format of the output string (string, default is default_to_timestamp_format)
    • to_timezone: specify the timezone of the output string (string, default is default_to_timezone)
    • from_unit: specify the time unit of the input unixtimestamp (string, default is default_from_timestamp_unit)
    • to_unit: specify the time unit of the output unixtimestamp (string, default is default_to_timestamp_unit)
  • default_from_timestamp_format: default timestamp format for the input string (array of strings, default is ["%Y-%m-%d %H:%M:%S.%N %z"])
  • default_from_timezone: default timezone for the input string (string, default is UTC)
  • default_to_timestamp_format: default timestamp format for the output string (string, default is %Y-%m-%d %H:%M:%S.%N %z)
  • default_to_timezone: default timezone for the output string (string, default is UTC)
  • default_from_timestamp_unit: default time unit such as second, ms, us, ns for the input unixtimestamp (string, default is second)
  • default_to_timestamp_unit: default time unit such as second, ms, us, ns for the output unixtimestamp (string, default is second)
  • stop_on_invalid_record: stop bulk load transaction if a invalid record is found (boolean, default is false)

Example

Say example.jsonl is as follows (this is a typical format which Exporting BigQuery table outputs):

{"timestamp":"2015-07-12 15:00:00 UTC","nested":{"timestamp":"2015-07-12 15:00:00 UTC"}}
{"timestamp":"2015-07-12 15:00:00.1 UTC","nested":{"timestamp":"2015-07-12 15:00:00.1 UTC"}}
in:
  type: file
  path_prefix: example/example.jsonl
  parser:
    type: jsonl # not json parser
    columns:
    - {name: timestamp, type: string}
    - {name: nested, type: json}
filters:
  - type: timestamp_format
    default_from_timestamp_format: ["%Y-%m-%d %H:%M:%S.%N %z", "%Y-%m-%d %H:%M:%S %z"]
    default_to_timezone: "Asia/Tokyo"
    default_to_timestamp_format: "%Y-%m-%d %H:%M:%S.%N"
    columns:
      - {name: timestamp, type: long, to_unit: ms}
      - {name: $.nested.timestamp}
out:
  type: stdout

Output will be as:

{"timestamp":1436713200000,"nested":{"timestamp":"2015-07-13 00:00:00.0}}
{"timestamp":1436713200100,"nested":{"timestamp":"2015-07-13 00:00:00.1}}

See ./example for more examples.

JRuby Timestamp Parser/Formatter Performance Issue

Embulk's timestamp parser/formatter originally uses jruby implementation, but it is slow. To improve performance, this plugin also supports Java's Joda-Time DateTimeFormat format as:

in:
  type: file
  path_prefix: example/example.jsonl
  parser:
    type: jsonl
    columns:
    - {name: timestamp, type: string}
    - {name: nested, type: json}
filters:
  - type: timestamp_format
    default_from_timestamp_format: ["yyyy-MM-dd HH:mm:ss.SSS z", "yyyy-MM-dd HH:mm:ss z", "yyyy-MM-dd HH:mm:ss"]
    default_to_timezone: "Asia/Taipei"
    default_to_timestamp_format: "yyyy-MM-dd HH:mm:ss.SSS Z"
    columns:
      - {name: timestamp, type: long, to_unit: ms}
      - {name: $.nested.timestamp}
out:
  type: stdout

If format strings contain %, jruby parser/formatter is used. Otherwirse, java parser/formatter is used

COMPARISON:

Benchmark test sets are available at ./bench. In my environment (Mac Book Pro), for 1000000 timestamps:

  • java parser / java formatter: 1.3s
  • java parser / jruby formatter: 1.4s
  • jruby parser / java formatter: 64.52s
  • jruby parser / jruby formatter: 65.06s

Nano Resolution

JRuby parser has micro second resolution. Java (Joda-Time) parser has milli second resolution (although Java8's DateTimeFormatter supports nano second resolution)

Nano second resolution is partially supported by this plugin itself. Use parser format nnnnnnnnn for Java parser as

yyyy-MM-dd HH:mm:ss.nnnnnnnnn z

This plugin finds places of nano second from texts with regular expression \.(\d+).

For formatter, you can use nnnnnnnnn for nano and nnnnnn for micro as

yyyy-MM-dd HH:mm:ss.nnnnnnnnn z
yyyy-MM-dd HH:mm:ss.nnnnnn z

ToDo

  • Write test

Development

Run example:

$ ./gradlew classpath
$ embulk preview -I lib example/example.yml

Run test:

$ ./gradlew test

Run checkstyle:

$ ./gradlew check

Release gem:

$ ./gradlew gemPush