Jsonpath parser plugin for Embulk

The JSON with JSONPath parser plugin for the Embulk.

Overview

  • Plugin type: parser
  • Guess supported: yes (A JSON data size supports up to 32KB. #476)

Configuration

  • type: Specify this parser as jsonpath
  • columns: Specify column name and type. See below (array, required)
  • root: Specify data path with JSONPath. It must be Array object (string, default:'$') (detail)
  • stop_on_invalid_record: Stop bulk load transaction if a file includes invalid record (such as invalid timestamp) (boolean, default: false)
  • default_timezone: Default timezone of the timestamp (string, default: UTC)
  • default_timestamp_format: Default timestamp format of the timestamp (string, default: %Y-%m-%d %H:%M:%S.%N %z)
  • default_typecast: Specify whether to cast values automatically to the specified types or not (boolean, default: true)

columns

  • name: Name of the column (string, required)
  • type: Type of the column (string, required)
  • timezone: Timezone of the timestamp if type is timestamp (string, default: default_timestamp)
  • format: Format of the timestamp if type is timestamp (string, default: default_format)
  • typecast: Whether cast values or not (boolean, default: default_typecast)
  • path: JSON ppath for specific column. (string, default: null)

Example

Basic Usage

{
  "count": 100,
  "page": 1,
  "results": [
    {
      "name": "Hugh Rutherford",
      "city": "Mitchellfurt",
      "street_name": "Ondricka Island",
      "zip_code": "75232",
      "registered_at": "2015-09-09 05:28:45",
      "vegetarian": true,
      "age": 44,
      "ratio": 79.092
    },
    {
      "name": "Miss Carmella Bashirian",
      "city": "Madilynchester",
      "street_name": "Rhea Walks",
      "zip_code": "44398",
      "registered_at": "2014-07-01 04:25:27",
      "vegetarian": true,
      "age": 73,
      "ratio": 50.608
    }]
}
in:
  type: any file input plugin type
  parser:
    type: jsonpath
    root: "$.results"
    default_timezone: "Asia/Tokyo"
    columns:
      - { name: "name",          type: string }
      - { name: "city",          type: string }
      - { name: "street_name",   type: string }
      - { name: "zip_code",      type: string }
      - { name: "registered_at", type: timestamp, format: "%Y-%m-%d %H:%M:%S" }
      - { name: "vegetarian",    type: boolean }
      - { name: "age",           type: long }
      - { name: "ratio",         type: double }

Preview results

*************************** 1 ***************************
         name (   string) : Hugh Rutherford
         city (   string) : Mitchellfurt
  street_name (   string) : Ondricka Island
     zip_code (   string) : 75232
registered_at (timestamp) : 2015-09-08 20:28:45 UTC
   vegetarian (  boolean) : true
          age (     long) : 44
        ratio (   double) : 79.092
*************************** 2 ***************************
         name (   string) : Miss Carmella Bashirian
         city (   string) : Madilynchester
  street_name (   string) : Rhea Walks
     zip_code (   string) : 44398
registered_at (timestamp) : 2014-06-30 19:25:27 UTC
   vegetarian (  boolean) : true
          age (     long) : 73
        ratio (   double) : 50.608

Handle more complicated json

If you want to handle more complicated json, you can specify jsonpath to also path in columns section like as follows:

{
    "result" : "success",
    "students" : [
      { "names" : ["John", "Lennon"], "age" : 10 },
      { "names" : ["Paul", "Maccartney"], "age" : 10 }
    ]
}
root: $.students
columns:
  - {name: firstName, type: string, path: "names[0]"}
  - {name: lastName, type: string, path: "names[1]"}

In this case, names[0] will be firstName of schema and names[1] will be lastName.

Guess

This plugin supports minimal guess feature. You don't have to write parser: section in the configuration file. After writing in: section, you can let embulk guess parser: section using this command:

$ embulk gem install embulk-parser-jsonpath
$ embulk guess -g jsonpath config.yml -o guessed.yml

Example

If you want to guess the following JSON file, (This JSON data start with array) You don't have to need parser section.

[
  {
    "name": "Hugh Rutherford",
    "city": "Mitchellfurt",
    "street_name": "Ondricka Island",
    "zip_code": "75232",
    "registered_at": "2015-09-09 05:28:45",
    "vegetarian": true,
    "age": 44,
    "ratio": 79.092
  }
]
in:
  type: file
  path_prefix: example/hoge
out:
  type: stdout

However, If a JSON data doesn't start with array, You have to specify root parameter explicitly.

{
  "count": 100,
  "page": 1,
  "results": [
    {
      "name": "Hugh Rutherford",
      "city": "Mitchellfurt",
      "street_name": "Ondricka Island",
      "zip_code": "75232",
      "registered_at": "2015-09-09 05:28:45",
      "vegetarian": true,
      "age": 44,
      "ratio": 79.092
    }
  ]
}
in:
  type: file
  path_prefix: example/input
  parser:
    type: jsonpath
    root: "$.results"
out:
  type: stdout

Build

$ ./gradlew gem  # -t to watch change of files and rebuild continuously

Acknowledgment

I would like to express my special thanks to the developers of embulk-parser-jsonl and embulk-filter-typecast projects.

Almost codes copied from this project.