Row filter plugin for Embulk
A filter plugin for Embulk to filter out rows
Configuration
Versions >= 0.3.0 has where option to support SQL-like syntax.
- where: Select only rows which matches with conditions written in SQL-like syntax. See SQL-like Syntax
Example
filters:
- type: row
where: column1 = 'str'
filters:
- type: row
where: |-
(
string START_WITH 'str' AND
number > 1.0
)
OR
(
time = TIMESTAMP '2016-01-01 +0900' AND
"true" = true
)
See SQL-like Syntax for more details
SQL-like Syntax
Versions >= 0.3.0 suppors SQL-like syntax as:
where: |-
(
string START_WITH 'str' AND
number > 1.0
)
OR
(
time = TIMESTAMP '2016-01-01 +0900' AND
"true" = true AND
string REGEXP '^reg'
)
Literals
Boolean Literal
true or TRUE or false or FALSE are considered as a boolean literal
Number Literal
Characters matching with a regular expression -?[0-9]+(\.[0-9]+)? is considered as a number literal
String Literal
Characters surrounded by ' such as 'foo' is considered as a string literal
Timestamp Literal
NOTE: It became possible to omit TIMESTAMP keyword on comparing with timestamp identifier (column) from version >= 0.3.3.
TIMESTAMP ( NumberLiteral | StringLiteral ) such as TIMESTAMP 1470433087.747123 or TIMESTAMP '2016-08-06 06:38:07.747123 +0900' is considered as a timestamp literal
Number is a epoch time since 1970-01-01 UTC with nano time resolution.
String is a timestamp string which matches with one of following format:
%Y-%m-%d %H:%M:%S.%N %z%Y-%m-%d %H:%M:%S.%N%Y-%m-%d %H:%M:%S %z%Y-%m-%d %H:%M:%S%Y-%m-%d %z%Y-%m-%d
The time zone for formats without %z is UTC, and the time resolution is micro second (caused by limitation of Embulk TimestampParser).
Json Literal
Not supported yet
Identifier Literal
Characters matching with a regular expression [a-zA-Z_][a-zA-z0-9_]* such as foobar, and characters surrounded by " such as "foo-bar", "foo.bar", and "foo\"bar" are considred as an identifier literal, that is, embulk's column name.
Operators
Boolean Operator
=!=
Number Operator (Long and Double)
=!=>>=<=<
String Operator
=!=START_WITHEND_WITHINCLUDEREGEXP
Timestamp Operator
=!=>>=<=<
Json Operator
Not supported yet
unary operator
- "xxx IS NULL"
- "xxx IS NOT NULL"
- "NOT xxx"
Old Configuration
Versions >= 0.3.0 has where option to supports SQL-like syntax. I recommend to use it.
Following options are deprecated, and will be removed someday.
- condition: AND or OR (string, default: AND).
- conditions: select only rows which matches with conditions.
- column: column name (string, required)
- operator operator (string, optional, default: ==)
- boolean operator
==!=
- numeric operator (long, double, Timestamp)
==!=>>=<=<
- string operator
==!=start_with(orstartsWith)end_with(orendsWith)include(orcontains)
- unary operator
IS NULLIS NOT NULL
- argument: argument for the operation (string, required for non-unary operators)
- not: not (boolean, optional, default: false)
- format: special option for timestamp column, specify the format of timestamp argument, parsed argument is compared with the column value as Timestamp object (string, default is
%Y-%m-%d %H:%M:%S.%N %z) - timezone: special option for timestamp column, specify the timezone of timestamp argument (string, default is
UTC)
NOTE: column type is automatically retrieved from input data (inputSchema)
Example (AND)
Deprecated
filters:
- type: row
condition: AND
conditions:
- {column: foo, operator: "IS NOT NULL"}
- {column: id, operator: ">=", argument: 10}
- {column: id, operator: "<", argument: 20}
- {column: name, opeartor: "include", argument: foo, not: true}
- {column: time, operator: "==", argument: "2015-07-13", format: "%Y-%m-%d"}
Example (OR)
Deprecated
filters:
- type: row
condition: OR
conditions:
- {column: a, operator: "IS NOT NULL"}
- {column: b, operator: "IS NOT NULL"}
Example (AND of OR)
Deprecated
You can express a condition such as (A OR B) AND (C OR D) by combining multiple filters like
filters:
- type: row
condition: OR
conditions:
- {column: a, operator: "IS NOT NULL"}
- {column: b, operator: "IS NOT NULL"}
- type: row
condition: OR
conditions:
- {column: c, operator: "IS NOT NULL"}
- {column: d, operator: "IS NOT NULL"}
Comparisions
- embulk-filter-calcite
- embulk-filter-calcite is a pretty nice plugin which enables us to write SQL query to filter embulk records.
- However, based on my benchmark (Japanese), embulk-filter-calcite was 1471 times slower than embulk-filter-row in iterms of string
=operator. Useembulk-filter-row(this plugin) when you need performance.
ToDo
- Support filtering by values of
type: jsonwith JSONPath - Support IN operator
ChangeLog
Development
Run example:
$ ./gradlew classpath
$ embulk preview -I lib example/example.yml
Run test:
$ ./gradlew test
Run checkstyle:
$ ./gradlew check
Release gem:
$ ./gradlew gemPush
Development of SQL-like Syntax
Read the article Supported SQL-like Syntax with embulk-filter-row using BYACC/J and JFlex.
To download BYACC/J and JFlex and run them, you can use:
$ script/byaccj.sh
or
$ ./gradlew byaccj # this runs script/byaccj.sh internally
This generates src/main/java/org/embulk/filter/row/where/{Parser,ParserVal,Yylex}.java.
The byaccj task of gradle is ran before compileJava task (which means to be ran before classpath or test task also) automatically.
