What's Embulk?

Embulk is a plugin-based parallel bulk data loader that helps transfer data between various storage systems, databases, NoSQL stores, and cloud services.

You can release plugins to share your work on data cleaning, error handling, transaction control, and retrying. Packaging that work into plugins brings OSS-style development to data-loading scripts, which tend to be one-time ad hoc scripts.

Embulk, an open-source plugin-based parallel bulk data loader at Slideshare

Document

Quick Start

The single-file package is the simplest way to try Embulk. You can download the latest embulk-VERSION.jar from the releases page and run it with java.

Linux & Mac & BSD

Embulk is a Java application. Make sure Java is installed before you continue.
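
One way to check the Java prerequisite from a shell is sketched below. The helper name have_cmd and the java_status variable are made up for this illustration and are not part of Embulk.

```shell
# Report whether a command is available on PATH.
# (have_cmd is a hypothetical helper, not an Embulk command.)
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

if have_cmd java; then
  # "java -version" writes its banner to stderr, so redirect it to stdout.
  java -version 2>&1 | head -n 1
  java_status="found"
else
  echo "java not found on PATH; install a Java runtime first" >&2
  java_status="missing"
fi
```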

The following four commands install Embulk in your home directory:

curl --create-dirs -o ~/.embulk/bin/embulk -L "http://dl.embulk.org/embulk-latest.jar"
chmod +x ~/.embulk/bin/embulk
echo 'export PATH="$HOME/.embulk/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
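
The commands above only update ~/.bashrc, so a shell that reads a different profile may not see the new PATH entry. A small sketch for checking this follows; path_contains is a hypothetical helper written for this example.

```shell
# path_contains DIR [PATH_STRING]
# Succeeds if DIR is one of the colon-separated entries of PATH_STRING
# (defaults to the current $PATH). Hypothetical helper for illustration.
path_contains() {
  case ":${2-$PATH}:" in
    *":$1:"*) return 0 ;;
    *) return 1 ;;
  esac
}

if path_contains "$HOME/.embulk/bin"; then
  echo "$HOME/.embulk/bin is on PATH"
else
  echo "$HOME/.embulk/bin is not on PATH yet; open a new shell or re-source your profile"
fi
```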

Next step: Trying the example

Windows

Embulk is a Java application. Make sure Java is installed before you continue.

The jar file can be treated as a .bat file, so download it with a .bat extension:

PowerShell -Command "& {Invoke-WebRequest http://dl.embulk.org/embulk-latest.jar -OutFile embulk.bat}"

Next step: Trying the example

Trying the example

As an example, let's load a CSV file. The embulk example subcommand generates a CSV file and a config file for you.

embulk example ./try1
embulk guess   ./try1/example.yml -o config.yml
embulk preview config.yml
embulk run     config.yml

Next step: Using plugins

Using plugins

You can use plugins to load data from/to various systems and file formats. Here is the list of publicly released plugins: list of plugins by category.

One example is the embulk-output-command plugin, which executes an external command to output the records.

To install plugins, use the embulk gem install <name> command:

embulk gem install embulk-output-command
embulk gem list

Embulk bundles some built-in plugins, such as embulk-encoder-gzip and embulk-formatter-csv. You can use those plugins with the following configuration file:

in:
  # ...
out:
  type: command
  command: "cat - > task.$INDEX.$SEQID.csv.gz"
  encoders:
    - {type: gzip}
  formatter:
    type: csv

Using plugin bundle

The embulk bundle subcommand creates (or updates, if it already exists) a private, isolated bundle of plugins. You can use the bundle with the -b <bundle_dir> option. embulk bundle also generates some example plugins as <bundle_dir>/embulk/*.rb files.

See the generated <bundle_dir>/Gemfile file to learn how plugin bundles work.

embulk bundle ./embulk_bundle
embulk guess  -b ./embulk_bundle ...
embulk run    -b ./embulk_bundle ...

Releasing plugins to RubyGems

TODO: documents

embulk-plugin-xyz

Resuming a failed transaction

Embulk supports resuming failed transactions. To enable resuming, start the transaction with the -r PATH option:

embulk run config.yml -r resume-state.yml

If the transaction fails, Embulk stores its state in the YAML file. You can retry the transaction using exactly the same command:

embulk run config.yml -r resume-state.yml

If you give up on resuming the transaction, you can use the embulk cleanup subcommand to delete intermediate data:

embulk cleanup config.yml -r resume-state.yml
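
Since the retry uses exactly the same command, the run, retry, and cleanup steps can be wrapped in a small loop. This is only a sketch: retry_cmd is a hypothetical helper, the limit of three attempts is arbitrary, and whether automatic cleanup is appropriate depends on your workflow.

```shell
# retry_cmd MAX CMD [ARGS...]
# Runs CMD up to MAX times and stops at the first success.
# (Hypothetical helper for illustration, not an Embulk feature.)
retry_cmd() {
  max_attempts=$1
  shift
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      return 0
    fi
    echo "attempt $attempt of $max_attempts failed" >&2
    attempt=$((attempt + 1))
  done
  return 1
}

# Intended use (needs embulk on PATH, so it is left commented out here):
# retry_cmd 3 embulk run config.yml -r resume-state.yml \
#   || embulk cleanup config.yml -r resume-state.yml
```

Every attempt passes the same -r file, so a later attempt can pick up the state stored by an earlier failed one.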

Embulk Development

Build

./gradlew cli  # creates pkg/embulk-VERSION.jar
./gradlew gem  # creates pkg/embulk-VERSION.gem

You can see JaCoCo's test coverage report at ${project}/build/reports/tests/index.html
You can see the FindBugs report at ${project}/build/reports/findbug/main.html  # FIXME coverage information is not included somehow

You can run the classpath task so that ./bin/embulk can be used for development:

./gradlew classpath  # -x test: skip test
./bin/embulk

To deploy artifacts to your local Maven repository at ~/.m2/repository/:

./gradlew install

To compile the source code of the embulk-core project only:

./gradlew :embulk-core:compileJava

The dependencies task shows the dependency tree of the embulk-core project:

./gradlew :embulk-core:dependencies

Documents

Embulk uses Sphinx, YARD (Ruby API) and JavaDoc (Java API) for document generation.

brew install python
pip install sphinx
gem install yard
./gradlew site
# generated documents are in embulk-docs/build/html

Release

You need to add your Bintray account information to ~/.gradle/gradle.properties:

bintray_user=(bintray user name)
bintray_api_key=(bintray api key)

Run the following commands and follow their instructions:

./gradlew set_version -Pto=$VERSION
./gradlew releaseCheck
./gradlew release
git commit -am v$VERSION
git tag v$VERSION

See also: