Wukong Deploy Pack

The Infochimps Platform is an end-to-end, managed solution for building Big Data applications. It integrates best-of-breed technologies like Hadoop, Storm, Kafka, MongoDB, Elasticsearch, HBase, &c. and provides simple interfaces for accessing these powerful tools.

Computation, analytics, scripting, &c. are all handled by Wukong within the platform. Wukong is an abstract framework for defining computations on data. Wukong processors and flows can run in many different execution contexts including:

- locally on the command-line for testing or development purposes
- as a Hadoop mapper or reducer for batch analytics or ETL
- within Storm as part of a real-time data flow
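
The idea of defining a computation once and running it in different contexts can be sketched in plain Ruby. This is only an illustration of the concept: the `Tokenizer` class and the `run_locally` helper are hypothetical names invented for this example, not Wukong's own API. The processor receives one record and yields zero or more output records; a small driver feeds it records from whatever source is at hand.

```ruby
# Illustrative only: Tokenizer and run_locally are hypothetical names,
# not part of the Wukong API.
class Tokenizer
  # Receives one record (a line of text) and yields each word in it.
  def process(line)
    line.downcase.scan(/[a-z']+/) { |word| yield word }
  end
end

# Local "execution context": feed each record to the processor and
# collect everything it yields.
def run_locally(processor, records)
  output = []
  records.each do |record|
    processor.process(record.chomp) { |out| output << out }
  end
  output
end

p run_locally(Tokenizer.new, ["Hello, World", "hello again"])
# => ["hello", "world", "hello", "again"]
```

Because the processor only knows about `process` and `yield`, the same class could in principle be driven by a different loop, such as one reading from STDIN inside a Hadoop streaming task.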

The Infochimps Platform uses a deploy pack as the place where developers write all their processors, flows, and jobs. A deploy pack is a container for the Wukong code and plugins needed by an Infochimps Platform application. It includes the following libraries:

- wukong: The core framework for writing processors and chaining them together.
- wukong-hadoop: Run Wukong processors as mappers and reducers within the Hadoop framework. Model Hadoop jobs locally before you run them.
- wonderdog: Connect Wukong processors running within Hadoop to Elasticsearch as either a source or sink for data.
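
To make "model Hadoop jobs locally" concrete, here is a minimal sketch in plain Ruby of what modeling a job amounts to: map every input line, sort the results by key (standing in for Hadoop's shuffle and sort), then reduce each group of equal keys. It does not use the wukong-hadoop API; the function names `map_line`, `reduce_group`, and `run_job_locally` are assumptions made for this example.

```ruby
# Illustrative only: plain Ruby, not the wukong-hadoop API.

# Map step: emit a [word, 1] pair for every word in the line.
def map_line(line)
  line.downcase.scan(/[a-z']+/).map { |word| [word, 1] }
end

# Reduce step: receives a key and all of the values emitted for it.
def reduce_group(word, counts)
  [word, counts.sum]
end

# Model the job locally: map, sort by key, group equal keys, reduce.
def run_job_locally(lines)
  pairs = lines.flat_map { |line| map_line(line) }
  pairs.sort_by(&:first)
       .chunk_while { |a, b| a.first == b.first }
       .map { |group| reduce_group(group.first.first, group.map(&:last)) }
end

p run_job_locally(["to be or not", "to be"])
# => [["be", 2], ["not", 1], ["or", 1], ["to", 2]]
```

Running a small sample through a pipeline like this is a quick way to check mapper and reducer logic before submitting the real job to a cluster.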

Installation

The deploy pack is installed as a RubyGem:

$ sudo gem install wukong-deploy

File Structure

A deploy pack is a repository with the following Rails-like file structure:

├──   app
│   ├──   models
│   ├──   processors
│   ├──   flows
│   └──   jobs
├──   config
│   ├──   environment.rb
│   ├──   application.rb
│   ├──   initializers
│   ├──   settings.yml
│   └──   environments
│       ├──   development.yml
│       ├──   production.yml
│       └──   test.yml
├──   data
├──   Gemfile
├──   Gemfile.lock
├──   lib
├──   log
├──   Rakefile
├──   spec
│   ├──   spec_helper.rb
│   └──   support
└──   tmp

Let's look at it piece by piece:

app: The directory with all the action. It's where you define:
- models: Your domain models or "nouns", which define and wrap the different kinds of data elements in your application. Build them with whatever framework you like (the default is Gorillib).
- processors: Your fundamental operations or "verbs", which are passed records and parse, filter, augment, normalize, or split them.
- flows: Chain processors together into streaming flows for ingestion, real-time processing, or complex event processing (CEP).
- jobs: Pair processors together to create batch jobs to run in Hadoop.
config: Holds the application configuration for all environments.
- environment.rb: Defines the runtime environment for all code, requiring and configuring all Wukong framework code. You shouldn't have to edit this file directly.
- application.rb: Requires and configures libraries specific to your application. Here you choose a model framework and pick which application code gets loaded by default (vs. auto-loaded).
- initializers: Holds any files you need to load before application.rb. Useful for requiring and configuring external libraries.
- settings.yml: Defines application-wide settings.
- environments: Holds environment-specific settings in YAML files named after each environment. These settings override those in config/settings.yml.
data: Holds sample data in flat files. You'll develop and test your application using this data.
Gemfile and Gemfile.lock: Define how libraries are resolved with Bundler.
lib: Holds any code you want to use in your application but that isn't "part of" your application (like vendored libraries, Rake tasks, &c.).
log: A good place to stash logs.
Rakefile: Defines Rake tasks for developing, testing, and deploying your application.
spec: Holds all your RSpec unit tests.
- spec_helper.rb: Loads libraries you'll use during testing and includes spec helper libraries from Wukong.
- support: Holds support code for your tests.
tmp: A good place to stash temporary files.
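
The relationship between config/settings.yml and the files in config/environments can be sketched as a layered merge: load the application-wide settings first, then let the environment's file override matching keys. The code below is a minimal illustration in plain Ruby, assuming nested YAML hashes; it is not necessarily how Wukong itself loads settings, and `deep_merge` and `load_settings` are hypothetical helper names.

```ruby
require "yaml"

# Illustrative only: not Wukong's actual settings loader.

# Recursively merge override into base; override wins on conflicts.
def deep_merge(base, override)
  base.merge(override) do |_key, old, new|
    old.is_a?(Hash) && new.is_a?(Hash) ? deep_merge(old, new) : new
  end
end

# Read a YAML file as a Hash, treating a missing or empty file as {}.
def read_yaml(path)
  return {} unless File.exist?(path)
  YAML.load_file(path) || {}
end

# Load config/settings.yml, then layer config/environments/<env>.yml on top.
def load_settings(root, env)
  base = read_yaml(File.join(root, "config", "settings.yml"))
  over = read_yaml(File.join(root, "config", "environments", "#{env}.yml"))
  deep_merge(base, over)
end
```

For example, `load_settings(Dir.pwd, "development")` run from the root of a deploy pack would return the application-wide settings with any values from development.yml taking precedence.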