Wukong Deploy Pack
The Infochimps Platform is an end-to-end, managed solution for building Big Data applications. It integrates best-of-breed technologies like Hadoop, Storm, Kafka, MongoDB, Elasticsearch, HBase, &c. and provides simple interfaces for accessing these powerful tools.
Computation, analytics, scripting, &c. are all handled by Wukong within the platform. Wukong is an abstract framework for defining computations on data. Wukong processors and flows can run in many different execution contexts including:
- locally on the command-line for testing or development purposes
- as a Hadoop mapper or reducer for batch analytics or ETL
- within Storm as part of a real-time data flow
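To make the idea of execution contexts concrete, here is a minimal plain-Ruby sketch. It does not use Wukong's actual API: the `Reverser` class and the two driver methods, `run_locally` and `run_as_stream`, are hypothetical names introduced only for illustration. The point is that a processor only knows how to handle one record at a time, while the surrounding driver decides where records come from and where results go.

```ruby
require 'stringio'

# Hypothetical processor: receives one record and yields zero or more outputs.
# Plain Ruby for illustration only; this is not Wukong's API.
class Reverser
  def process(record)
    yield record.reverse
  end
end

# Driver 1: run over an in-memory list, as you might while developing.
def run_locally(processor, records)
  outputs = []
  records.each { |record| processor.process(record) { |out| outputs << out } }
  outputs
end

# Driver 2: run line by line over IO streams, roughly the way a Hadoop
# streaming mapper reads from STDIN and writes to STDOUT.
def run_as_stream(processor, input, output)
  input.each_line do |line|
    processor.process(line.chomp) { |out| output.puts(out) }
  end
end

# The same processor object works unchanged under both drivers.
puts run_locally(Reverser.new, %w[hadoop storm]).inspect

sink = StringIO.new
run_as_stream(Reverser.new, StringIO.new("hadoop\nstorm\n"), sink)
print sink.string
```

Swapping `StringIO` for `$stdin` and `$stdout` in the second driver would turn the sketch into a simple command-line filter.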
The Infochimps Platform uses the concept of a deploy pack: a single place where developers build all their processors, flows, and jobs. The deploy pack can be thought of as a container for all the Wukong code and plugins needed by an Infochimps Platform application. It includes the following libraries:
- wukong: The core framework for writing processors and chaining them together.
- wukong-hadoop: Runs Wukong processors as mappers and reducers within the Hadoop framework. Lets you model Hadoop jobs locally before you run them.
- wonderdog: Connects Wukong processors running within Hadoop to Elasticsearch as either a source or sink for data.
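The ideas of chaining processors together and modeling a Hadoop job locally can be sketched in plain Ruby. Everything here is hypothetical and written for illustration (`SplitWords`, `Normalize`, `Flow`, `CountGroup`, `LocalJob`); it is not the DSL that wukong or wukong-hadoop provide, only a rough model of what they do for you.

```ruby
# Processor: split a line of text into words.
class SplitWords
  def process(line)
    line.scan(/\w+/) { |word| yield word }
  end
end

# Processor: downcase each word and drop one-letter words.
class Normalize
  def process(word)
    word = word.downcase
    yield word if word.length > 1
  end
end

# Flow: feed every output of one processor into the next one in the chain.
class Flow
  def initialize(*processors)
    @processors = processors
  end

  def process(record, &emit)
    run(@processors, record, &emit)
  end

  private

  def run(stages, record, &emit)
    return emit.call(record) if stages.empty?
    first, *rest = stages
    first.process(record) { |out| run(rest, out, &emit) }
  end
end

# Reducer-style processor: receives [key, values] and yields [key, count].
class CountGroup
  def process(group)
    key, values = group
    yield [key, values.size]
  end
end

# Job: pair a mapper with a reducer. Grouping and sorting in memory stands
# in for the shuffle-and-sort step Hadoop performs between the two phases.
class LocalJob
  def initialize(mapper, reducer)
    @mapper = mapper
    @reducer = reducer
  end

  def run(records)
    mapped = []
    records.each { |record| @mapper.process(record) { |out| mapped << out } }
    results = []
    mapped.group_by(&:itself).sort.each do |key, values|
      @reducer.process([key, values]) { |out| results << out }
    end
    results
  end
end

mapper = Flow.new(SplitWords.new, Normalize.new)
job = LocalJob.new(mapper, CountGroup.new)
p job.run(["Big data", "a big plan"])
```

Because a `Flow` exposes the same `process` method as a single processor, it can be used anywhere a processor is expected, including as the mapper of a job.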
Installation
The deploy pack is installed as a RubyGem:
$ sudo gem install wukong-deploy
File Structure
A deploy pack is a repository with the following Rails-like file structure:
├── app
│   ├── models
│   ├── processors
│   ├── flows
│   └── jobs
├── config
│   ├── environment.rb
│   ├── application.rb
│   ├── initializers
│   ├── settings.yml
│   └── environments
│       ├── development.yml
│       ├── production.yml
│       └── test.yml
├── data
├── Gemfile
├── Gemfile.lock
├── lib
├── log
├── Rakefile
├── spec
│   ├── spec_helper.rb
│   └── support
└── tmp
Let's look at it piece by piece:
- app: The directory with all the action. It's where you define:
  - models: Your domain models or "nouns", which define and wrap the different kinds of data elements in your application. They are built using whatever framework you like (defaults to Gorillib).
  - processors: Your fundamental operations or "verbs", which are passed records and parse, filter, augment, normalize, or split them.
  - flows: Chain together processors into streaming flows for ingestion, real-time processing, or complex event processing (CEP).
  - jobs: Pair processors together to create batch jobs to run in Hadoop.
- config: Where you place all application configuration for all environments.
  - environment.rb: Defines the runtime environment for all code, requiring and configuring all Wukong framework code. You shouldn't have to edit this file directly.
  - application.rb: Requires and configures libraries specific to your application. Here you choose a model framework and pick what application code gets loaded by default (vs. auto-loaded).
  - initializers: Holds any files you need to load before application.rb. Useful for requiring and configuring external libraries.
  - settings.yml: Defines application-wide settings.
  - environments: Defines environment-specific settings in YAML files named after the environment. These override config/settings.yml.
- data: Holds sample data in flat files. You'll develop and test your application using this data.
- Gemfile and Gemfile.lock: Define how libraries are resolved with Bundler.
- lib: Holds any code you want to use in your application but that isn't "part of" your application (like vendored libraries, Rake tasks, &c.).
- log: A good place to stash logs.
- Rakefile: Defines Rake tasks for developing, testing, and deploying your application.
- spec: Holds all your RSpec unit tests.
  - spec_helper.rb: Loads libraries you'll use during testing and includes spec helper libraries from Wukong.
  - support: Holds support code for your tests.
- tmp: A good place to stash temporary files.
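The relationship between config/settings.yml and the files under config/environments can be illustrated with a short sketch. It assumes the override is a recursive (deep) merge, which may not match what the deploy pack actually does. The setting names, the `deep_merge` and `load_settings` helpers, and the inline YAML are all made up for illustration.

```ruby
require 'yaml'

# Recursively merge two hashes; values from `override` win on conflict.
def deep_merge(base, override)
  base.merge(override) do |_key, old_val, new_val|
    old_val.is_a?(Hash) && new_val.is_a?(Hash) ? deep_merge(old_val, new_val) : new_val
  end
end

# Inline stand-in for config/settings.yml (hypothetical keys).
SETTINGS_YML = <<~YAML
  log_level: info
  elasticsearch:
    host: localhost
    port: 9200
YAML

# Inline stand-in for config/environments/production.yml (hypothetical keys).
PRODUCTION_YML = <<~YAML
  log_level: warn
  elasticsearch:
    host: es.example.com
YAML

# Application-wide settings come first; environment settings override them.
def load_settings(common_yaml, environment_yaml)
  common = YAML.safe_load(common_yaml) || {}
  environment = YAML.safe_load(environment_yaml) || {}
  deep_merge(common, environment)
end

settings = load_settings(SETTINGS_YML, PRODUCTION_YML)
puts settings.inspect
```

In this sketch the production file changes `log_level` and the Elasticsearch host, while the port is inherited from the application-wide settings.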