cluster_chef

Infrastructure as code: describe and orchestrate whole clusters of cloud or virtual machines.

Overview

The cluster_chef toolset takes chef to the next level:

  • cluster_chef gem: assembles machines and roles into a working cluster using a clean, expressive domain-specific language.
  • cluster_chef-knife knife plugin: orchestrate clusters of machines, in the cloud or VMs.
  • cluster_chef-homebase: Our collection of industrial-strength, cloud-ready recipes for Hadoop, HBase, Cassandra, Elasticsearch, Zabbix and more.
  • the metachef cookbook: coordinate discovery of services ("list all the machines for awesome_webapp, that I might load balance them") and aspects ("

Walkthrough

Here's a very simple cluster:

ClusterChef.cluster 'awesome' do
  cloud(:ec2) do
    flavor              't1.micro'
  end

  role                  :base_role
  role                  :chef_client
  role                  :ssh

  # The database server
  facet :dbnode do
    instances           1
    role                :mysql_server

    cloud do
      flavor           'm1.large'
      backing          'ebs'
    end
  end

  # A throwaway facet for development.
  facet :webnode do
    instances           2
    role                :nginx_server
    role                :awesome_webapp
  end
end

This code defines a cluster named demosimple. A cluster is a group of servers united around a common purpose, in this case to serve a scalable web application.

The awesome cluster has two 'facets' -- dbnode and webnode. A facet is a subgroup of interchangeable servers that provide a logical set of systems: in this case, the systems that store the website's data and those that render it.

The dbnode facet has one server, which will be named 'awesome-dbnode-0'; the webnode facet has two servers, 'awesome-webnode-0' and 'awesome-webnode-1'.

Each server inherits the appropriate behaviors from its facet and cluster. All the servers in this cluster have the base_role, chef_client and ssh roles. The dbnode machines additionally house a MySQL server, while the webnodes have an nginx reverse proxy for the custom awesome_webapp.

As you can see, the dbnode facet asks for a different flavor of machine ('m1.large') than the cluster default ('t1.micro'). Settings in the facet override those in the server, and settings in the server override those of its facet. You economically describe only what's significant about each machine.

Cluster-level tools

$ knife cluster show awesome

  +--------------------+-------+------------+-------------+--------------+---------------+-----------------+----------+--------------+------------+------------+
  | Name               | Chef? | InstanceID | State       | Public IP    | Private IP    | Created At      | Flavor   | Image        | AZ         | SSH Key    |
  +--------------------+-------+------------+-------------+--------------+---------------+-----------------+----------+--------------+------------+------------+
  | awesome-dbnode-0   | yes   | i-43c60e20 | running     | 107.22.6.104 | 10.88.112.201 | 20111029-204156 | t1.micro | ami-cef405a7 | us-east-1a | awesome    |
  | awesome-webnode-0  | yes   | i-1233aef1 | running     | 102.99.3.123 | 10.88.112.123 | 20111029-204156 | t1.micro | ami-cef405a7 | us-east-1a | awesome    |
  | awesome-webnode-1  | yes   | i-0986423b | not running |              |               |                 |          |              |            |            |
  +--------------------+-------+------------+-------------+--------------+---------------+-----------------+----------+--------------+------------+------------+


The commands available are

  • list -- lists known clusters
  • show -- show the named servers
  • launch -- launch server
  • bootstrap
  • sync
  • ssh
  • start/stop
  • kill
  • kick -- trigger a chef-client run on each named machine, tailing the logs until the run completes

Advanced clusters remain simple

Let's say that app is truly awesome, and the features and demand increases. This cluster adds an ElasticSearch server for searching, a haproxy loadbalancer, and spreads the webnodes across two availability zones.

ClusterChef.cluster 'webserver_demo' do
  cloud(:ec2) do
    image_name          "maverick"
    flavor              "t1.micro"
    availability_zones  ['us-east-1a']
  end

  # The database server
  facet :dbnode do
    instances           1
    role                :mysql_server
    cloud do
      flavor           'm1.large'
      backing          'ebs'
    end

    volume(:data) do
      size              20
      keep              true                        
      device            '/dev/sdi'                  
      mount_point       '/data'              
      snapshot_id       'snap-a10234f'             
      attachable        :ebs
    end
  end

  facet :webnode do
    instances           6
    cloud.availability_zones  ['us-east-1a', 'us-east-1b']

    role                :nginx_server
    role                :awesome_webapp
    role                :elasticsearch_client

    volume(:server_logs) do
      size              5                           
      keep              true                        
      device            '/dev/sdi'                  
      mount_point       '/server_logs'              
      snapshot_id       'snap-d9c1edb1'             
    end
  end

  facet :esnode do
    instances           1
    role                "elasticsearch_data_esnode"
    role                "elasticsearch_http_esnode"
    cloud.flavor        "m1.large"
  end

  facet :loadbalancer do
    instances           1
    role                "haproxy"
    cloud.flavor        "m1.xlarge"
    elastic_ip          "128.69.69.23"
  end

  cluster_role.override_attributes({
    :elasticsearch => {
      :version => '0.17.8',
    },
  })
end

The facets are described and scale independently. If you'd like to add more webnodes, just increase the instance count. If a machine misbehaves, just terminate it. Running knife cluster launch awesome webnode will note which machines are missing, and launch and configure them appropriately.

ClusterChef speaks naturally to both Chef and your cloud provider. The esnode's cluster_role.override_attributes statement will be synchronized to the chef server, pinning the elasticsearch version across the clients and server.. Your chef roles should focus system-specific information; the cluster file lets you see the architecture as a whole.

With these simple settings, if you have already set up chef's knife to launch cloud servers, typing knife cluster launch demosimple --bootstrap will (using Amazon EC2 as an example):

  • Synchronize to the chef server:
    • create chef roles on the server for the cluster and each facet.
    • apply role directives (eg the homebase's default_attributes declaration).
    • create a node for each machine
    • apply the runlist to each node
  • Set up security isolation:
    • uses a keypair (login ssh key) isolated to that cluster
    • Recognizes the ssh role, and add a security group ssh that by default opens port 22.
    • Recognize the nfs_server role, and adds security groups nfs_server and nfs_client
    • Authorizes the nfs_server to accept connections from all nfs_clients. Machines in other clusters that you mark as nfs_clients can connect to the NFS server, but are not automatically granted any other access to the machines in this cluster. ClusterChef's opinionated behavior is about more than saving you effort -- tying this behavior to the chef role means you can't screw it up.
  • Launches the machines in parallel:
    • using the image name and the availability zone, it determines the appropriate region, image ID, and other implied behavior.
    • passes a JSON-encoded user_data hash specifying the machine's chef node_name and client key. An appropriately-configured machine image will need no further bootstrapping -- it will connect to the chef server with the appropriate identity and proceed completely unattended.
  • Syncronizes to the cloud provider:
    • Applies EC2 tags to the machine, making your console intelligible: AWS Console screenshot
    • Connects external (EBS) volumes, if any, to the correct mount point -- it uses (and applies) tags to the volumes, so they know which machine to adhere to. If you've manually added volumes, just make sure they're defined correctly in your cluster file and run knife cluster sync {cluster_name}; it will paint them with the correct tags.
    • Associates an elastic IP, if any, to the machine
  • Bootstraps the machine using knife bootstrap

Getting Started

This assumes you have installed chef, have a working chef server, and have an AWS account. If you can run knife and use the web browser to see your EC2 console, you can start here. If not, see the instructions below.

Setup

bundle install

Your first cluster

Let's create a cluster called 'demosimple'. It's, well, a simple demo cluster.

Create a simple demo cluster

Create a directory for your clusters; copy the demosimple cluster and its associated roles from cluster_chef:

        mkdir -p $CHEF_REPO_DIR/clusters
        cp cluster_chef/clusters/{defaults,demosimple}.rb ./clusters/
        cp cluster_chef/roles/{big_package,nfs_*,ssh,base_role,chef_client}.rb  ./roles/

Lastly, add the cookbooks, site-cookbooks, and meta-cookbooks directories from cluster_chef to the cookbooks_path in your knife.rb, and push everything to the chef server. (see below for details).

knife cluster launch

Hooray! You're ready to launch a cluster:

    knife cluster launch demosimple homebase --bootstrap```

It will kick off a node and then bootstrap it. You'll see it install a whole bunch of things. Yay.

__________________________________________________________________________

## Philosophy

Some general principles of how we use chef.

* *Chef server is never the repository of truth* -- it only mirrors the truth.
  - a file is tangible and immediate to access
* Specifically, we want truth to live in the git repo, and be enforced by the chef server. *There is no truth but git, and chef is its messenger*.
  - this means that everything is versioned, documented and exchangeable.
* *Systems, services and significant modifications cluster should be obvious from the `clusters` file*.  I don't want to have to bounce around nine different files to find out which thing installed a redis:server.
  - basically, the existence of anything that opens a port should be obvious when I look at the cluster file.
* *Roles define systems, clusters assemble systems into a machine*.
  - For example, a resque worker queue has a redis, a webserver and some config files -- your cluster should invoke a @whatever_queue@ role, and the @whatever_queue@ role should include recipes for the component services.
  - the existence of anything that opens a port _or_ runs as a service should be obvious when I look at the roles file.
* *include_recipe considered harmful* Do NOT use include_recipe for anything that a) provides a service, b) launches a daemon or c) is interesting in any way. (so: @include_recipe java@ yes; @include_recipe iptables@ no.) You should note the dependency in the metadata.rb. This seems weird, but the breaking behavior is purposeful: it makes you explicitly state all dependencies.
* It's nice when *machines are in full control of their destiny*.
  - initial setup (elastic IP, attaching a drive) is often best enforced externally
  - but machines should be ablt independently assert things like load balancer registration that that might change at any point in the lifetime.
* It's even nicer, though, to have *full idempotency from the command line*: I can at any time push truth from the git repo to the chef server and know that it will take hold.

__________________________________________________________________________

## Advanced Superpowers

#### Auto-vivifying machines (no bootstrap required!)

On EC2, you can make a machine that auto-vivifies -- no bootstrap necessary. Burn an AMI that has the `config/client.rb` file in /etc/chef/client.rb. It will use the ec2 userdata (passed in by knife) to realize its purpose in life, its identity, and the chef server to connect to; everything happens automagically from there. No parallel ssh required!

#### EBS Volumes

Define a `snapshot_id` for your volumes, and set `create_at_launch` true.

__________________________________________________________________________


## Extended Installation Notes

#### Set up Knife on your local machine, and a Chef Server in the cloud

If you already have a working chef installation you can skip this section.

To get started with knife and chef, follow the "Chef Quickstart,":http://wiki.opscode.com/display/chef/Quick+Start We use the hosted chef service and are very happy, but there are instructions on the wiki to set up a chef server too. Stop when you get to "Bootstrap the Ubuntu system" -- cluster chef is going to make that much easier.

#### Cloud setup

Next,

* sign up for an AWS account
* Follow the "Knife with AWS quickstart": on the opscode wiki.

Right now cluster chef works well with AWS.  If you're interested in modifying it to work with other cloud providers, "see here":https://github.com/infochimps/cluster_chef/issues/28 or get in touch.

#### Knife setup

In your `.chef/knife.rb`, modify the cookbook path to include cluster_chef's `cookbooks`, `meta-cookbooks` and `site-cookbooks`, and to add settings for `cluster_chef_path`, `cluster_path` and `keypair_path`. Here's mine:

    current_dir = File.dirname(__FILE__)
    organization  = 'CHEF_ORGANIZATION'
    username      = 'CHEF_USERNAME'

    # The full path to your cluster_chef installation
    cluster_chef_path File.expand_path("#{current_dir}/../cluster_chef")
    # The list of paths holding clusters
    cluster_path      [ File.expand_path("#{current_dir}/../clusters") ]
    # The directory holding your cloud keypairs
    keypair_path      File.expand_path(current_dir)

    log_level                :info
    log_location             STDOUT
    node_name                username
    client_key               "#{keypair_path}/#{username}.pem"
    validation_client_name   "#{organization}-validator"
    validation_key           "#{keypair_path}/#{organization}-validator.pem"
    chef_server_url          "https://api.opscode.com/organizations/#{organization}"
    cache_type               'BasicFile'
    cache_options( :path => "#{ENV['HOME']}/.chef/checksums" )

    # The first things have lowest priority (so, site-cookbooks gets to win)
    cookbook_path            [
      "#{cluster_chef_path}/cookbooks",      # std cookbooks from opscode/cookbooks
      "#{cluster_chef_path}/meta-cookbooks", # coordinate services among cookbooks
      "#{cluster_chef_path}/site-cookbooks", # infochimps' collection of cookbooks
      "#{current_dir}/../cookbooks",         
      "#{current_dir}/../site-cookbooks",    # your internal cookbooks
    ]

    # If you primarily use AWS cloud services:
    knife[:ssh_address_attribute] = 'cloud.public_hostname'
    knife[:ssh_user] = 'ubuntu'

    # Configure bootstrapping
    knife[:bootstrap_runs_chef_client] = true
    bootstrap_chef_version   "~> 0.10.0"

    # AWS access credentials
    knife[:aws_access_key_id]      = "XXXXXXXXXXX"
    knife[:aws_secret_access_key]  = "XXXXXXXXXXXXX"

#### Push to chef server

To send all the cookbooks and role to the chef server, visit your cluster_chef directory and run:

```ruby
        cd $CHEF_REPO_DIR
        mkdir -p $CHEF_REPO_DIR/site-cookbooks
        knife cookbook upload --all
        for foo in roles/*.rb ; do knife role from file $foo & sleep 1 ; done

You should see all the cookbooks defined in cluster_chef/cookbooks (ant, apt, ...) listed among those it uploads.