finite_mdp

SYNOPSIS

Solve small, finite Markov Decision Process (MDP) models.

This library provides several ways of describing an MDP model (see FiniteMDP::Model) and some reasonably efficient implementations of policy iteration and value iteration to solve it (see FiniteMDP::Solver).

Usage

Example 1: Recycling Robot

The following shows how to solve the recycling robot model (example 3.7) from <cite>Sutton and Barto (1998). Reinforcement Learning: An Introduction</cite>.

<blockquote> At each time step, the robot decides whether it should (1) actively search for a can, (2) remain stationary and wait for someone to bring it a can, or (3) go back to home base to recharge its battery. The best way to find cans is to actively search for them, but this runs down the robot's battery, whereas waiting does not. Whenever the robot is searching, the possibility exists that its battery will become depleted. In this case the robot must shut down and wait to be rescued (producing a low reward). The agent makes its decisions solely as a function of the energy level of the battery. It can distinguish two levels, high and low. </blockquote>

The transition model is described in Table 3.1, which can be fed directly into FiniteMDP using the FiniteMDP::TableModel, as follows.

require 'finite_mdp'

alpha    = 0.1 # Pr(stay at high charge if searching | now have high charge)
beta     = 0.1 # Pr(stay at low charge if searching | now have low charge)
r_search = 2   # reward for searching
r_wait   = 1   # reward for waiting
r_rescue = -3  # reward (actually penalty) for running out of charge

model = FiniteMDP::TableModel.new [
  [:high, :search,   :high, alpha,   r_search],
  [:high, :search,   :low,  1-alpha, r_search],
  [:low,  :search,   :high, 1-beta,  r_rescue],
  [:low,  :search,   :low,  beta,    r_search],
  [:high, :wait,     :high, 1,       r_wait],
  [:high, :wait,     :low,  0,       r_wait],
  [:low,  :wait,     :high, 0,       r_wait],
  [:low,  :wait,     :low,  1,       r_wait],
  [:low,  :recharge, :high, 1,       0],
  [:low,  :recharge, :low,  0,       0]]

solver = FiniteMDP::Solver.new(model, 0.95) # discount factor 0.95
solver.policy_iteration 1e-4
solver.policy #=> {:high=>:search, :low=>:recharge}

Example 2: Grid Worlds

A more complicated example: the grid world from <cite>Russel and Norvig (2003). Artificial Intelligence: A Modern Approach</cite>, Chapter 17.

Here we describe the model as a class that implements the FiniteMDP::Model interface. The model contains terminal states, which we represent with a special absorbing state with zero reward, called :stop.

require 'finite_mdp'

class AIMAGridModel
  include FiniteMDP::Model

  #
  # @param [Array<Array<Float, nil>>] grid rewards at each point, or nil if a
  #        grid square is an obstacle
  #
  # @param [Array<[i, j]>] terminii coordinates of the terminal states
  #
  def initialize grid, terminii
    @grid, @terminii = grid, terminii
  end

  attr_reader :grid, :terminii

  # every position on the grid is a state, except for obstacles, which are
  # indicated by a nil in the grid
  def states
    is, js = (0...grid.size).to_a, (0...grid.first.size).to_a
    is.product(js).select {|i, j| grid[i][j]} + [:stop]
  end

  # can move north, east, south or west on the grid
  MOVES = {
    '^' => [-1,  0], 
    '>' => [ 0,  1], 
    'v' => [ 1,  0], 
    '<' => [ 0, -1]} 

  # agent can move north, south, east or west (unless it's in the :stop
  # state); if it tries to move off the grid or into an obstacle, it stays
  # where it is
  def actions state
    if state == :stop || terminii.member?(state)
      [:stop]
    else
      MOVES.keys
    end
  end

  # define the transition model
  def transition_probability state, action, next_state
    if state == :stop || terminii.member?(state)
      (action == :stop && next_state == :stop) ? 1 : 0
    else
      # agent usually succeeds in moving forward, but sometimes it ends up
      # moving left or right
      move = case action
             when '^' then [['^', 0.8], ['<', 0.1], ['>', 0.1]]
             when '>' then [['>', 0.8], ['^', 0.1], ['v', 0.1]]
             when 'v' then [['v', 0.8], ['<', 0.1], ['>', 0.1]]
             when '<' then [['<', 0.8], ['^', 0.1], ['v', 0.1]]
             end
      move.map {|m, pr|
        m_state = [state[0] + MOVES[m][0], state[1] + MOVES[m][1]]
        m_state = state unless states.member?(m_state) # stay in bounds
        pr if m_state == next_state
      }.compact.inject(:+) || 0
    end
  end

  # reward is given by the grid cells; zero reward for the :stop state
  def reward state, action, next_state
    state == :stop ? 0 : grid[state[0]][state[1]]
  end

  # helper for functions below
  def hash_to_grid hash
    0.upto(grid.size-1).map{|i| 0.upto(grid[i].size-1).map{|j| hash[[i,j]]}}
  end

  # print the values in a grid
  def pretty_value value
    hash_to_grid(Hash[value.map {|s, v| [s, "%+.3f" % v]}]).map{|row|
      row.map{|cell| cell || '      '}.join(' ')}
  end

  # print the policy using ASCII arrows
  def pretty_policy policy
    hash_to_grid(policy).map{|row| row.map{|cell|
      (cell.nil? || cell == :stop) ? ' ' : cell}.join(' ')}
  end
end

# the grid from Figures 17.1, 17.2(a) and 17.3
model = AIMAGridModel.new(
  [[-0.04, -0.04, -0.04,    +1],
   [-0.04,   nil, -0.04,    -1],
   [-0.04, -0.04, -0.04, -0.04]],
   [[0, 3], [1, 3]]) # terminals (the +1 and -1 states)

# sanity check: successor state probabilities must sum to 1
model.check_transition_probabilities_sum

solver = FiniteMDP::Solver.new(model, 1) # discount factor 1
solver.value_iteration(1e-5, 100) #=> true if converged

puts model.pretty_policy(solver.policy)
# output: (matches Figure 17.2(a))
# > > >  
# ^   ^  
# ^ < < <

puts model.pretty_value(solver.value)
# output: (matches Figure 17.3)
#  0.812  0.868  0.918  1.000
#  0.762         0.660 -1.000
#  0.705  0.655  0.611  0.388

FiniteMDP::TableModel.from_model(model)
#=> [[0, 0], "v", [0, 0], 0.1, -0.04]
#   [[0, 0], "v", [0, 1], 0.1, -0.04]
#   [[0, 0], "v", [1, 0], 0.8, -0.04]
#   [[0, 0], "<", [0, 0], 0.9, -0.04]
#   [[0, 0], "<", [1, 0], 0.1, -0.04]
#   [[0, 0], ">", [0, 0], 0.1, -0.04]
#   [[0, 0], ">", [0, 1], 0.8, -0.04]
#   [[0, 0], ">", [1, 0], 0.1, -0.04]
#   ...
#   [:stop, :stop, :stop, 1, 0]

Note that python code for this model is also available from the book's authors at aima.cs.berkeley.edu/python/mdp.html

REQUIREMENTS

Tested on

  • ruby 1.8.7 (2010-06-23 patchlevel 299) [i686-linux]

  • ruby 1.9.2p0 (2010-08-18 revision 29036) [i686-linux]

  • ruby 1.9.2p180 (2011-02-18 revision 30909) [x86_64-linux]

INSTALLATION

gem install finite_mdp

LICENSE

(The MIT License)

Copyright © 2011 John Lees-Miller

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the 'Software'), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.