Recmon

Recmon is a host-based system monitor. It complements REC, the Ruby Event Correlator.

Installation

$ sudo gem install recmon

Usage

Require the recmon gem:

require 'rubygems'
require 'recmon'

Create a monitor to obtain readings from the sensors at the right time:

s = Recmon::Monitor.new()

Now add sensors for each aspect you want to monitor:

s.web("Website", "http://www.example.com/index.html")
s.diskspace("Database", "/var/postgres/data/")
s.filesize("Messages", "/var/log/messages")
s.ping("earth", "206.125.172.58")

Then start the monitor running:

s.start()

and it will periodically write entries to the log file (default = /var/log/recmon.log):

2012-09-13T16:08:59+10:00 Recmon is monitoring.
...
2012-09-13T16:08:59+10:00 site=Website status=down
2012-09-13T16:08:59+10:00 Database usage=247 MB
2012-09-13T16:08:59+10:00 Messages filesize=83 KB
2012-09-13T16:08:59+10:00 ping host=earth status=up
...
2012-09-13T16:09:16+10:00 Recmon is exiting.

Sensors

There are several sensors:

  • web: ensure a website is responding

  • diskfree: track disk free space

  • diskspace: track the diskspace used by a folder

  • filesize: monitor the size of a file

  • ping: check if a server is alive

  • proc: look for a named process

  • ssh: ensure the SSH service is running

  • command: run an arbitrary command and report success

Common characteristics

Every sensor has a name that distinguishes it from other sensors of the same type. For example, you can monitor the earth server and the terra server. The name is used in composing the log entry.

All sensors have a sane default frequency, so it is not necessary to specify a frequency unless you want to override the default.

Sensors compose log messages that are designed for further processing. Although they are readable, it is more important that they are parsable, so they always start with an ISO 8601 date and time, followed by several name=value pairs.
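As an illustration of why this format is easy to process, here is a minimal sketch of a parser for one log line. The helper name parse_recmon_line is hypothetical and not part of Recmon. It assumes values contain no spaces, so entries such as "usage=247 MB" or sensor names with spaces would need extra handling.

```ruby
require 'time'

# Hypothetical helper (not part of Recmon): split one log line into its
# ISO 8601 timestamp and its name=value pairs.
# Assumption: values contain no whitespace.
def parse_recmon_line(line)
  stamp, rest = line.strip.split(' ', 2)
  fields = rest.to_s.scan(/(\w+)=(\S+)/).to_h
  { time: Time.iso8601(stamp), fields: fields }
end

entry = parse_recmon_line("2012-09-13T16:08:59+10:00 site=Website status=down")
puts entry[:fields]["status"]   # prints "down"
```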

web: ensure a website is responding

To periodically check if a website is reachable, add a WebSensor, specifying the name, the URL, and optionally a frequency in seconds (default = 60).

s.web(name, url, freq=60)
s.web("Main", "http://www.finalstep.com.au/heartbeat.png")
  # log entry => "2012-09-13T16:08:59+10:00 site=Main status=down"

s.web("Google", "http://www.google.com/jsapi", 120)
  # log entry => "2012-09-13T16:09:02+10:00 site=Google status=up"
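For readers curious about what such a check can involve, the following is a rough sketch using Ruby's standard Net::HTTP. It is not Recmon's implementation. The helper name site_status, the 5-second timeout, and the rule that only a 2xx response counts as "up" are all assumptions made for illustration.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper (not part of Recmon): fetch the URL once and map the
# outcome to "up" or "down".
# Assumption: any 2xx response means "up"; everything else, including
# timeouts, refused connections and malformed URLs, means "down".
def site_status(url, timeout = 5)
  uri = URI.parse(url)
  response = Net::HTTP.start(uri.host, uri.port,
                             use_ssl: uri.scheme == "https",
                             open_timeout: timeout,
                             read_timeout: timeout) do |http|
    http.request_get(uri.request_uri)
  end
  response.is_a?(Net::HTTPSuccess) ? "up" : "down"
rescue StandardError
  "down"
end
```

Calling site_status("http://www.example.com/index.html") would then return the word that appears after status= in the log entry.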

diskfree: track disk free space

A DiskfreeSensor tracks how much disk space is available on a mounted disk partition. The name is the mount point of that partition:

s.diskfree(name, freq=1200)
s.diskfree("/var")
  # log entry => "2012-09-13T16:09:02+10:00 mountpoint=/var available=4576 MB"

diskspace: track the diskspace used by a folder

A DiskspaceSensor tracks how much disk space is being consumed by a set of files (log files or database files) in a folder:

s.diskspace(name, folderPath, freq=1200)
s.diskspace("Database", "/var/postgres/data/")
  # log entry => "2012-09-13T16:09:02+10:00 Database usage=45 MB"
s.diskspace("Logs", "/var/log/", 86400)     # daily frequency
  # log entry => "2012-09-13T20:00:00+10:00 Logs usage=16 MB"
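To make the measurement concrete, here is a small sketch that totals the sizes of the regular files under a folder and composes an entry in the format shown above. The helper names are hypothetical, and Recmon may measure usage differently (for example by disk blocks rather than file sizes). Rounding down to whole megabytes is also an assumption.

```ruby
require 'find'
require 'time'

# Hypothetical helper (not part of Recmon): sum the sizes, in bytes, of all
# regular files below the folder, skipping symbolic links.
def folder_usage_bytes(folder)
  total = 0
  Find.find(folder) do |path|
    total += File.size(path) if File.file?(path) && !File.symlink?(path)
  end
  total
end

# Compose a log entry; whole megabytes, rounded down (an assumption).
def usage_entry(name, folder)
  megabytes = folder_usage_bytes(folder) / (1024 * 1024)
  "#{Time.now.iso8601} #{name} usage=#{megabytes} MB"
end
```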

filesize: monitor the size of a file

A FilesizeSensor tracks the size of a given file at a regular interval (default = 1200 seconds, i.e. every 20 minutes):

s.filesize(name, path, freq=1200)
s.filesize("Messages", "/var/log/messages")
  # log entry => "2012-09-13T16:09:02+10:00 Messages filesize=78045 bytes"

ping: check if a server is alive

The PingSensor pings a server to ensure it is accessible through the network:

s.ping(name, ipaddr, freq=300)
s.ping("earth", "206.125.172.58")
  # log entry => "2012-09-13T16:09:02+10:00 ping host=earth status=up"

proc: look for a named process

It is often useful to directly check that a process is still running. The ProcSensor finds processes that match a pattern.

s.proc(name, pattern, freq=60)
s.proc("Webserver", "httpd")
  # log entry => "2012-09-13T16:09:02+10:00 proc=Webserver status=running"
  # log entry => "2012-09-13T16:10:30+10:00 proc=Webserver status=stopped"
s.proc("Postgres server", "postgres: writer process")
  # log entry => "2012-09-13T16:10:30+10:00 proc=Postgres server status=running"
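The matching idea can be sketched in a few lines of Ruby. This is an illustration only, not how Recmon is implemented: the helper name proc_status is hypothetical, plain substring matching is an assumption, and the default listing comes from the POSIX command "ps -e -o args=".

```ruby
# Hypothetical helper (not part of Recmon): report "running" if any command
# line in the process listing contains the pattern, otherwise "stopped".
# The listing is a parameter so the check can be exercised with canned data;
# by default it is taken from ps at call time.
def proc_status(pattern, listing = `ps -e -o args=`)
  matched = listing.each_line.any? { |line| line.include?(pattern) }
  matched ? "running" : "stopped"
end

sample = "/usr/sbin/httpd -k start\npostgres: writer process\n"
puts proc_status("httpd", sample)    # prints "running"
puts proc_status("mysqld", sample)   # prints "stopped"
```

When the live listing is used, a pattern that also appears in the monitoring script's own command line would match that script, so patterns should be chosen with care.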

ssh: ensure the SSH service is running

Check that a server is accepting SSH connections. The SSHSensor actually runs a harmless command (pwd) on the target server to confirm SSH is working. To make this work in batch mode, the recmon user needs a key pair: the private key in ~/.ssh/ on the monitoring host, and the matching public key added to ~/.ssh/authorized_keys on the target server.

s.ssh(hostname, user="_recmon", port=22, freq=300)
s.ssh("earth")
  # log entry => "2012-09-13T16:09:02+10:00 SSH at host=earth for user=_recmon status=up"
s.ssh("moon", "richard", 922, 600)
  # log entry => "2012-09-13T16:09:02+10:00 SSH at host=moon for user=richard status=up"

command: run an arbitrary command and report success

The command sensor simply executes the command and reports success (exit code = 0) or failure.

s.command("MySQL server", "/usr/local/bin/mysqladmin ping")
  # log entry => "2012-09-13T16:09:02+10:00 MySQL server status=up"
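The exit-code rule can be sketched with Kernel#system, which returns true only when the command exits with code 0. The helper names below are hypothetical and this is not Recmon's own code. The word "down" for failure is an assumption, since the example above only shows status=up.

```ruby
require 'time'

# Hypothetical helper (not part of Recmon): run the command with its output
# discarded. system returns true for exit code 0, false for a non-zero exit
# code, and nil if the command could not be started.
def command_status(cmd)
  system(cmd, out: File::NULL, err: File::NULL) ? "up" : "down"
end

# Compose a log entry in the same shape as the example above.
def command_entry(name, cmd)
  "#{Time.now.iso8601} #{name} status=#{command_status(cmd)}"
end
```

Because any executable can be used, this sensor is a convenient escape hatch for checks the other sensors do not cover, provided the command signals failure through its exit code.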

Why Recmon?

1. Recmon is lightweight

There are several wonderful system monitoring tools (Nagios, Splunk), but they require a considerable investment in configuration, not to mention dependencies that may be undesired. If you just want to keep an eye on a few key metrics, then Recmon is much faster and easier to set up.

2. Recmon complements REC

Recmon generates log entries for a range of events that are not typically logged. Once they are logged, they can be analysed to generate alerts.

REC (Ruby Event Correlation) is the tool that correlates events across time to determine if a situation is abnormal, and so generates fewer, more meaningful alerts by email or instant message.

So Recmon + REC = lightweight Nagios