Recmon
Recmon is a host-based system monitor. It complements REC, the Ruby Event Correlator.
Installation
$ sudo gem install recmon
Usage
Require the recmon gem:
require 'rubygems'
require 'recmon'
Create a monitor to obtain readings from the sensors at the right time:
s = Recmon::Monitor.new()
or customise the monitor with a different log file, process name, and frequency:
s = Recmon::Monitor.new("/var/log/my.log", "mymonitor", 25)
# appears in process list as => "ruby: mymonitor"
Now add sensors for each aspect you want to monitor:
s.web("Website", "http://www.example.com/index.html")
s.diskspace("Database", "/var/postgres/data/")
s.filesize("Messages", "/var/log/messages")
s.ping("earth", "206.125.172.58")
Then start the monitor running:
s.start()
and it will periodically write entries to the log file (default = /var/log/recmon.log):
2012-09-13T16:08:59+10:00 Recmon is monitoring.
...
2012-09-13T16:08:59+10:00 site=Website status=down
2012-09-13T16:08:59+10:00 Database usage=247 MB
2012-09-13T16:08:59+10:00 Messages filesize=83 KB
2012-09-13T16:08:59+10:00 ping host=earth status=up
...
2012-09-13T16:09:16+10:00 Recmon is exiting.
Sensors
There are several sensors:
- web: ensure a website is responding
- diskfree: track disk free space
- diskspace: track the diskspace used by a folder
- filesize: monitor the size of a file
- ping: check if a server is alive
- proc: look for a named process
- ssh: ensure the SSH service is running
- command: run an arbitrary command and report success
Common characteristics
Every sensor has a name, which distinguishes it from other sensors of the same type. For example, you can monitor both the earth server and the terra server. The name is used in composing the log entry.
All sensors have a sane default frequency, so it is not necessary to specify a frequency unless you want to override the default.
Sensors compose log messages that are designed for further processing. Although the messages are readable, it is more important that they are parsable, so they always start with an ISO 8601 date-time followed by several name=value pairs.
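Because each entry starts with an ISO 8601 timestamp followed by name=value pairs, it can be taken apart with a few lines of Ruby. The following is a minimal sketch and not part of Recmon: the helper name parse_recmon_line and the layout of the returned hash are assumptions made for illustration. It also assumes that a value runs until the next name= token, so entries with free text between pairs (such as the SSH entries) would need extra handling.

```ruby
require 'time'

# Hypothetical helper, not part of Recmon: splits a log line into its
# timestamp, any free text before the first pair, and the name=value pairs.
def parse_recmon_line(line)
  stamp, rest = line.strip.split(' ', 2)
  rest ||= ''
  fields = {}
  # A value runs up to the next "name=" token or the end of the line,
  # so values such as "247 MB" keep their embedded space.
  rest.scan(/(\w+)=(.*?)(?=\s+\w+=|\z)/) { |key, value| fields[key] = value }
  # Anything before the first pair (for example a sensor name) is the label.
  label = rest[/\A(.*?)(?=\s*\w+=|\z)/, 1].to_s.strip
  { time: Time.iso8601(stamp), label: label, fields: fields }
end

entry = parse_recmon_line('2012-09-13T16:08:59+10:00 site=Website status=down')
puts entry[:fields]['status']   # prints "down"
```

Returning a hash keeps the sketch small; a downstream tool could just as well map the pairs onto its own event objects.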
web: Ensure a website is responding
To periodically check if a website is reachable, add a WebSensor, specifying the name, the URL, and optionally a frequency (default = 60 seconds).
s.web(name, url, freq=60)
s.web("Main", "http://www.finalstep.com.au/heartbeat.png")
# log entry => "2012-09-13T16:08:59+10:00 site=Main status=down"
s.web("Google", "http://www.google.com/jsapi", 120)
# log entry => "2012-09-13T16:09:02+10:00 site=Google status=up"
diskfree: track disk free space
A DiskfreeSensor tracks how much disk space is available on a mounted disk partition:
s.diskfree(name, freq=1200)
s.diskfree("/var")
# log entry => "2012-09-13T16:09:02+10:00 mountpoint=/var available=4576 MB"
diskspace: track the diskspace used by a folder
A DiskspaceSensor tracks how much disk space is being consumed by a set of files (log files or database files) in a folder:
s.diskspace(name, folderPath, freq=1200)
s.diskspace("Database", "/var/postgres/data/")
# log entry => "2012-09-13T16:09:02+10:00 Database usage=45 MB"
s.diskspace("Logs", "/var/log/", 86400) # daily frequency
# log entry => "2012-09-13T20:00:00+10:00 Logs usage=16 MB"
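To illustrate what such a reading involves, the sketch below totals the sizes of all regular files under a folder and formats an entry like the ones above. It is not Recmon's implementation, which may measure usage differently (for example by calling du); the helper names folder_usage_bytes and usage_entry, and the rounding down to whole megabytes, are assumptions.

```ruby
require 'find'
require 'time'

# Hypothetical helper, not Recmon's code: total size in bytes of all
# regular files found under folder_path, including subfolders.
def folder_usage_bytes(folder_path)
  total = 0
  Find.find(folder_path) do |path|
    total += File.size(path) if File.file?(path)
  end
  total
end

# Hypothetical helper: formats a reading in the same shape as the
# DiskspaceSensor entries shown above (whole megabytes, rounded down).
def usage_entry(name, folder_path, now = Time.now)
  megabytes = folder_usage_bytes(folder_path) / (1024 * 1024)
  "#{now.iso8601} #{name} usage=#{megabytes} MB"
end
```

Calling usage_entry("Logs", "/var/log/") would then yield a line in the same format as the log entries above.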
filesize: monitor the size of a file
A FilesizeSensor tracks the size of a given file at a fixed interval (default = 1200 seconds, i.e. 20 minutes):
s.filesize(name, path, freq=1200)
s.filesize("Messages", "/var/log/messages")
# log entry => "2012-09-13T16:09:02+10:00 Messages filesize=78045 bytes"
ping: check if a server is alive
The PingSensor pings a server to ensure it is accessible through the network:
s.ping(name, ipaddr, freq=300)
s.ping("earth", "206.125.172.58")
# log entry => "2012-09-13T16:09:02+10:00 ping host=earth status=up"
proc: look for a named process
It is often useful to directly check that a process is still running. The ProcSensor finds processes that match a pattern.
s.proc(name, pattern, freq=60)
s.proc("Webserver", "httpd")
# log entry => "2012-09-13T16:09:02+10:00 proc=Webserver status=running"
# log entry => "2012-09-13T16:10:30+10:00 proc=Webserver status=stopped"
s.proc("Postgres server", "postgres: writer process")
# log entry => "2012-09-13T16:10:30+10:00 proc=Postgres server status=running"
ssh: ensure the SSH service is running
Check that a server is accepting SSH connections. The SSHSensor runs a harmless command (pwd) on the target server to confirm SSH is working. For this to work in batch mode, the recmon user needs a private key in ~/.ssh/ and the matching public key in ~/.ssh/authorized_keys on the target server.
s.ssh(hostname, user="_recmon", port=22, freq=300)
s.ssh("earth")
# log entry => "2012-09-13T16:09:02+10:00 SSH at host=earth for user=_recmon status=up"
s.ssh("moon", "richard", 922, 600)
# log entry => "2012-09-13T16:09:02+10:00 SSH at host=moon for user=richard status=up"
command: run an arbitrary command and report success
The command sensor simply executes the command and reports success (exit code = 0) or failure.
s.command("MySQL server", "/usr/local/bin/mysqladmin ping")
# log entry => "2012-09-13T16:09:02+10:00 MySQL server status=up"
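The exit-code rule can be sketched in plain Ruby. This is an illustration only, not Recmon's code: the helper name command_entry is hypothetical, and the use of status=down for a failing command is an assumption, since only the status=up entry is shown above.

```ruby
require 'time'

# Hypothetical helper, not Recmon's code: run a command and map its exit
# status to "up" or "down". The word "down" for failure is an assumption.
def command_entry(name, command, now = Time.now)
  # Kernel#system returns true only when the command exits with status 0;
  # it returns false for a non-zero exit and nil if the command cannot run.
  ok = system(command, out: File::NULL, err: File::NULL)
  status = ok ? 'up' : 'down'
  "#{now.iso8601} #{name} status=#{status}"
end

puts command_entry('Always OK', 'true')        # ends with "status=up"
puts command_entry('Always failing', 'false')  # ends with "status=down"
```

Any command that follows the usual Unix convention of exiting with 0 on success can be checked this way, which is why a tool such as mysqladmin ping fits the sensor well.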
Why Recmon?
1. Recmon is lightweight
There are several wonderful system monitoring tools (Nagios, Splunk), but they require a considerable investment in configuration, not to mention dependencies that may be undesired. If you just want to keep an eye on a few key metrics, Recmon is much faster and easier to set up.
2. Recmon complements REC
Recmon generates log entries for a range of events that are not typically logged. Once they are logged, they can be analysed to generate alerts.
REC (the Ruby Event Correlator) is the tool that correlates events across time to determine whether a situation is abnormal, and so generates fewer, more meaningful alerts by email or instant message.
So Recmon + REC = lightweight Nagios