Ruby Event Correlation

Correlates events in order to generate a smaller set of more meaningful events.

Installation

  • Install the gem

    $ sudo gem install rec
    
  • Select a ruleset or create your own

    require 'rec'
    include REC
    require 'rulesets/postfix-rules'
    Correlator::start()
    
  • Start it up

    $ rulesets/rules.rb < /var/log/mail.log 3>missed.log 2>control.log > newevents.log
    
  • Use mplex.rb to merge event streams if desired

    $ cd /var/log
    $ mplex -i messages -i deamon -i authlog -o node4 -o node5
    

This would take input lines from each input and write out to two log files, named node4 and node5 (which may be used by other REC processes). See the file.MULTIPLEX.html for more details.

Why correlate events?

We all know that we should read our log files. But reading log files is really boring, and frankly its easy to miss important things in all the superfluous detail.

Save time

If you are lazy enough to not want to review all of your log files manually forever, and smart enough to work out what needs monitoring and when you might want to pay attention, then wouldn't it be good if you could define those rules and let the computer do what it does best?

Aside

Like most computer techie people, I'll happily spend 6 hours trying to figure out how to do a 3 hour job in 10 minutes. – Rev. James Cort, ASR

Generate meaning

The logs of many applications are filled with entries that are quite low level - perhaps wonderful for debugging, but typically not terribly meaningful in terms of business. Wouldn't it be good if we could summarise a bunch of low level events into a single business event - and then just read the business log.

Alternatives

There are several alternatives to REC which may suit your needs better:

While I like these options, I find they take a lot of configuring. They also have some dependencies that make them a bit heavier than you may want. If you just want to keep track of a few kinds of events, want a lot of flexibility and control without too much effort, then REC may be of some value.

How does REC work?

Each entry in a log file is an event. The Correlator reads the events, and attempts to match an event against each Rule. If an event matches a rule, the rule creates a State which just means we're remembering that the event matched a rule. The pattern to match is a regexp, and the captured values are named. For example

# log entry => "nfs: server earth not responding"
pattern => /nfs\: server (\w+) not responding/
details => ['host']
# values of interest are captured into a hash => {'host' => 'earth' }
:message => "Server %host$s is down"
# interpolation with named parameters => "Server earth is down"

A state has a fixed lifetime, set when it is created. At the end of its life, it may simply expire quietly, or a pre-defined action may be executed. For example, if we find a server is down, we may want to wait for 3 minutes and if it is not up again, then alert the administrator. The server being down is a state, and two states are distinguished by their titles. For example, “host earth is down” and “host terra is down”.

Now that we're remembering a set of states, we can match events against not only the event's message, but also other states. For example, we can match “host terra is up” against a previously created state “host terra is down”, and generate a new event “host terra is back up after 14 minutes”. We can also 'swallow' all of the rest of the “host terra is down” events because they add nothing new. This event compression means the administrator gets one important message, and not 27 distracting alerts.

A notification can be sent by email or instant message, depending on your preferences and working hours. The destinations and credentials are supplied to your ruleset:

# For better security, move the next few lines into a file readable only by
# the user running this script eg. /home/rec/alert.conf
# and then require that file
Notify.smtp_credentials("[email protected]", "recret", "myfirm.com")
Notify.emailTo = "[email protected]"
Notify.jabber_credentials("[email protected]", "recret")
Notify.jabberTo = "[email protected]"

Rules can then send an alert when desired. Two common cases involve alerting immediately on the first event (eg. “host terra is down”), and alerting on expiry or at a subsequent event (eg. “host terra is back up”).

state.alert_first_only()                # => generate a new event on first original event
                                        # or
Notify.normal(state.alert_first_only()) # => log and also send the new event via email

In most cases, however, it is not necessary to alert the administrator at all. It is enough to log the new event in the output logfile for later review.

Anatomy of a Rule

Warn if an user is having trouble executing sudo commands The log entry (/var/log/secure) looks like this:

Sep 16 07:09:22 earth sudo:  richard : 3 incorrect password attempts ;...

and the rule might look like this:

# single threshold rule
Rule.new(10034, {
  :pattern => /\w+ sudo\:  (\w+) \: 3 incorrect password attempts/,
  :details => ["userid"],
  :message => "Failed sudo password for user %userid$s",
  :lifespan => 60,
  :alert => "'Too much sudo activity' userid=%userid$s attempts=%count$d dur=%dur$0.3fs ",
  :threshold => 3,
  :capture => true
}) { |state|
if state.count == state.threshold
  Notify.urgent(state.generate_alert())
  state.release()
end
}

Let's look at each part:

rule ID

Each rule must have a unique integer ID (rid). It is the first argument and is mandatory. Its probably a good idea to 'reserve' a number range for a ruleset to keep them separate from other rules (eg. 17801-17899 for Postfix-related rules).

The second argument is a hash of options:

pattern

The pattern is a regexp designed to match certain log messages. A message is what's left of a log entry after we have removed the timestamp and any priority level. For example:

[Thu Aug 16 16:11:21 2012] [error] ap_proxy_connect_backend disabling worker for (127.0.0.1)
# timestamp parsed => 2012-08-16T16:11:21+10:00
# priority ignored => "error"
# message => "ap_proxy_connect_backend disabling worker for (127.0.0.1)"

details

The pattern may contain regexp 'captures' (eg. (d+.d+.d+.d+) to capture the ip). For each capture a name should be specified in the details array. The sequence of captures is as specified for ruby Regexps.

:pattern => /\w+ sudo\:  (\w+) \: (\d) incorrect password attempts/,
:details => ["userid", "failures"],

The names chosen for captured values are used as keys to store the values in the same hash that stores the parameters, so do not choose the following keywords: pattern, details, message, threshold, lifespan, alert, capture, continue, or action.

message

The message is a string template into which the captured values are interpolated to produce a unique key for a state.

:details => ["userid"],
:message => "Failed sudo password for user %userid$s",
# userid = "richard" => "Failed sudo password for user richard"

Note the modified sprintf syntax: the value of userid is inserted into the message as a string by the String::sprinth method. This becomes the title and key for the state created by this rule.

lifespan

When a rule creates a state, we need to know how long to remember the state for, and when to expire it. The lifespan specifies that duration in seconds.

It is also possible to extend the life of a state should other events take place (with State::live_another) in the same way that a web session may be extended for another 10 minutes longer at each request.

alert

This is a string template used to generate an output log message (the timestamp will be prefixed automatically to complete the log entry).

:alert => "'Too much sudo activity' userid=%userid$s attempts=%count$d dur=%dur$0.3fs ",

By convention, we make our log messages very easy to parse by creating name=value pairs, and single-quoting strings containing spaces, in case the output will be processed further.

If no alert is provided, it will default to message.

capture

The capture parameters tells REC to store the original log entries in the state (in the logs attribute). You could in this way for example extract a transcript of each web session from a noisy access log, and output them as each session finishes or expires.

:capture => true

threshold

This parameter is used in the action.

:threshold => 3

allstates

An array of templates used to determine if matching states exist. All the mentioned states must be found or the rule will not take any action.

anystates

An array of templates used to determine if matching states exist. If any one of the mentioned states exist, then the rule will execute its action.

notstates

An array of templates used to determine if matching states exist. If any one of the mentioned states does exist, then the rule will not execute its action.

Any arbitrary parameter

Any arbitrary parameters may be added to the rule, and they are passed on to the state in the params hash.

action

The action is a block with a single argument which is the state created by the rule. The count of matched events is maintained automatically. In this case, when we have seen 3 events, we generate an output log entry and also send it by IM, then release the state (forget about it).

Rule.new(10045, {
  #...
  :threshold => 3
 }) { |state|
if state.count == state.threshold
  Notify.urgent(state.generate_alert())
  state.release()
end
}

By the magic of Ruby's #method_missing method (Yes, I'm looking at you Java!) we can refer to any parameter succinctly instead of a cumbersome hash notation, so:

state.threshold === state.params['threshold']

For more examples, see the file.EXAMPLES.html page. For an example of how to execute a ruleset, see file.RUNNING page.