Rawk

An awk-inspired ruby DSL

Last week, after years of ignoring awk, I ran into a shell script problem where it was the only viable solution (we didn’t have ruby on the server) and I was forced to learn a bit more about it.

Once, I had awk figured out, I thought it would be fun to write an awk DSL using ruby. It’s turned out to be quite an interesting little project for the daily train ride to work and back.

Obviously, you can use ruby -e and other magic to execute snippets of ruby, but I like the way awk provides a bit more structure and a richer environment for more complex command line mangling.

Install

From the command line

gem install rawk

Using bundler

gem "rawk", "~> 0.1.2"

Example

A simple awk program

$ ls -ltr | awk '
  BEGIN {print "Starting..."}
  {print $9, $1}
  END {print "done"} '

Creates the following output

Starting...
 total
spec drwxr-xr-x
lib drwxr-xr-x
bin drwxr-xr-x
README -rw-r--r--
done

This can be written using rawk as

$ ls -ltr | rawk '
  start {puts "Starting..."}
  every {|record| puts "#{record.cols[8]} #{record.cols[0]}"}
  finish {puts "done"} '

And it also creates the same output

Starting...
 total
spec drwxr-xr-x
lib drwxr-xr-x
bin drwxr-xr-x
README -rw-r--r--
done

Notice that the structure and semantics of an awk program is preserved and you use normal ruby code to process the input stream. I’ve had to bend the knee to the ruby interpreter and change the syntax slightly but I think it actually makes rawk programs a bit clearer than awk.

Details descriptions are shown below. I’m assuming you have a working knowledge of awk. Wikipedia provides an easy primer if you need to brush up.

Conditions and blocks

rawk provides 3 built-in conditions.

start {<code>}

Runs before any lines are read from the input stream. Equivalent to a BEGIN condition in awk

every {|record| <code>}

Runs once for each line of input data. Yields an object of type Record (see below) Equivalent to an anonymous block such as awk ‘$1’

finish {<code>}

Runs after the end of the input stream Equivalent to an END condition in awk

You can provide multiple blocks of code for each condition.

ls -ltr | head -2 | rawk '
  every {|record| puts 1}
  every {|record| puts 2} '

prints

1
2
1
2

Not supported (yet)

  • Conditional blocks

Records

every yields an object of type Record which is subclass of String that adds a cols method to access columns. The cols method returns an array of column values.

echo "hello world" | rawk 'every do |record| 
  puts "#{record.cols.length} columns: #{record.cols.join(",")}"
end'

-> 2 columns: hello,world

Note that cols is aliased to c for convenience

echo "hello world" | rawk 'every do |record| 
  puts record.c[0]
end'

-> hello

In most cases you will be dealing with a few columns of data so Record provides functions that allow you to access columns the first 10 columns directly by position name.

echo hello world from me | rawk 'every {|r| puts "#{r.first} #{r.third}"}'

-> hello from

Functions, classes and other ruby stuff

You can use ruby as normal. For example…

Functions

echo hello world | rawk '
  def print_first_column(record)
    puts record.cols.first
  end
  every {|record| print_first_column(record)}'

Classes

echo hello world | rawk '
  class Printer
    def self.print_first(record)
      puts record.cols.first
    end
  end
  every {|record| Printer.print_first(record)} '

Requires and gems

require works as you would expect although rubygems is not required by default.

echo "ruby" | rawk '
  require "rubygems"
  require "active_support/all"
  every {|record| puts record.cols.first.pluralize} '

-> rubies

Variables and Scope

Variables defined inside the condition blocks (start, every, finish) are local to that block. Create a member variables to share state between blocks.

ls | tail -2 | rawk '
  start do
    local = "foo"
    @shared = "bar"
    puts "Starting with #{local}"
  end
  every {|record| puts "Running with #{@shared}"} '

-> Starting with foo
   Running with bar
   Running with bar

Builtins

rawk provides builtins as member variables. You can change them as you see fit.

@nr holds the current record number

ls -ltr | head -2 | rawk 'every {puts @nr}'

@fs specifies the field separator applied to each record

echo "foo.bar" | rawk ' 
  start {@fs="."}
  every {|record| puts "1: #{record.cols[0]} 2: #{record.cols[1]}"} '

-> 1: foo 2: bar

@rs specifies the record separator“ character

* Defaults to newline
* Note that, unlike awk, @rs can only be set in the start block.  It cannot be changed "in flight"

ksh print -n "foo.bar." | bin/rawk '
  start {@rs = "."}
  every {|r| puts r.cols.first} '

-> foo
   bar

NF: Keeps a count of the number of fields in an input record. The last field in the input record can be designated by $NF.

  • Each Record yielded by the every block has a ‘.nf’ method

  • $NF can be coded as ‘every {|record| record.cols.last}’

echo "foo bar" | rawk 'every {|record| puts "#{record.nf} fields"}'

-> 2 fields

Not supported (yet)

I’m working on support for the following awk built-ins

FILENAME: Contains the name of the current input-file.

  • Reading input data is not supported yet

  • When I add it, I’ll add @filename as a member

Redundant

The following awk built-ins are redundant in ruby

OFS: Stores the “output field separator”, which separates the fields when Awk prints them. The default is a “space” character.

  • Ruby’s string handling is far superior to awk’s so there is no point in implementing a print routine

ORS: Stores the “output record separator”, which separates the output records when Awk prints them. The default is a “newline” character.

  • You already have complete control of the output stream. If you don’t want newlines, use print or printf instead of puts

OFMT: Stores the format for numeric output. The default format is “%.6g”.

  • Ruby’s string and number handing gives you much better control over this sort of thing

Using rawk inside a ruby program

Rawk code is evaluated within an instance of Rawk::Program. You can use rawk within your programs as follows…

require 'rubygems'
require 'rawk'

data = "foo\nbar"
program = Rawk::Program.new(data)

program.run do
  every {|record| puts record.cols.first}
end