Rawk
An awk-inspired ruby DSL
Last week, after years of ignoring awk, I ran into a shell script problem where it was the only viable solution (we didn’t have ruby on the server) and I was forced to learn a bit more about it.
Once I had awk figured out, I thought it would be fun to write an awk DSL using ruby. It’s turned out to be quite an interesting little project for the daily train ride to work and back.
Obviously, you can use ruby -e and other magic to execute snippets of ruby, but I like the way awk provides a bit more structure and a richer environment for more complex command line mangling.
Install
From the command line
gem install rawk
Using bundler
gem "rawk", "~> 0.1.2"
Example
A simple awk program
$ ls -ltr | awk '
BEGIN {print "Starting..."}
{print $9, $1}
END {print "done"} '
Creates the following output
Starting...
total
spec drwxr-xr-x
lib drwxr-xr-x
bin drwxr-xr-x
README -rw-r--r--
done
This can be written using rawk as
$ ls -ltr | rawk '
start {puts "Starting..."}
every {|record| puts "#{record.cols[8]} #{record.cols[0]}"}
finish {puts "done"} '
And it creates the same output
Starting...
total
spec drwxr-xr-x
lib drwxr-xr-x
bin drwxr-xr-x
README -rw-r--r--
done
Notice that the structure and semantics of an awk program are preserved and you use normal ruby code to process the input stream. I’ve had to bend the knee to the ruby interpreter and change the syntax slightly, but I think it actually makes rawk programs a bit clearer than awk.
Detailed descriptions are shown below. I’m assuming you have a working knowledge of awk. Wikipedia provides an easy primer if you need to brush up.
Conditions and blocks
rawk provides 3 built-in conditions.
start {<code>}
Runs before any lines are read from the input stream. Equivalent to a BEGIN condition in awk.
every {|record| <code>}
Runs once for each line of input data. Yields an object of type Record (see below). Equivalent to an anonymous (pattern-less) block in awk, such as awk '{print $1}'.
finish {<code>}
Runs after the end of the input stream. Equivalent to an END condition in awk.
You can provide multiple blocks of code for each condition.
ls -ltr | head -2 | rawk '
every {|record| puts 1}
every {|record| puts 2} '
prints
1
2
1
2
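To see why several blocks per condition run in turn for every line, here is a minimal plain-Ruby sketch of a start/every/finish DSL. It is not rawk's actual source: the class name SketchProgram and its use of instance_eval and instance_exec are assumptions made purely for illustration.

```ruby
# Hypothetical sketch, not rawk's real implementation.
class SketchProgram
  def initialize(input)
    @input = input
    @start_blocks = []
    @every_blocks = []
    @finish_blocks = []
  end

  # Each DSL word only remembers its block; nothing runs yet.
  def start(&block)
    @start_blocks << block
  end

  def every(&block)
    @every_blocks << block
  end

  def finish(&block)
    @finish_blocks << block
  end

  def run(&program)
    instance_eval(&program) # registers the blocks in source order
    @start_blocks.each { |b| instance_exec(&b) }
    @input.each_line do |line|
      record = line.chomp
      # Every registered block sees every line, in registration order.
      @every_blocks.each { |b| instance_exec(record, &b) }
    end
    @finish_blocks.each { |b| instance_exec(&b) }
  end
end

output = []
SketchProgram.new("a\nb\n").run do
  every { |record| output << "1 #{record}" }
  every { |record| output << "2 #{record}" }
end
puts output
```

With two lines of input, the two every blocks alternate (1, 2, 1, 2), which matches the rawk output shown above.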
Not supported (yet)
- Conditional blocks
Records
every yields an object of type Record, which is a subclass of String that adds a cols method to access columns. The cols method returns an array of column values.
echo "hello world" | rawk 'every do |record|
puts "#{record.cols.length} columns: #{record.cols.join(",")}"
end'
-> 2 columns: hello,world
Note that cols is aliased to c for convenience
echo "hello world" | rawk 'every do |record|
puts record.c[0]
end'
-> hello
In most cases you will be dealing with a few columns of data, so Record provides methods that let you access the first 10 columns directly by positional name (for example, first and third).
echo hello world from me | rawk 'every {|r| puts "#{r.first} #{r.third}"}'
-> hello from
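The Record behaviour described above can be approximated in a few lines of plain Ruby. This is only a sketch under assumptions: SketchRecord is a hypothetical class, the names second and fourth through tenth are guessed from the first/third pattern, and rawk's real Record may be implemented differently.

```ruby
# Hypothetical sketch of a Record-like class; not rawk's real Record.
class SketchRecord < String
  # Assumed positional names; only first and third appear in the README.
  POSITION_NAMES = %w[first second third fourth fifth
                      sixth seventh eighth ninth tenth].freeze

  def initialize(line, field_separator = " ")
    super(line)
    @field_separator = field_separator
  end

  # Splitting on a single space collapses runs of whitespace, like awk.
  def cols
    split(@field_separator)
  end
  alias c cols

  # Number of fields, comparable to awk's NF.
  def nf
    cols.length
  end

  # Define first, second, ... tenth as shortcuts for cols[0..9].
  POSITION_NAMES.each_with_index do |name, index|
    define_method(name) { cols[index] }
  end
end

record = SketchRecord.new("hello world from me")
puts "#{record.first} #{record.third}" # hello from
puts record.c[0]
puts record.nf
```

Because the sketch subclasses String, the record can still be used anywhere a plain line of text is expected.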
Functions, classes and other ruby stuff
You can use ruby as normal. For example…
Functions
echo hello world | rawk '
def print_first_column(record)
puts record.cols.first
end
every {|record| print_first_column(record)}'
Classes
echo hello world | rawk '
class Printer
def self.print_first(record)
puts record.cols.first
end
end
every {|record| Printer.print_first(record)} '
Requires and gems
require works as you would expect, although rubygems is not required by default.
echo "ruby" | rawk '
require "rubygems"
require "active_support/all"
every {|record| puts record.cols.first.pluralize} '
-> rubies
Variables and Scope
Variables defined inside the condition blocks (start, every, finish) are local to that block. Create member variables (ruby instance variables such as @shared) to share state between blocks.
ls | tail -2 | rawk '
start do
local = "foo"
@shared = "bar"
puts "Starting with #{local}"
end
every {|record| puts "Running with #{@shared}"} '
-> Starting with foo
Running with bar
Running with bar
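The scoping rule follows from ordinary Ruby behaviour when separate blocks are evaluated against the same object. The sketch below needs no rawk; the context object is a stand-in, on the assumption that rawk runs all blocks against one program instance.

```ruby
# Plain Ruby; `context` stands in for the object the blocks run against
# (an assumption for illustration, not taken from rawk's source).
context = Object.new

start_block = proc do
  local = "foo"     # exists only inside this block
  @shared = "bar"   # stored on `context`, so later blocks can read it
  "Starting with #{local}"
end

every_block = proc do
  # `local` from the other block is not defined here.
  local_visible = !defined?(local).nil?
  "Running with #{@shared} (local visible: #{local_visible})"
end

start_message = context.instance_exec(&start_block)
every_message = context.instance_exec(&every_block)
puts start_message
puts every_message
```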
Builtins
rawk provides builtins as member variables. You can change them as you see fit.
@nr holds the current record number
ls -ltr | head -2 | rawk 'every {puts @nr}'
@fs specifies the field separator applied to each record
echo "foo.bar" | rawk '
start {@fs="."}
every {|record| puts "1: #{record.cols[0]} 2: #{record.cols[1]}"} '
-> 1: foo 2: bar
@rs specifies the record separator character
* Defaults to newline
* Note that, unlike awk, @rs can only be set in the start block. It cannot be changed "in flight"
print -n "foo.bar." | bin/rawk '
start {@rs = "."}
every {|r| puts r.cols.first} '
-> foo
bar
NF: Keeps a count of the number of fields in an input record. The last field in the input record can be designated by $NF.
- Each Record yielded by the every block has a .nf method
- $NF can be coded as every {|record| record.cols.last}
echo "foo bar" | rawk 'every {|record| puts "#{record.nf} fields"}'
-> 2 fields
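For readers who want to see what the record separator, field separator, record number and field count amount to, here is a rough plain-Ruby equivalent. The local names rs, fs and nr are illustrative only, and the sample input is made up; this is not how rawk itself reads its input.

```ruby
# Plain Ruby approximation of awk-style record and field splitting.
input = "a,b;c,d,e;"
rs = ";" # record separator (compare @rs)
fs = "," # field separator (compare @fs)

lines = input.split(rs).each_with_index.map do |record, index|
  nr = index + 1            # record number (compare @nr)
  fields = record.split(fs) # columns (compare record.cols)
  # fields.length plays the role of NF, fields.last the role of $NF.
  "#{nr}: #{fields.length} fields, last=#{fields.last}"
end
puts lines
```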
Not supported (yet)
I’m working on support for the following awk built-ins
FILENAME: Contains the name of the current input-file.
- Reading input data is not supported yet
- When I add it, I’ll add @filename as a member
Redundant
The following awk built-ins are redundant in ruby
OFS: Stores the “output field separator”, which separates the fields when Awk prints them. The default is a “space” character.
- Ruby’s string handling is far superior to awk’s so there is no point in implementing a print routine
ORS: Stores the “output record separator”, which separates the output records when Awk prints them. The default is a “newline” character.
- You already have complete control of the output stream. If you don’t want newlines, use print or printf instead of puts
OFMT: Stores the format for numeric output. The default format is “%.6g”.
- Ruby’s string and number handling gives you much better control over this sort of thing
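As a rough illustration of why these built-ins are unnecessary, the sketch below covers each one with standard Ruby calls (Array#join, print and format). The sample values are made up.

```ruby
fields = %w[foo bar baz]

# OFS: join the fields with whatever separator you want.
joined = fields.join(",")

# ORS: print adds no newline, so you choose the record terminator yourself.
record_line = joined + ";"
print record_line, "\n"

# OFMT: format numbers explicitly; "%.6g" is awk's default numeric format.
formatted = format("%.6g", 3.14159265)
puts formatted
```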
Using rawk inside a ruby program
Rawk code is evaluated within an instance of Rawk::Program. You can use rawk within your programs as follows…
require 'rubygems'
require 'rawk'
data = "foo\nbar"
program = Rawk::Program.new(data)
program.run do
every {|record| puts record.cols.first}
end