Module: IMW

Defined in:
lib/imw.rb,
lib/imw/boot.rb,
lib/imw/runner.rb,
lib/imw/dataset.rb,
lib/imw/parsers.rb,
lib/imw/resource.rb,
lib/imw/resources.rb,
lib/imw/utils/log.rb,
lib/imw/repository.rb,
lib/imw/transforms.rb,
lib/imw/utils/misc.rb,
lib/imw/utils/error.rb,
lib/imw/utils/paths.rb,
lib/imw/dataset/paths.rb,
lib/imw/utils/version.rb,
lib/imw/resources/local.rb,
lib/imw/dataset/workflow.rb,
lib/imw/resources/remote.rb,
lib/imw/utils/extensions.rb,
lib/imw/resources/archive.rb,
lib/imw/resources/formats.rb,
lib/imw/resources/schemes.rb,
lib/imw/parsers/html_parser.rb,
lib/imw/parsers/line_parser.rb,
lib/imw/transforms/archiver.rb,
lib/imw/resources/schemes/s3.rb,
lib/imw/parsers/regexp_parser.rb,
lib/imw/transforms/transferer.rb,
lib/imw/resources/compressible.rb,
lib/imw/resources/formats/json.rb,
lib/imw/resources/formats/sgml.rb,
lib/imw/resources/formats/yaml.rb,
lib/imw/resources/schemes/hdfs.rb,
lib/imw/resources/schemes/http.rb,
lib/imw/resources/compressed_file.rb,
lib/imw/resources/formats/delimited.rb,
lib/imw/parsers/html_parser/matchers.rb,
lib/imw/resources/archives_and_compressed.rb,
lib/imw/resources/archives_and_compressed/gz.rb,
lib/imw/resources/archives_and_compressed/bz2.rb,
lib/imw/resources/archives_and_compressed/rar.rb,
lib/imw/resources/archives_and_compressed/tar.rb,
lib/imw/resources/archives_and_compressed/zip.rb,
lib/imw/resources/archives_and_compressed/targz.rb,
lib/imw/resources/archives_and_compressed/tarbz2.rb

Overview

The Infinite Monkeywrench (IMW) is a Ruby library for ripping, extracting, parsing, munging, and packaging datasets. It allows you to handle different data formats transparently as well as organize transformations of data as a network of dependencies (a la Make or Rake).

IMW has a few central concepts: resources, datasets, workflows, and repositories.

Resources represent individual data resources such as local files, websites, and databases. Resources are typically instantiated via IMW.open, with IMW doing the work of figuring out what to return based on the URI passed in.

Datasets represent collections of related data resources. An IMW::Dataset comes with a pre-defined (but customizable) workflow that takes data resources through several steps: rip, parse, munge, and package. The workflow leverages Rake and so the various tasks that are necessary to process the data till it is nice and pretty can all be linked with dependencies.

Repositories are collections of datasets; the imw command-line tool operates on these collections.

Defined Under Namespace

Modules: Config, Parsers, Paths, Resources, Transforms, VERSION, Workflow

Classes: Counter, Dataset, Repository, Resource, Runner, SystemCallError

Constant Summary

RunnerError =
Class.new(IMW::Error)
LOG_FILE_DESTINATION =

Default log file.

STDERR
LOG_TIMEFORMAT =
"%Y%m%d-%H:%M:%S "
PROGRESS_TRACKERS =
{}
PROGRESS_COUNTERS =
{}
Error =

Base error class which all IMW errors subclass.

Class.new(StandardError)
NoMethodError =

Method undefined.

Class.new(Error)
TypeError =

Type error.

Class.new(Error)
NotImplementedError =

Not implemented (typically because user needs to define a method when subclassing a base class).

Class.new(Error)
ParseError =

Error during parsing.

Class.new(Error)
PathError =

Error with a non-existing, invalid, or inaccessible path.

Class.new(Error)
NetworkError =

Error communicating with a remote entity.

Class.new(Error)
ArgumentError =

Invalid argument.

Class.new(Error)
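
The Class.new(Error) pattern used for the error constants above can be sketched in plain Ruby. SketchIMW and its check_path helper are hypothetical stand-ins, not IMW itself. The sketch illustrates two things: rescuing the base class catches every subclass, and a namespaced ArgumentError is a different class from Ruby's built-in ::ArgumentError.

```ruby
# SketchIMW is a hypothetical stand-in for the IMW namespace.
module SketchIMW
  Error         = Class.new(StandardError) # base class for everything below
  PathError     = Class.new(Error)
  ArgumentError = Class.new(Error)         # shadows ::ArgumentError inside this module

  # Hypothetical helper: raises the namespaced PathError for a missing path.
  def self.check_path(path)
    raise PathError, "no such path: #{path}" unless File.exist?(path)
    path
  end
end

# Rescuing the base class is enough to catch any of its subclasses.
def describe_failure(path)
  SketchIMW.check_path(path)
  'ok'
rescue SketchIMW::Error => e
  "#{e.class}: #{e.message}"
end

puts describe_failure('/definitely/not/a/real/path/xyz')
```

One consequence worth keeping in mind: code written inside such a namespace that says rescue ArgumentError will match the namespaced class, not the built-in one.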
DEFAULT_PATHS =

Default paths for the IMW. Chosen to make sense on most *NIX distributions.

{
  :home         => ENV['HOME'],
  :data_root    => "/var/lib/imw",
  :log_root     => "/var/log/imw",
  :scripts_root => "/usr/share/imw",
  :tmp_root     => "/tmp/imw",

  # the imw library
  :imw_root  => File.expand_path(File.dirname(__FILE__) + "/../../.."),
  :imw_bin   => [:imw_root, 'bin'],
  :imw_etc   => [:imw_root, 'etc'],
  :imw_lib   => [:imw_root, 'lib'],

  # workflow
  :ripd_root  => [:data_root, 'ripd'],
  :rawd_root  => [:data_root, 'rawd'],
  :fixd_root  => [:data_root, 'fixd'],
  :pkgd_root  => [:data_root, 'pkgd']
}
Task =

An IMW version of Rake::Task

Class.new(Rake::Task)
FileTask =

An IMW subclass of Rake::FileTask

Class.new(Rake::FileTask)
FileCreationTask =

An IMW subclass of Rake::FileCreationTask

Class.new(Rake::FileCreationTask)
COMPRESSION_SETTINGS =

Default settings used when compressing files. :program defines the name of the command-line program to use, :compress gives the flags to use when compressing, and :extension gives the extension (without the ‘.’) added by the program after compressing.

{
  :program   => 'bzip2',
  :compress  => '',
  :extension => 'bz2'
}

Class Attribute Details

.log ⇒ Object

Returns the value of attribute log.



# File 'lib/imw/utils/log.rb', line 10

def log
  @log
end

Class Method Details

.add_path(sym, *pathsegs) ⇒ String

Adds a symbolic path for expansion by path_to.

IMW.add_path :foo, '~/whoa'
IMW.add_path :bar, :foo,   'baz'
IMW.path_to :bar
=> '~/whoa/baz'

Parameters:

  • sym (Symbol)

    the name of the path to store

  • pathsegs (Symbol, String)

    the path segments to use to define the path to the name

Returns:

  • (String)

    the resulting path



# File 'lib/imw/utils/paths.rb', line 122

def self.add_path sym, *pathsegs
  IMW::PATHS[sym] = pathsegs.flatten
  path_to(sym)
end

.dataset(handle, options = {}, &block) ⇒ IMW::Dataset

Create a dataset and put it in the default IMW repository. The block, if given, is evaluated in the context of the new dataset (via instance_eval) so you can define its workflow:

IMW.dataset :my_dataset do

  # Define some paths we're going to use
  add_path :raw_data,  :ripd, 'raw_data.csv'
  add_path :fixd_data, :fixd, 'fixed_data.csv'

  # Copy a file from a website to this dataset's ripd directory.
  rip do
    IMW.open('http://mysite.com/data_archives/2010/03/03.csv').cp(path_to(:raw_data))
  end

  # Filter the raw data to those values which match some criterion defined by accept?
  munge do
    IMW.open(path_to(:raw_data)).map do |row|
      row if accept?(row)
    end.compact.dump(path_to(:fixd_data))
  end

  # Compress this new data
  package do
    IMW.open(path_to(:fixd_data)).compress.mv(path_to(:pkgd))
  end

end

Parameters:

  • handle (Symbol, String)

    the handle to identify this dataset with

  • options (Hash) (defaults to: {})

    a hash of options (see IMW::Dataset)

Returns:

  • (IMW::Dataset)

    the new dataset



# File 'lib/imw.rb', line 100

def self.dataset handle, options={}, &block
  d = IMW::Dataset.new(handle, options)
  d.instance_eval(&block) if block_given?
  d
end
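
The instance_eval pattern in the source above is what lets the block call rip, munge, and package without an explicit receiver. Here is a minimal sketch of that pattern, assuming a hypothetical SketchDataset class that only records step blocks and replays them in workflow order. It does not use Rake and is not IMW::Dataset.

```ruby
# SketchDataset is a hypothetical stand-in; it only records and replays steps.
class SketchDataset
  STEPS = [:rip, :parse, :munge, :package]

  attr_reader :handle, :log

  def initialize(handle)
    @handle = handle
    @blocks = {}
    @log    = []
  end

  # Define one DSL method per step; each stores the block passed to it.
  STEPS.each do |step|
    define_method(step) { |&block| @blocks[step] = block }
  end

  # Run the recorded blocks in workflow order, regardless of definition order.
  def run
    STEPS.each { |step| instance_eval(&@blocks[step]) if @blocks[step] }
    log
  end
end

# Mirrors the shape of the source above: build, instance_eval, return.
def sketch_dataset(handle, &block)
  d = SketchDataset.new(handle)
  d.instance_eval(&block) if block
  d
end

d = sketch_dataset(:my_dataset) do
  package { log << :package }
  rip     { log << :rip }
  munge   { log << :munge }
end

p d.run # => [:rip, :munge, :package]
```

Because the block runs with the dataset as self, local variables from the surrounding scope remain visible but instance variables and methods of the caller do not.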

.instantiate_logger! ⇒ Object

Create a Logger and point it at IMW::LOG_FILE_DESTINATION which is set in ~/.imwrc and defaults to STDERR.



# File 'lib/imw/utils/log.rb', line 14

def self.instantiate_logger!
  IMW.log ||= Logger.new(LOG_FILE_DESTINATION)
  IMW.log.datetime_format = "%Y%m%d-%H:%M:%S "
  IMW.log.level           = Logger::INFO
end
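
The effect of these logger settings can be tried with Ruby's standard Logger alone. The sketch below writes to a StringIO instead of STDERR so the output can be inspected; the io and log names are illustrative and nothing here touches IMW.log.

```ruby
require 'logger'
require 'stringio'

io  = StringIO.new                       # stands in for LOG_FILE_DESTINATION
log = Logger.new(io)
log.datetime_format = "%Y%m%d-%H:%M:%S " # same format string as in the source above
log.level           = Logger::INFO

log.debug('hidden')                      # below INFO, so dropped
log.info('ripping')                      # written
log.add(Logger::WARN, 'careful')         # Logger#add takes an explicit severity

puts io.string
```

Logger#add with an explicit level is the same call that announce, banner, and warn ultimately rely on in the source further down this page.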

.open(obj, options = {}) ⇒ IMW::Resource

Open a resource at the given uri. The resource will automatically be extended by modules which make sense given the uri.

See the documentation for IMW::Resource and the various modules within IMW::Resources for more information and options.

Passing in an IMW::Resource will simply return it.

Parameters:

Returns:

  • (IMW::Resource)

    the resulting resource, properly extended for the given URI



# File 'lib/imw.rb', line 47

def self.open obj, options={}
  return obj if obj.is_a?(IMW::Resource)
  IMW::Resource.new(obj, options)
end
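
How a resource might be "extended by modules which make sense given the uri" can be illustrated with Object#extend. Everything in this sketch is hypothetical: the SketchLocal, SketchRemote, and SketchDelimited modules and the rules for choosing them are assumptions made for illustration, not the rules IMW::Resource actually applies.

```ruby
require 'uri'

# Hypothetical capability modules.
module SketchLocal;     def local?;    true;  end; end
module SketchRemote;    def local?;    false; end; end
module SketchDelimited; def delimiter; ',';   end; end

class SketchResource
  attr_reader :uri

  def initialize(obj)
    @uri = URI.parse(obj.to_s)
    # Assumed rule: no scheme or file:// means local, anything else is remote.
    extend(@uri.scheme.nil? || @uri.scheme == 'file' ? SketchLocal : SketchRemote)
    # Assumed rule: a .csv extension adds delimited-format behaviour.
    extend(SketchDelimited) if File.extname(@uri.path.to_s) == '.csv'
  end
end

# Same shape as the source above: pass existing resources through untouched.
def sketch_open(obj)
  return obj if obj.is_a?(SketchResource)
  SketchResource.new(obj)
end

remote = sketch_open('http://example.com/data.csv')
local  = sketch_open('/tmp/notes.txt')
p remote.local?                        # => false
p remote.is_a?(SketchDelimited)        # => true
p local.is_a?(SketchDelimited)         # => false
p sketch_open(remote).equal?(remote)   # => true
```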

.open!(uri, options = {}) ⇒ IMW::Resource

Works the same way as IMW.open except opens the resource for writing.

Parameters:

Returns:

  • (IMW::Resource)

    the resulting resource, properly extended for the given URI and opened for writing.



# File 'lib/imw.rb', line 57

def self.open! uri, options={}
  IMW::Resource.new(uri, options.merge(:mode => 'w'))
end

.path_to(*pathsegs) ⇒ String

Expands a shorthand workflow path specification to an actual file path. Strings are interpreted literally but symbols are first resolved to the paths they represent.

IMW.add_path :foo, '~/whoa'
IMW.path_to :foo, 'my_thing'
=> '~/whoa/my_thing'

Parameters:

  • pathsegs (Symbol, String)

    the path segments to expand

Returns:

  • (String)

    the resulting expanded path



# File 'lib/imw/utils/paths.rb', line 107

def self.path_to *pathsegs
  path = Pathname.new IMW.path_to_helper(*pathsegs)
  path.absolute? ? File.expand_path(path) : path.to_s
end
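
The symbol-resolution rule described above (strings literal, symbols looked up) can be sketched as a small recursive expansion. SKETCH_PATHS, sketch_add_path, sketch_expand, and sketch_path_to are hypothetical names; IMW.path_to_helper is not shown on this page, so this is an assumption about the general idea rather than its implementation, and the sketch skips the File.expand_path step.

```ruby
# Hypothetical path table, seeded with two entries in the style of DEFAULT_PATHS.
SKETCH_PATHS = {
  :data_root => '/var/lib/imw',
  :ripd_root => [:data_root, 'ripd']
}

# Recursively replace symbols by their stored segments; strings pass through.
def sketch_expand(*pathsegs)
  pathsegs.flatten.flat_map do |seg|
    next seg.to_s unless seg.is_a?(Symbol)
    raise KeyError, "unknown path #{seg.inspect}" unless SKETCH_PATHS.key?(seg)
    sketch_expand(*Array(SKETCH_PATHS[seg]))
  end
end

def sketch_path_to(*pathsegs)
  File.join(*sketch_expand(*pathsegs))
end

def sketch_add_path(sym, *pathsegs)
  SKETCH_PATHS[sym] = pathsegs.flatten
  sketch_path_to(sym)
end

puts sketch_path_to(:ripd_root, 'raw_data.csv') # => /var/lib/imw/ripd/raw_data.csv
sketch_add_path :foo, '~/whoa'
puts sketch_add_path(:bar, :foo, 'baz')         # => ~/whoa/baz
```

Storing segments unexpanded and resolving them on lookup means that redefining :data_root later changes every path built on top of it.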

.remove_path(sym) ⇒ Object

Removes a symbolic path for expansion by path_to.

Parameters:

  • sym (Symbol)

    the stored path symbol to remove



# File 'lib/imw/utils/paths.rb', line 130

def self.remove_path sym
  IMW::PATHS.delete sym if IMW::PATHS.include? sym
end

.repository ⇒ IMW::Repository

The default repository in which to place datasets. See the documentation for IMW::Repository for more information on how datasets and repositories fit together.

Returns:

  • (IMW::Repository)

    the default repository



# File 'lib/imw.rb', line 66

def self.repository
  @@repository ||= IMW::Repository.new
end

.system(*commands) ⇒ Object

A replacement for the standard system call. If the command fails, it raises an IMW::SystemCallError, which carries better debugging info.

This function relies upon Kernel.system and obeys the same rules:

  • if commands has only a single element then no shell characters or spaces are escaped – you have to do it yourself or you get to use shell characters, depending on your perspective.

  • if commands is a list of elements then the second and further elements in the list have their shell characters and spaces escaped

But it also has its own rules:

  • When one of the commands is an empty or blank string, Kernel.system honors it and escapes it properly and sends it along for evaluation. This can be a problem for some programs and so IMW.system excludes blank (as in blank?) elements of commands.

  • commands will be flattened (see the gotcha below)

Calling out to the shell like this is often brittle. Imagine defining

prog  = 'some_prog'
flags = '-v -f'
args  = 'file.txt'

and later calling

IMW.system prog, flags, args

The space in the second argument (‘-v -f’) will be escaped and will therefore not be properly parsed by some_prog. Instead try

prog  = 'some_prog'
flags = ['-v', '-f']
args = ['file.txt']

IMW.system prog, flags, *args

which will work fine since flags will automatically be flattened.



# File 'lib/imw/utils/extensions.rb', line 58

def self.system *commands
  stripped_commands = commands.flatten.map { |command| command.to_s unless command.blank? }.compact
  Kernel.system(*stripped_commands)
  raise IMW::SystemCallError.new($?.dup, commands.join(' ')) unless $?.success?
end
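
The flatten-and-drop-blanks behaviour can be tried without IMW. In this sketch, sketch_clean and sketch_system are hypothetical names, blankness is approximated with to_s.strip.empty? because blank? is not core Ruby, and a plain RuntimeError stands in for IMW::SystemCallError.

```ruby
require 'rbconfig'

# Flatten nested arrays, stringify, and drop blank or nil elements.
def sketch_clean(*commands)
  commands.flatten.map { |c| c.to_s }.reject { |c| c.strip.empty? }
end

# Run the cleaned command; raise with the full command line if it fails.
def sketch_system(*commands)
  argv = sketch_clean(*commands)
  raise ArgumentError, 'no command given' if argv.empty?
  ok = Kernel.system(*argv) # several arguments: no shell is involved
  raise "command failed (#{$?.inspect}): #{argv.join(' ')}" unless ok
  true
end

p sketch_clean('some_prog', ['-v', '-f'], '', nil, ['file.txt'])
# => ["some_prog", "-v", "-f", "file.txt"]

# Use the running Ruby interpreter as a command that is known to exist.
sketch_system(RbConfig.ruby, ['-e', 'exit 0'], '')
```

Passing flags as an array keeps each flag a separate argument, which is the fix the gotcha above recommends.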

Instance Method Details

#announce(*events) ⇒ Object



# File 'lib/imw/utils/log.rb', line 20

def announce *events
  options = events.flatten.extract_options!
  options.reverse_merge! :level => Logger::INFO
  IMW.log.add options[:level], events.join("\n")
end


#banner(*events) ⇒ Object

# File 'lib/imw/utils/log.rb', line 25

def banner *events
  options = events.flatten.extract_options!
  options.reverse_merge! :level => Logger::INFO
  announce(["*"*75, events, "*"*75], options)
end

#track_count(tracker, every = 1000) ⇒ Object

Log repetitions in a given context.

Announces progress in IMW.log on the first call and then each time the call count reaches a multiple of every (default 1000).



# File 'lib/imw/utils/log.rb', line 60

def track_count tracker, every=1000
  PROGRESS_COUNTERS[tracker] ||= 0
  PROGRESS_COUNTERS[tracker]  += 1
  chunk = every * (PROGRESS_COUNTERS[tracker]/every).to_i
  track_progress "count_of_#{tracker}", chunk
end
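
The chunk computation in the source above relies on integer division rounding down, so the value handed to track_progress only changes when the count crosses a multiple of every. A tiny sketch of just that arithmetic, using a hypothetical chunk_for helper:

```ruby
# Round a call count down to the nearest multiple of `every`.
def chunk_for(count, every = 1000)
  every * (count / every) # Integer#/ truncates for non-negative integers
end

p [1, 999, 1000, 1999, 2000].map { |c| chunk_for(c) }
# => [0, 0, 1000, 1000, 2000]
```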

#track_progress(tracker, val) ⇒ Object

When the slowly-changing tracked variable val changes value, announce its new value. Always announces on first call.

Ex:

track_progress :indexing_names, name[0..0] # announce at each initial letter
track_progress :files, (i / 1000)          # announce every 1,000 iterations


# File 'lib/imw/utils/log.rb', line 45

def track_progress tracker, val
  unless (IMW::PROGRESS_TRACKERS.include?(tracker)) &&
         (IMW::PROGRESS_TRACKERS[tracker] == val)
    announce "#{tracker.to_s.gsub(/_/,' ')}: #{val}"
    IMW::PROGRESS_TRACKERS[tracker] = val
  end
end
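
The announce-on-change logic can be exercised in isolation. In this sketch, SKETCH_TRACKERS and SKETCH_MESSAGES are hypothetical stand-ins for IMW::PROGRESS_TRACKERS and the log, and messages are collected in an array rather than sent through announce.

```ruby
SKETCH_TRACKERS = {} # last value seen per tracker
SKETCH_MESSAGES = [] # what would have been announced

def sketch_track_progress(tracker, val)
  # Stay silent when the tracker is known and its value has not changed.
  return if SKETCH_TRACKERS.key?(tracker) && SKETCH_TRACKERS[tracker] == val
  SKETCH_MESSAGES << "#{tracker.to_s.tr('_', ' ')}: #{val}"
  SKETCH_TRACKERS[tracker] = val
end

# i / 1000 changes once per thousand iterations, so 2,500 calls announce 3 times.
2500.times { |i| sketch_track_progress(:files_seen, i / 1000) }

p SKETCH_MESSAGES # => ["files seen: 0", "files seen: 1", "files seen: 2"]
```

Passing a value that changes on every call, such as i % 1000, would announce on every iteration, so the tracked value should be something that moves slowly.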

#warn(*events) ⇒ Object



# File 'lib/imw/utils/log.rb', line 30

def warn *events
  options = events.flatten.extract_options!
  options.reverse_merge! :level => Logger::WARN
  announce events, options
end