Module: IMW

Defined in:
lib/imw.rb,
lib/imw/boot.rb,
lib/imw/tools.rb,
lib/imw/utils.rb,
lib/imw/runner.rb,
lib/imw/dataset.rb,
lib/imw/formats.rb,
lib/imw/parsers.rb,
lib/imw/schemes.rb,
lib/imw/archives.rb,
lib/imw/metadata.rb,
lib/imw/resource.rb,
lib/imw/utils/log.rb,
lib/imw/repository.rb,
lib/imw/schemes/s3.rb,
lib/imw/utils/misc.rb,
lib/imw/formats/pdf.rb,
lib/imw/schemes/sql.rb,
lib/imw/utils/error.rb,
lib/imw/utils/paths.rb,
lib/imw/archives/rar.rb,
lib/imw/archives/tar.rb,
lib/imw/archives/zip.rb,
lib/imw/formats/json.rb,
lib/imw/formats/sgml.rb,
lib/imw/formats/yaml.rb,
lib/imw/metadata/dsl.rb,
lib/imw/parsers/flat.rb,
lib/imw/schemes/hdfs.rb,
lib/imw/schemes/http.rb,
lib/imw/dataset/paths.rb,
lib/imw/formats/excel.rb,
lib/imw/schemes/local.rb,
lib/imw/utils/has_uri.rb,
lib/imw/utils/version.rb,
lib/imw/archives/targz.rb,
lib/imw/metadata/field.rb,
lib/imw/schemes/remote.rb,
lib/imw/tools/archiver.rb,
lib/imw/archives/tarbz2.rb,
lib/imw/metadata/schema.rb,
lib/imw/compressed_files.rb,
lib/imw/dataset/workflow.rb,
lib/imw/tools/aggregator.rb,
lib/imw/tools/downloader.rb,
lib/imw/tools/summarizer.rb,
lib/imw/tools/transferer.rb,
lib/imw/utils/extensions.rb,
lib/imw/formats/delimited.rb,
lib/imw/compressed_files/gz.rb,
lib/imw/parsers/html_parser.rb,
lib/imw/parsers/line_parser.rb,
lib/imw/compressed_files/bz2.rb,
lib/imw/metadata/schematized.rb,
lib/imw/parsers/regexp_parser.rb,
lib/imw/tools/extension_analyzer.rb,
lib/imw/metadata/contains_metadata.rb,
lib/imw/parsers/html_parser/matchers.rb,
lib/imw/utils/dynamically_extendable.rb,
lib/imw/compressed_files/compressible.rb

Overview

The Infinite Monkeywrench (IMW) is a Ruby library for ripping, extracting, parsing, munging, and packaging datasets. It allows you to handle different data formats transparently as well as organize transformations of data as a network of dependencies (a la Make or Rake).

IMW has a few central concepts: resources, metadata, datasets, workflows, and repositories.

Resources represent individual data resources such as local files, websites, and databases. An IMW::Resource is typically instantiated via IMW.open, with IMW working out what to return based on the URI passed in.

A Resource can have a schema which describes the fields in its data. IMW::Metadata consists of classes which describe fields.

Datasets represent collections of related data resources. An IMW::Dataset comes with a pre-defined (but customizable) workflow that takes data resources through several steps: rip, parse, munge, and package. The workflow leverages Rake, so the various tasks needed to process the data until it is nice and pretty can all be linked with dependencies.
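
The dependency-driven workflow described above can be sketched with plain Rake. This is only an illustration and not IMW's own task definitions: the task names mirror the four workflow steps, and the ORDER constant is a hypothetical stand-in for real work.

```ruby
require 'rake'

# Hypothetical record of the order in which the steps run.
ORDER = []

# Each step lists the previous step as its prerequisite, as in a Rakefile.
Rake::Task.define_task(:rip)               { ORDER << :rip }
Rake::Task.define_task(:parse   => :rip)   { ORDER << :parse }
Rake::Task.define_task(:munge   => :parse) { ORDER << :munge }
Rake::Task.define_task(:package => :munge) { ORDER << :package }

# Invoking the last step runs its prerequisites first.
Rake::Task[:package].invoke
puts ORDER.inspect
```

Asking for the final step is enough; Rake walks the prerequisite chain so that each earlier step runs once, in order.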

Repositories are collections of datasets; the imw command-line tool operates on these collections.

Defined Under Namespace

Modules: Archives, CompressedFiles, Config, Formats, Parsers, Paths, Schemes, Tools, Utils, VERSION, Workflow

Classes: Counter, Dataset, Metadata, Repository, Resource, Runner, SystemCallError

Constant Summary

RunnerError =
Class.new(IMW::Error)
LOG_FILE_DESTINATION =

Default log file.

STDERR
LOG_TIMEFORMAT =

Default log file time format

"%Y-%m-%d %H:%M:%S "
VERBOSE =

Default verbosity

false
PROGRESS_TRACKERS =
{}
PROGRESS_COUNTERS =
{}
Error =

Base error class which all IMW errors subclass.

Class.new(StandardError)
NoMethodError =

Method undefined.

Class.new(Error)
TypeError =

Type error.

Class.new(Error)
NotImplementedError =

Not implemented (typically because user needs to define a method when subclassing a base class).

Class.new(Error)
ParseError =

Error during parsing.

Class.new(Error)
PathError =

Error with a non-existing, invalid, or inaccessible path.

Class.new(Error)
NetworkError =

Error communicating with a remote entity.

Class.new(Error)
SchemeError =

Raised when a resource is of the wrong scheme.

Class.new(Error)
FormatError =

Raised when a resource is of the wrong (or malformed) format.

Class.new(Error)
ArgumentError =

Bad argument.

Class.new(Error)
SchemaError =

Error in defining or matching a schema.

Class.new(Error)
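
Because every error above subclasses one base class, a single rescue clause can catch any of them. The sketch below uses a hypothetical Sketch module built the same way (Class.new on a base class) rather than IMW itself. It also shows that a constant such as ArgumentError defined inside a module shadows Ruby's built-in class of the same name within that module.

```ruby
# Hypothetical stand-in for the IMW error hierarchy, defined the same way.
module Sketch
  Error         = Class.new(StandardError)
  ParseError    = Class.new(Error)
  ArgumentError = Class.new(Error)   # shadows ::ArgumentError inside Sketch

  def self.parse(line)
    # ArgumentError here resolves to Sketch::ArgumentError.
    raise ArgumentError, "line must be a String" unless line.is_a?(String)
    raise ParseError, "empty line" if line.strip.empty?
    line.split(",")
  end
end

# Rescuing the base class catches every subclass.
def safe_parse(line)
  Sketch.parse(line)
rescue Sketch::Error => e
  "#{e.class}: #{e.message}"
end

puts safe_parse("a,b").inspect
puts safe_parse("   ")
puts safe_parse(42)
```

Code outside the module that rescues Ruby's own ArgumentError will not catch the namespaced one, since the two classes are unrelated.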
DEFAULT_PATHS =

Default paths for the IMW. Chosen to make sense on most *NIX distributions.

{
  :home         => ENV['HOME'],
  :data_root    => "/var/lib/imw",
  :log_root     => "/var/log/imw",
  :scripts_root => "/usr/share/imw",
  :tmp_root     => "/tmp/imw",

  # the imw library
  :imw_root  => File.expand_path(File.dirname(__FILE__) + "/../../.."),
  :imw_bin   => [:imw_root, 'bin'],
  :imw_etc   => [:imw_root, 'etc'],
  :imw_lib   => [:imw_root, 'lib'],

  # workflow
  :ripd_root  => [:data_root, 'ripd'],
  :rawd_root  => [:data_root, 'rawd'],
  :fixd_root  => [:data_root, 'fixd'],
  :pkgd_root  => [:data_root, 'pkgd']
}
Task =

An IMW version of Rake::Task

Class.new(Rake::Task)
FileTask =

An IMW subclass of Rake::FileTask

Class.new(Rake::FileTask)
FileCreationTask =

An IMW subclass of Rake::FileCreationTask

Class.new(Rake::FileCreationTask)
COMPRESSION_SETTINGS =

Default settings used when compressing files. :program defines the name of the command-line program to use, :compress gives the flags to use when compressing, and :extension gives the extension (without the ‘.’) added by the program after compressing.

{
  :program   => 'bzip2',
  :compress  => '',
  :extension => 'bz2'
}


Class Attribute Details

.log ⇒ Object

Returns the value of attribute log.



# File 'lib/imw/utils/log.rb', line 14

def log
  @log
end

.verbose ⇒ Object

Returns the value of attribute verbose.



# File 'lib/imw/utils/log.rb', line 14

def verbose
  @verbose
end

Class Method Details

.add_path(sym, *pathsegs) ⇒ String

Adds a symbolic path for expansion by path_to.

IMW.add_path :foo, '~/whoa'
IMW.add_path :bar, :foo,   'baz'
IMW.path_to :bar
=> '~/whoa/baz'

Parameters:

  • sym (Symbol)

    the name of the path to store

  • pathsegs (Symbol, String)

    the path segments to use to define the path to the name

Returns:

  • (String)

    the resulting path



# File 'lib/imw/utils/paths.rb', line 122

def self.add_path sym, *pathsegs
  IMW::PATHS[sym] = pathsegs.flatten
  path_to(sym)  # call path_to with the symbol; path_to[sym] would index into its return value
end

.announce(*events) ⇒ Object



# File 'lib/imw/utils/log.rb', line 36

def self.announce *events
  options = events.flatten.extract_options!
  options.reverse_merge! :level => Logger::INFO
  IMW.log.add options[:level], "IMW: " + events.join("\n")
end

.announce_if_verbose(*events) ⇒ Object



# File 'lib/imw/utils/log.rb', line 41

def self.announce_if_verbose *events
  announce(*events) if IMW.verbose?
end


.banner(*events) ⇒ Object

# File 'lib/imw/utils/log.rb', line 45

def self.banner *events
  options = events.flatten.extract_options!
  options.reverse_merge! :level => Logger::INFO
  announce(["*"*75, events, "*"*75], options)
end

.dataset(handle, options = {}, &block) ⇒ IMW::Dataset

Create a dataset and put it in the default IMW repository.

Evaluates the given block in the context of the new dataset. This allows you to define tasks, add paths, and use defined metadata in an elegant way.

IMW.dataset :my_dataset do

  # Define some paths we're going to use
  add_path :original, :rawd, 'original.csv'
  add_path :filtered, :fixd, 'filtered.csv'
  add_path :package,  :pkgd, 'filtered.tar.bz2'

  # Copy a CSV file from a website to this machine.
  rip do
    open('http://mysite.com/data_archives/2010/03/03.csv').cp(path_to(:original))
  end

  # Filter the original CSV data by the
  # meets_some_condition? method we define elsewhere...
  munge do
    open!(path_to(:filtered)) do |filtered|
      open(path_to(:original)).each do |row|
        filtered << row if meets_some_condition?(row)
      end
    end
  end

  # Compress the filtered data to an archive.
  package do
    open(path_to(:filtered)).compress.mv(path_to(:package))
  end
end

See the /examples directory of the IMW distribution for more examples.

Parameters:

  • handle (Symbol, String)

    the handle to identify this dataset with

  • options (Hash) (defaults to: {})

    a hash of options (see IMW::Dataset)

Returns:

  • (IMW::Dataset)

# File 'lib/imw.rb', line 155

def self.dataset handle, options={}, &block
  d = IMW::Dataset.new(handle, options.merge(:repository => IMW.repository))
  d.instance_eval(&block) if block_given?
  d
end
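
The block passed to IMW.dataset runs with the new dataset as self, which is why bare calls such as add_path and rip work inside it. A minimal standalone sketch of that pattern follows. MiniDataset and mini_dataset are hypothetical names and only imitate the build, instance_eval, return shape of the method above.

```ruby
# Hypothetical miniature of a dataset that records paths and steps.
class MiniDataset
  attr_reader :handle, :paths, :steps

  def initialize(handle)
    @handle = handle
    @paths  = {}
    @steps  = {}
  end

  def add_path(name, *segments)
    @paths[name] = File.join(*segments)
  end

  def rip(&block)
    @steps[:rip] = block
  end
end

# Same shape as IMW.dataset: build, instance_eval the block, return the object.
def mini_dataset(handle, &block)
  d = MiniDataset.new(handle)
  d.instance_eval(&block) if block_given?
  d
end

d = mini_dataset(:my_dataset) do
  add_path :original, 'rawd', 'original.csv'   # self is the MiniDataset here
  rip { :ripped }
end

puts d.paths[:original]
puts d.steps.keys.inspect
```

One consequence of instance_eval is that instance variables and private helpers of the calling scope are not visible inside the block, only local variables are.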

.dir!(uri, options = {}, &block) ⇒ IMW::Resource

Open (and create if necessary) a directory at the given URI.

Will automatically create directories recursively. Options will be passed to IMW.open and interpreted appropriately. If a block is passed, the directory will be created before the block is yielded to.

Parameters:

Returns:

  • (IMW::Resource)

# File 'lib/imw.rb', line 85

def self.dir! uri, options={}, &block
  if block_given?
    new_dir = open(uri, options.merge(:as => (options[:as] || []) + [Schemes::Local::LocalDirectory])) do |d|
      # Create the directory through the block variable; new_dir is not yet assigned here.
      d.create
      yield d
    end
  else
    new_dir = open(uri, options.merge(:as => (options[:as] || []) + [Schemes::Local::LocalDirectory]))
    new_dir.create
  end
  new_dir
end

.instantiate_logger! ⇒ Object

Create a Logger and point it at IMW::LOG_FILE_DESTINATION which is set in ~/.imwrc and defaults to STDERR.



# File 'lib/imw/utils/log.rb', line 30

def self.instantiate_logger!
  IMW.log ||= Logger.new(LOG_FILE_DESTINATION)
  IMW.log.datetime_format = "%Y%m%d-%H:%M:%S "
  IMW.log.level           = Logger::INFO
end
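
The logger setup above uses only Ruby's standard Logger. The sketch below reproduces the same three settings against an in-memory StringIO so the effect of the INFO threshold is visible. The build_logger helper is a hypothetical name, and the StringIO destination is an assumption made for the demonstration.

```ruby
require 'logger'
require 'stringio'

# Hypothetical helper applying the same settings as instantiate_logger!.
def build_logger(destination)
  log = Logger.new(destination)
  log.datetime_format = "%Y%m%d-%H:%M:%S "
  log.level           = Logger::INFO
  log
end

# Capture output in memory instead of STDERR.
io  = StringIO.new
log = build_logger(io)

log.add Logger::INFO,  "IMW: starting"
log.add Logger::DEBUG, "IMW: hidden"     # below the INFO threshold, so dropped

puts io.string
```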

.open(obj, options = {}, &block) ⇒ IMW::Resource

Open a resource at the given uri. The resource will automatically be extended by modules which make sense given the uri.

See the documentation for IMW::Resource and the various modules within IMW::Resources for more information and options.

Passing in an IMW::Resource will simply return it.

Parameters:

Options Hash (options):

  • as (Array<String,Module>)

    same as :use_modules in IMW::Resource.extend_instance!

  • without (Array<String,Module>)

    same as :skip_modules in IMW::Resource.extend_instance!

Returns:

  • (IMW::Resource)

    the resulting resource, properly extended for the given URI



# File 'lib/imw.rb', line 59

def self.open obj, options={}, &block
  if obj.is_a?(IMW::Resource)
    resource = obj
  else
    options[:use_modules]  ||= (options[:as]      || [])
    options[:skip_modules] ||= (options[:without] || [])
    resource = IMW::Resource.new(obj, options)
  end
  if block_given?
    yield resource
    resource.close
  else
    resource
  end
end
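
The control flow of IMW.open has two modes: with a block it yields the resource and then closes it, and without a block it returns the open resource. The following standalone sketch imitates only that flow. FakeResource and open_resource are hypothetical names, and no real files or URIs are touched.

```ruby
# Hypothetical resource that only tracks whether it has been closed.
class FakeResource
  attr_reader :uri

  def initialize(uri)
    @uri    = uri
    @closed = false
  end

  def close
    @closed = true
  end

  def closed?
    @closed
  end
end

# Same control flow as IMW.open: yield and close when a block is given,
# otherwise hand the open resource back to the caller.
def open_resource(obj)
  resource = obj.is_a?(FakeResource) ? obj : FakeResource.new(obj)
  if block_given?
    yield resource
    resource.close
  else
    resource
  end
end

seen = nil
open_resource('/tmp/a.csv') { |r| seen = r }
puts seen.closed?

kept = open_resource('/tmp/b.csv')
puts kept.closed?
puts open_resource(kept).equal?(kept)
```

Note that in the block form the method's return value is whatever close returns, not the resource, so callers who need the resource afterwards should use the non-block form.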

.open!(uri, options = {}, &block) ⇒ IMW::Resource

Works the same way as IMW.open except opens the resource for writing.

Parameters:

Returns:

  • (IMW::Resource)

    the resulting resource, properly extended for the given URI and opened for writing.



# File 'lib/imw.rb', line 103

def self.open! uri, options={}, &block
  open(uri, options.merge(:mode => 'w'), &block)
end

.path_to(*pathsegs) ⇒ String

Expands a shorthand workflow path specification to an actual file path. Strings are interpreted literally but symbols are first resolved to the paths they represent.

IMW.add_path :foo, '~/whoa'
IMW.path_to :foo, 'my_thing'
=> '~/whoa/my_thing'

Parameters:

Returns:

  • (String)

    the resulting expanded path



# File 'lib/imw/utils/paths.rb', line 107

def self.path_to *pathsegs
  path = Pathname.new IMW.path_to_helper(*pathsegs)
  path.absolute? ? File.expand_path(path) : path.to_s
end
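
The source of path_to_helper is not shown here, so the following is a sketch under an assumption: symbols are looked up in a path table and expanded recursively, strings are kept literally, and the pieces are joined with File.join. PATHS, expand_segments, and sketch_path_to are hypothetical names for illustration, with table entries in the style of DEFAULT_PATHS.

```ruby
# Hypothetical path table in the style of DEFAULT_PATHS.
PATHS = {
  :data_root => "/var/lib/imw",
  :ripd_root => [:data_root, 'ripd']
}

# Resolve symbols through PATHS recursively; keep strings as they are.
def expand_segments(*segments)
  segments.flatten.map do |seg|
    if seg.is_a?(Symbol)
      raise KeyError, "unknown path #{seg.inspect}" unless PATHS.key?(seg)
      expand_segments(*PATHS[seg])
    else
      seg.to_s
    end
  end.flatten
end

# Join the fully expanded segments into one path string.
def sketch_path_to(*segments)
  File.join(*expand_segments(*segments))
end

puts sketch_path_to(:ripd_root, 'example.csv')
```

Storing segments rather than finished strings means that redefining :data_root changes every path built on top of it.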

.remove_path(sym) ⇒ Object

Removes a symbolic path for expansion by path_to.

Parameters:

  • sym (Symbol)

    the stored path symbol to remove



# File 'lib/imw/utils/paths.rb', line 130

def self.remove_path sym
  IMW::PATHS.delete sym if IMW::PATHS.include? sym
end

.repository ⇒ IMW::Repository

The default repository in which to place datasets. See the documentation for IMW::Repository for more information on how datasets and repositories fit together.

Returns:

  • (IMW::Repository)

# File 'lib/imw.rb', line 112

def self.repository
  @@repository ||= IMW::Repository.new
end

.system(*commands) ⇒ Object

A replacement for the standard system call which raises an IMW::SystemCallError if the command fails. The error prints better debugging info.

This function relies upon Kernel.system and obeys the same rules:

  • if commands has only a single element then no shell characters or spaces are escaped – you have to do it yourself or you get to use shell characters, depending on your perspective.

  • if commands is a list of elements then the second and further elements in the list have their shell characters and spaces escaped

But it also has its own rules:

  • When one of the commands is an empty or blank string, Kernel.system honors it and escapes it properly and sends it along for evaluation. This can be a problem for some programs and so IMW.system excludes blank (as in blank?) elements of commands.

  • commands will be flattened (see the gotcha below)

Calling out to the shell like this is often brittle. Imagine defining

prog  = 'some_prog'
flags = '-v -f'
args  = 'file.txt'

and later calling

IMW.system prog, flags, args

The space in the second argument (‘-v -f’) will be escaped and will therefore not be properly parsed by some_prog. Instead try

prog  = 'some_prog'
flags = ['-v', '-f']
args = ['file.txt']

IMW.system prog, flags, *args

which will work fine since flags will automatically be flattened.



# File 'lib/imw/utils/extensions.rb', line 54

def self.system *commands
  stripped_commands = commands.flatten.map { |command| command.to_s unless command.blank? }.compact
  IMW.announce_if_verbose(stripped_commands.join(" "))
  exit_code = Kernel.system(*stripped_commands)
  raise IMW::SystemCallError.new($?.dup, commands.join(' ')) unless $?.success?
  exit_code
end
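
The flattening and blank-stripping step can be tried on its own without running any command. In this sketch build_argv is a hypothetical helper, and a plain strip.empty? check stands in for the blank? extension used in the source.

```ruby
# Flatten nested arguments and drop nil or blank entries, in the spirit of
# IMW.system. to_s.strip.empty? stands in for the blank? extension.
def build_argv(*commands)
  commands.flatten.reject { |c| c.nil? || c.to_s.strip.empty? }.map(&:to_s)
end

prog  = 'some_prog'
flags = ['-v', '-f']
args  = ['file.txt']

# Nested arrays are flattened and the empty string is dropped.
puts build_argv(prog, flags, '', *args).inspect

# A single string of flags stays one element, which is the gotcha described above.
puts build_argv(prog, '-v -f', args).inspect
```

The second call shows why '-v -f' misbehaves: it reaches the program as one argument containing a space rather than as two separate flags.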

.verbose? ⇒ nil, ...

Is IMW operating in verbose mode?

Calls to IMW.warn_if_verbose and friends utilize this method. Verbosity is controlled on the command line (see IMW::Runner) or by setting IMW::VERBOSE in your configuration file.

Returns:

  • (nil, false, true)


# File 'lib/imw/utils/log.rb', line 24

def self.verbose?
  VERBOSE || verbose
end

.warn(*events) ⇒ Object



# File 'lib/imw/utils/log.rb', line 51

def self.warn *events
  options = events.flatten.extract_options!
  options.reverse_merge! :level => Logger::WARN
  announce events, options
end

.warn_if_verbose(*events) ⇒ Object



# File 'lib/imw/utils/log.rb', line 56

def self.warn_if_verbose *events
  warn(*events) if IMW.verbose?
end

Instance Method Details

#track_count(tracker, every = 1000) ⇒ Object

Log repetitions in a given context

At every n’th (default 1000) call, announce progress in the IMW.log



# File 'lib/imw/utils/log.rb', line 84

def track_count tracker, every=1000
  PROGRESS_COUNTERS[tracker] ||= 0
  PROGRESS_COUNTERS[tracker]  += 1
  chunk = every * (PROGRESS_COUNTERS[tracker]/every).to_i
  track_progress "count_of_#{tracker}", chunk
end

#track_progress(tracker, val) ⇒ Object

When the slowly-changing tracked variable val changes value, announce its new value. Always announces on first call.

Ex:

track_progress :indexing_names, name[0..0] # announce at each initial letter
track_progress :files, (i / 1000)          # announce once every 1,000 iterations


# File 'lib/imw/utils/log.rb', line 69

def track_progress tracker, val
  unless (IMW::PROGRESS_TRACKERS.include?(tracker)) &&
         (IMW::PROGRESS_TRACKERS[tracker] == val)
    announce "#{tracker.to_s.gsub(/_/,' ')}: #{val}"
    IMW::PROGRESS_TRACKERS[tracker] = val
  end
end
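
The announce-on-change behavior can be seen in a standalone sketch. TRACKERS and ANNOUNCEMENTS are hypothetical stand-ins for IMW::PROGRESS_TRACKERS and the log, and sketch_track_progress mirrors the logic above without depending on IMW.

```ruby
# Hypothetical stand-ins for IMW::PROGRESS_TRACKERS and IMW.announce.
TRACKERS      = {}
ANNOUNCEMENTS = []

# Record an announcement only when the tracked value differs from the last one.
def sketch_track_progress(tracker, val)
  return if TRACKERS.key?(tracker) && TRACKERS[tracker] == val
  ANNOUNCEMENTS << "#{tracker.to_s.tr('_', ' ')}: #{val}"
  TRACKERS[tracker] = val
end

# Integer division changes value once per thousand iterations.
2500.times { |i| sketch_track_progress(:files, i / 1000) }

puts ANNOUNCEMENTS.inspect
```

Passing a value that changes on every call, such as a remainder, would announce on every iteration, so the tracked value should be something that changes slowly.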