Module: IMW
- Defined in:
- lib/imw.rb,
lib/imw/boot.rb,
lib/imw/tools.rb,
lib/imw/utils.rb,
lib/imw/runner.rb,
lib/imw/dataset.rb,
lib/imw/formats.rb,
lib/imw/parsers.rb,
lib/imw/schemes.rb,
lib/imw/archives.rb,
lib/imw/metadata.rb,
lib/imw/resource.rb,
lib/imw/utils/log.rb,
lib/imw/repository.rb,
lib/imw/schemes/s3.rb,
lib/imw/utils/misc.rb,
lib/imw/formats/pdf.rb,
lib/imw/schemes/sql.rb,
lib/imw/utils/error.rb,
lib/imw/utils/paths.rb,
lib/imw/archives/rar.rb,
lib/imw/archives/tar.rb,
lib/imw/archives/zip.rb,
lib/imw/formats/json.rb,
lib/imw/formats/sgml.rb,
lib/imw/formats/yaml.rb,
lib/imw/metadata/dsl.rb,
lib/imw/parsers/flat.rb,
lib/imw/schemes/hdfs.rb,
lib/imw/schemes/http.rb,
lib/imw/dataset/paths.rb,
lib/imw/formats/excel.rb,
lib/imw/schemes/local.rb,
lib/imw/utils/has_uri.rb,
lib/imw/utils/version.rb,
lib/imw/archives/targz.rb,
lib/imw/metadata/field.rb,
lib/imw/schemes/remote.rb,
lib/imw/tools/archiver.rb,
lib/imw/archives/tarbz2.rb,
lib/imw/metadata/schema.rb,
lib/imw/compressed_files.rb,
lib/imw/dataset/workflow.rb,
lib/imw/tools/aggregator.rb,
lib/imw/tools/downloader.rb,
lib/imw/tools/summarizer.rb,
lib/imw/tools/transferer.rb,
lib/imw/utils/extensions.rb,
lib/imw/formats/delimited.rb,
lib/imw/compressed_files/gz.rb,
lib/imw/parsers/html_parser.rb,
lib/imw/parsers/line_parser.rb,
lib/imw/compressed_files/bz2.rb,
lib/imw/metadata/schematized.rb,
lib/imw/parsers/regexp_parser.rb,
lib/imw/tools/extension_analyzer.rb,
lib/imw/metadata/contains_metadata.rb,
lib/imw/parsers/html_parser/matchers.rb,
lib/imw/utils/dynamically_extendable.rb,
lib/imw/compressed_files/compressible.rb
Overview
The Infinite Monkeywrench (IMW) is a Ruby library for ripping, extracting, parsing, munging, and packaging datasets. It allows you to handle different data formats transparently as well as organize transformations of data as a network of dependencies (a la Make or Rake).
IMW has a few central concepts: resources, metadata, datasets, workflows, and repositories.
Resources represent individual data resources like local files, websites, databases, &c. An IMW::Resource is typically instantiated via IMW.open, with IMW doing the work of figuring out what to return based on the URI passed in.
A Resource can have a schema which describes the fields in its data. IMW::Metadata consists of classes which describe fields.
Datasets represent collections of related data resources. An IMW::Dataset comes with a pre-defined (but customizable) workflow that takes data resources through several steps: rip, parse, munge, and package. The workflow is built on Rake, so the tasks needed to process the data into its final, clean form can all be linked with dependencies.
Repositories are collections of datasets and it is on these collections that the imw command line tool operates.
Defined Under Namespace
Modules: Archives, CompressedFiles, Config, Formats, Parsers, Paths, Schemes, Tools, Utils, VERSION, Workflow
Classes: Counter, Dataset, Metadata, Repository, Resource, Runner, SystemCallError
Constant Summary
- RunnerError =
Class.new(IMW::Error)
- LOG_FILE_DESTINATION =
Default log file.
STDERR
- LOG_TIMEFORMAT =
Default log file time format.
"%Y-%m-%d %H:%M:%S "
- VERBOSE =
Default verbosity.
false
- PROGRESS_TRACKERS =
{}
- PROGRESS_COUNTERS =
{}
- Error =
Base error class which all IMW errors subclass.
Class.new(StandardError)
- NoMethodError =
Method undefined.
Class.new(Error)
- TypeError =
Type error.
Class.new(Error)
- NotImplementedError =
Not implemented (typically because user needs to define a method when subclassing a base class).
Class.new(Error)
- ParseError =
Error during parsing.
Class.new(Error)
- PathError =
Error with a non-existing, invalid, or inaccessible path.
Class.new(Error)
- NetworkError =
Error communicating with a remote entity.
Class.new(Error)
- SchemeError =
Raised when a resource is of the wrong scheme.
Class.new(Error)
- FormatError =
Raised when a resource is of the wrong (or malformed) format.
Class.new(Error)
- ArgumentError =
Bad argument.
Class.new(Error)
- SchemaError =
Error in defining or matching a schema.
Class.new(Error)
- DEFAULT_PATHS =
Default paths for the IMW. Chosen to make sense on most *NIX distributions.
{
  :home         => ENV['HOME'],
  :data_root    => "/var/lib/imw",
  :log_root     => "/var/log/imw",
  :scripts_root => "/usr/share/imw",
  :tmp_root     => "/tmp/imw",
  # the imw library
  :imw_root     => File.expand_path(File.dirname(__FILE__) + "/../../.."),
  :imw_bin      => [:imw_root, 'bin'],
  :imw_etc      => [:imw_root, 'etc'],
  :imw_lib      => [:imw_root, 'lib'],
  # workflow
  :ripd_root    => [:data_root, 'ripd'],
  :rawd_root    => [:data_root, 'rawd'],
  :fixd_root    => [:data_root, 'fixd'],
  :pkgd_root    => [:data_root, 'pkgd']
}
- Task =
An IMW version of Rake::Task
Class.new(Rake::Task)
- FileTask =
An IMW subclass of Rake::FileTask
Class.new(Rake::FileTask)
- FileCreationTask =
An IMW subclass of Rake::FileCreationTask
Class.new(Rake::FileCreationTask)
- COMPRESSION_SETTINGS =
Default settings used when compressing files.
:program defines the name of the command-line program to use, :compress gives the flags to use when compressing, and :extension gives the extension (without the '.') added by the program after compressing.
{ :program => 'bzip2', :compress => '', :extension => 'bz2' }
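The error constants in the list above are all created with Class.new, each subclassing IMW::Error, so rescuing the base class should also catch the more specific ones. The following is a minimal sketch of that pattern only; DemoErrors and rescued_class_for are hypothetical names used for illustration and are not part of IMW.

```ruby
# Hypothetical stand-in module: it mirrors the Class.new pattern from the
# constant list but is not part of IMW.
module DemoErrors
  Error      = Class.new(StandardError)  # base class
  ParseError = Class.new(Error)          # specific failure
  PathError  = Class.new(Error)          # another specific failure
end

# Raises the given error class and returns the name of the class that the
# base-class rescue clause caught.
def rescued_class_for(error_class)
  raise error_class, "something went wrong"
rescue DemoErrors::Error => e
  e.class.name
end

puts rescued_class_for(DemoErrors::ParseError)
puts rescued_class_for(DemoErrors::PathError)
```

Because every specific class descends from the base class, calling code can choose between a broad rescue and a narrow one.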
Class Attribute Summary
- .log ⇒ Object
Returns the value of attribute log.
- .verbose ⇒ Object
Returns the value of attribute verbose.
Class Method Summary
- .add_path(sym, *pathsegs) ⇒ String
Adds a symbolic path for expansion by path_to.
- .announce(*events) ⇒ Object
- .announce_if_verbose(*events) ⇒ Object
- .banner(*events) ⇒ Object
- .dataset(handle, options = {}, &block) ⇒ IMW::Dataset
Create a dataset and put it in the default IMW repository.
- .dir!(uri, options = {}, &block) ⇒ IMW::Resource
Open (and create if necessary) a directory at the given URI.
- .instantiate_logger! ⇒ Object
Create a Logger and point it at IMW::LOG_FILE_DESTINATION which is set in ~/.imwrc and defaults to STDERR.
- .open(obj, options = {}, &block) ⇒ IMW::Resource
Open a resource at the given uri.
- .open!(uri, options = {}, &block) ⇒ IMW::Resource
Works the same way as IMW.open except opens the resource for writing.
- .path_to(*pathsegs) ⇒ String
Expands a shorthand workflow path specification to an actual file path.
- .remove_path(sym) ⇒ Object
Removes a symbolic path for expansion by path_to.
- .repository ⇒ IMW::Repository
The default repository in which to place datasets.
- .system(*commands) ⇒ Object
A replacement for the standard system call that raises an IMW::SystemCallError, which prints better debugging info, if the command fails.
- .verbose? ⇒ nil, ...
Is IMW operating in verbose mode?
- .warn(*events) ⇒ Object
- .warn_if_verbose(*events) ⇒ Object
Instance Method Summary
- #track_count(tracker, every = 1000) ⇒ Object
Log repetitions in a given context.
- #track_progress(tracker, val) ⇒ Object
When the slowly-changing tracked variable var changes value, announce its new value.
Class Attribute Details
.log ⇒ Object
Returns the value of attribute log.
# File 'lib/imw/utils/log.rb', line 14
def log
  @log
end
.verbose ⇒ Object
Returns the value of attribute verbose.
# File 'lib/imw/utils/log.rb', line 14
def verbose
  @verbose
end
Class Method Details
.add_path(sym, *pathsegs) ⇒ String
Adds a symbolic path for expansion by path_to.
IMW.add_path :foo, '~/whoa'
IMW.add_path :bar, :foo, 'baz'
IMW.path_to :bar
=> '~/whoa/baz'
# File 'lib/imw/utils/paths.rb', line 122
def self.add_path sym, *pathsegs
  IMW::PATHS[sym] = pathsegs.flatten
  path_to[sym]
end
.announce(*events) ⇒ Object
# File 'lib/imw/utils/log.rb', line 36
def self.announce *events
  options = events.flatten.extract_options!
  options.reverse_merge! :level => Logger::INFO
  IMW.log.add options[:level], "IMW: " + events.join("\n")
end
.announce_if_verbose(*events) ⇒ Object
# File 'lib/imw/utils/log.rb', line 41
def self.announce_if_verbose *events
  announce(*events) if IMW.verbose?
end
.banner(*events) ⇒ Object
# File 'lib/imw/utils/log.rb', line 45
def self.banner *events
  options = events.flatten.extract_options!
  options.reverse_merge! :level => Logger::INFO
  announce(["*"*75, events, "*"*75], options)
end
.dataset(handle, options = {}, &block) ⇒ IMW::Dataset
Create a dataset and put it in the default IMW repository.
Evaluates the given block in the context of the new dataset. This allows you to define tasks, add paths, and use defined metadata in an elegant way.
IMW.dataset :my_dataset do
  # Define some paths we're going to use
  add_path :original, :rawd, 'original.csv'
  add_path :filtered, :fixd, 'filtered.csv'
  add_path :package,  :pkgd, 'filtered.tar.bz2'

  # Copy a CSV file from a website to this machine.
  rip do
    open('http://mysite.com/data_archives/2010/03/03.csv').cp(path_to(:original))
  end

  # Filter the original CSV data by the
  # meets_some_condition? method we define elsewhere...
  munge do
    open!(path_to(:filtered)) do |filtered|
      open(path_to(:original)).each do |row|
        filtered << row if meets_some_condition?(row)
      end
    end
  end

  # Compress the filtered data to an archive.
  package do
    open(path_to(:filtered)).compress.mv(path_to(:package))
  end
end
See the /examples directory of the IMW distribution for more examples.
# File 'lib/imw.rb', line 155
def self.dataset handle, options={}, &block
  d = IMW::Dataset.new(handle, options.merge(:repository => IMW.repository))
  d.instance_eval(&block) if block_given?
  d
end
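The block passed to IMW.dataset is run with instance_eval, which is why add_path, rip, munge, and package can be called without a receiver. The sketch below illustrates that mechanism in isolation. MiniDataset, mini_dataset, and STEPS are hypothetical names, and the steps are merely recorded and called in order rather than wired into Rake as IMW does.

```ruby
# Hypothetical miniature of the block-evaluation pattern; MiniDataset is not
# an IMW class and records steps instead of defining Rake tasks.
class MiniDataset
  STEPS = [:rip, :parse, :munge, :package].freeze

  attr_reader :handle

  def initialize(handle)
    @handle = handle
    @blocks = {}
  end

  # Each workflow step stores the block it is given under the step name.
  STEPS.each do |step|
    define_method(step) { |&block| @blocks[step] = block }
  end

  # Call the recorded blocks in workflow order; return the steps that ran.
  def run
    STEPS.select do |step|
      next false unless @blocks.key?(step)
      @blocks[step].call
      true
    end
  end
end

def mini_dataset(handle, &block)
  d = MiniDataset.new(handle)
  d.instance_eval(&block) if block_given?  # inside the block, self is d
  d
end

log = []
d = mini_dataset(:my_dataset) do
  munge { log << "munged" }
  rip   { log << "ripped" }
end

p d.run
p log
```

Note that the steps run in workflow order (rip before munge) regardless of the order in which they were declared.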
.dir!(uri, options = {}, &block) ⇒ IMW::Resource
Open (and create if necessary) a directory at the given URI.
Will automatically create directories recursively. Options will be passed to IMW.open and interpreted appropriately. If a block is passed, the directory will be created before the block is yielded to.
# File 'lib/imw.rb', line 85
def self.dir! uri, options={}, &block
  if block_given?
    new_dir = open(uri, options.merge(:as => (options[:as] || []) + [Schemes::Local::LocalDirectory])) do |d|
      new_dir.create
      yield
    end
  else
    new_dir = open(uri, options.merge(:as => (options[:as] || []) + [Schemes::Local::LocalDirectory]))
    new_dir.create
  end
  new_dir
end
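For comparison, the same "create recursively, then hand the directory to a block" behaviour can be sketched with the Ruby standard library alone. This is not how IMW implements dir! (it goes through IMW.open and Schemes::Local::LocalDirectory); ensure_dir is a hypothetical helper name.

```ruby
require 'fileutils'
require 'tmpdir'

# Hypothetical helper, not IMW's API: create a directory (and any missing
# parents), then optionally yield its path to a block.
def ensure_dir(path)
  FileUtils.mkdir_p(path)
  yield path if block_given?
  path
end

# Work inside a throwaway temporary directory so nothing is left behind.
Dir.mktmpdir do |root|
  nested = File.join(root, 'a', 'b', 'c')
  ensure_dir(nested) do |dir|
    File.write(File.join(dir, 'note.txt'), 'hello')
  end
  puts File.directory?(nested)
  puts File.read(File.join(nested, 'note.txt'))
end
```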
.instantiate_logger! ⇒ Object
Create a Logger and point it at IMW::LOG_FILE_DESTINATION which is set in ~/.imwrc and defaults to STDERR.
# File 'lib/imw/utils/log.rb', line 30
def self.instantiate_logger!
  IMW.log ||= Logger.new(LOG_FILE_DESTINATION)
  IMW.log.datetime_format = "%Y%m%d-%H:%M:%S "
  IMW.log.level = Logger::INFO
end
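The effect of these settings can be tried out with Ruby's standard Logger on its own. In the sketch below a StringIO stands in for STDERR or a log file so the output can be inspected; the message texts are made up for illustration.

```ruby
require 'logger'
require 'stringio'

# A StringIO stands in for the log destination so the result can be examined.
sink = StringIO.new
log  = Logger.new(sink)
log.datetime_format = "%Y%m%d-%H:%M:%S "
log.level = Logger::INFO

log.add(Logger::INFO,  "IMW: starting")  # at the threshold: written
log.add(Logger::DEBUG, "IMW: details")   # below the threshold: dropped
log.add(Logger::WARN,  "IMW: careful")   # above the threshold: written

puts sink.string
```

With the level set to Logger::INFO, debug messages are discarded while info and warning messages are kept.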
.open(obj, options = {}, &block) ⇒ IMW::Resource
Open a resource at the given uri. The resource will automatically be extended by modules which make sense given the uri.
See the documentation for IMW::Resource and the various modules within IMW::Resources for more information and options.
Passing in an IMW::Resource will simply return it.
# File 'lib/imw.rb', line 59
def self.open obj, options={}, &block
  if obj.is_a?(IMW::Resource)
    resource = obj
  else
    options[:use_modules]  ||= (options[:as]      || [])
    options[:skip_modules] ||= (options[:without] || [])
    resource = IMW::Resource.new(obj, options)
  end
  if block_given?
    yield resource
    resource.close
  else
    resource
  end
end
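The control flow of IMW.open (pass an existing resource through, otherwise wrap the argument, and close after yielding when a block is given) can be illustrated without IMW. DemoResource and demo_open below are hypothetical stand-ins; the real IMW::Resource does far more, including extending itself with modules based on the URI.

```ruby
# Hypothetical resource class used only to illustrate the control flow.
class DemoResource
  attr_reader :uri

  def initialize(uri)
    @uri    = uri
    @closed = false
  end

  def close
    @closed = true
  end

  def closed?
    @closed
  end
end

# Pass existing resources through, wrap anything else, and close the
# resource after yielding it when a block is given.
def demo_open(obj)
  resource = obj.is_a?(DemoResource) ? obj : DemoResource.new(obj)
  return resource unless block_given?
  yield resource
  resource.close
end

kept = demo_open('/tmp/data.csv')
puts kept.closed?              # no block, so the resource stays open

seen = nil
demo_open(kept) { |r| seen = r }
puts seen.equal?(kept)         # an existing resource is passed through as-is
puts kept.closed?              # the block form closes it afterwards
```

As in the listing above, the block form returns whatever close returns rather than the resource itself, so callers that need the resource afterwards should use the non-block form.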
.open!(uri, options = {}, &block) ⇒ IMW::Resource
Works the same way as IMW.open except opens the resource for writing.
# File 'lib/imw.rb', line 103
def self.open! uri, options={}, &block
  open(uri, options.merge(:mode => 'w'), &block)
end
.path_to(*pathsegs) ⇒ String
Expands a shorthand workflow path specification to an actual file path. Strings are interpreted literally but symbols are first resolved to the paths they represent.
IMW.add_path :foo, '~/whoa'
IMW.path_to :foo, 'my_thing'
=> '~/whoa/my_thing'
# File 'lib/imw/utils/paths.rb', line 107
def self.path_to *pathsegs
  path = Pathname.new IMW.path_to_helper(*pathsegs)
  path.absolute? ? File.expand_path(path) : path.to_s
end
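The listing delegates the actual expansion to IMW.path_to_helper, which is not shown on this page. As a rough illustration of the idea (symbols resolved recursively through a registry, strings taken literally), here is a self-contained sketch. PathRegistry is a hypothetical class, not IMW's implementation, and it does not guard against circular definitions.

```ruby
# Hypothetical resolver: symbols are looked up in a registry (recursively),
# strings are used literally.
class PathRegistry
  def initialize
    @paths = {}
  end

  # Register a symbolic path and return its expansion.
  def add_path(sym, *pathsegs)
    @paths[sym] = pathsegs.flatten
    path_to(sym)
  end

  # Expand a mix of symbols and strings into a single path string.
  def path_to(*pathsegs)
    File.join(*expand(pathsegs.flatten))
  end

  private

  def expand(segs)
    segs.flat_map do |seg|
      next [seg.to_s] unless seg.is_a?(Symbol)
      raise KeyError, "unknown path #{seg.inspect}" unless @paths.key?(seg)
      expand(@paths[seg])
    end
  end
end

paths = PathRegistry.new
paths.add_path :data_root, '/var/lib/imw'
paths.add_path :rawd_root, :data_root, 'rawd'
puts paths.path_to(:rawd_root, 'original.csv')
```

Storing the unexpanded segments, as DEFAULT_PATHS does with entries such as [:data_root, 'rawd'], means that redefining a root path changes everything derived from it.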
.remove_path(sym) ⇒ Object
Removes a symbolic path for expansion by path_to.
# File 'lib/imw/utils/paths.rb', line 130
def self.remove_path sym
  IMW::PATHS.delete sym if IMW::PATHS.include? sym
end
.repository ⇒ IMW::Repository
The default repository in which to place datasets. See the documentation for IMW::Repository for more information on how datasets and repositories fit together.
# File 'lib/imw.rb', line 112
def self.repository
  @@repository ||= IMW::Repository.new
end
.system(*commands) ⇒ Object
A replacement for the standard system call that raises an IMW::SystemCallError, which prints better debugging info, if the command fails.
This function relies upon Kernel.system and obeys the same rules:
- If commands has only a single element, then no shell characters or spaces are escaped: you have to do it yourself (or you get to use shell characters, depending on your perspective).
- If commands is a list of elements, then the second and further elements in the list have their shell characters and spaces escaped.
But it also has its own rules:
- When one of the commands is an empty or blank string, Kernel.system honors it, escapes it properly, and sends it along for evaluation. This can be a problem for some programs, so IMW.system excludes blank (as in blank?) elements of commands.
- commands will be flattened (see the gotcha below).
Calling out to the shell like this is often brittle. Imagine defining
prog = 'some_prog'
flags = '-v -f'
args = 'file.txt'
and later calling
IMW.system prog, flags, args
The space in the second argument (‘-v -f’) will be escaped and will therefore not be properly parsed by some_prog. Instead try
prog = 'some_prog'
flags = ['-v', '-f']
args = ['file.txt']
IMW.system prog, flags, *args
which will work fine since flags will automatically be flattened.
# File 'lib/imw/utils/extensions.rb', line 54
def self.system *commands
  stripped_commands = commands.flatten.map { |command| command.to_s unless command.blank? }.compact
  IMW.announce_if_verbose(stripped_commands.join(" "))
  exit_code = Kernel.system(*stripped_commands)
  raise IMW::SystemCallError.new($?.dup, commands.join(' ')) unless $?.success?
  exit_code
end
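The argument clean-up described above (flatten, drop blank entries, convert to strings) can be tried without IMW or ActiveSupport. In this sketch clean_commands is a hypothetical helper, and "blank" is approximated as nil or whitespace-only rather than by calling blank?. The current Ruby interpreter is used as the external program so that the example does not depend on any particular shell utility.

```ruby
require 'rbconfig'

# Hypothetical helper approximating the clean-up step: flatten nested
# arrays, drop nil or whitespace-only entries, stringify the rest.
def clean_commands(*commands)
  commands.flatten.reject { |c| c.nil? || c.to_s.strip.empty? }.map(&:to_s)
end

# The child script exits 0 only if it receives exactly one argument, "a b".
script = 'exit(ARGV == ["a b"] ? 0 : 1)'
argv   = clean_commands(RbConfig.ruby, ['-e', script], '', nil, ['a b'])

# With several arguments Kernel.system does not go through the shell, so
# the space in "a b" stays inside a single argument.
ok = Kernel.system(*argv)
puts ok
```

Keeping each flag and argument as its own array element, as the text recommends, avoids the problem of a string such as '-v -f' arriving at the program as one argument.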
.verbose? ⇒ nil, ...
Is IMW operating in verbose mode?
Calls to IMW.warn_if_verbose and friends utilize this method. Verbosity is controlled on the command line (see IMW::Runner) or by setting IMW::VERBOSE in your configuration file.
# File 'lib/imw/utils/log.rb', line 24
def self.verbose?
  VERBOSE || verbose
end
.warn(*events) ⇒ Object
# File 'lib/imw/utils/log.rb', line 51
def self.warn *events
  options = events.flatten.extract_options!
  options.reverse_merge! :level => Logger::WARN
  announce events, options
end
.warn_if_verbose(*events) ⇒ Object
# File 'lib/imw/utils/log.rb', line 56
def self.warn_if_verbose *events
  warn(*events) if IMW.verbose?
end
Instance Method Details
#track_count(tracker, every = 1000) ⇒ Object
Log repetitions in a given context.
At every nth call (every 1000th by default), announce progress in IMW.log.
# File 'lib/imw/utils/log.rb', line 84
def track_count tracker, every=1000
  PROGRESS_COUNTERS[tracker] ||= 0
  PROGRESS_COUNTERS[tracker] += 1
  chunk = every * (PROGRESS_COUNTERS[tracker]/every).to_i
  track_progress "count_of_#{tracker}", chunk
end
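The chunk computation is what makes the count "slowly changing": integer division rounds the running count down to a multiple of every, so the value handed to track_progress only changes once per every calls. A small sketch of just that arithmetic, with chunk_for as a hypothetical helper name:

```ruby
# Integer division rounds the running count down to the nearest multiple of
# `every`, so the result changes only once per `every` increments.
def chunk_for(count, every = 1000)
  every * (count / every)
end

[1, 999, 1000, 1001, 2500].each do |count|
  puts "#{count} -> #{chunk_for(count)}"
end
```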
#track_progress(tracker, val) ⇒ Object
When the slowly-changing tracked variable var changes value, announce its new value. Always announces on first call.
Ex:
track_progress :indexing_names, name[0..0] # announce at each initial letter
track_progress :files, (i / 1000) # announce at each 1,000 iterations
# File 'lib/imw/utils/log.rb', line 69
def track_progress tracker, val
  unless (IMW::PROGRESS_TRACKERS.include?(tracker)) &&
         (IMW::PROGRESS_TRACKERS[tracker] == val)
    announce "#{tracker.to_s.gsub(/_/,' ')}: #{val}"
    IMW::PROGRESS_TRACKERS[tracker] = val
  end
end
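The announce-on-change behaviour can be sketched independently of IMW's logger. ChangeTracker below is a hypothetical class that collects its announcements in an array instead of writing to IMW.log; the names in the sample data are made up.

```ruby
# Hypothetical tracker, not IMW's code: remembers the last value per key and
# records an announcement only on the first call or when the value differs.
class ChangeTracker
  attr_reader :announcements

  def initialize
    @last          = {}
    @announcements = []
  end

  def track(tracker, val)
    return if @last.key?(tracker) && @last[tracker] == val
    @announcements << "#{tracker.to_s.tr('_', ' ')}: #{val}"
    @last[tracker] = val
  end
end

t = ChangeTracker.new
%w[alice adam bob carol cid].each do |name|
  t.track(:indexing_names, name[0])  # tracked value is the initial letter
end
puts t.announcements
```

Five names produce only three announcements here, one per distinct initial letter, which is the point of tracking a slowly changing value rather than every item.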