Module: IMW::Formats::Delimited Abstract

Includes: Enumerable
Included in: Csv, Tsv
Defined in: lib/imw/formats/delimited.rb

Overview

This module is abstract.

Defines methods for parsing and writing delimited data formats (CSV, TSV, etc.) with the FasterCSV library. This module is not used to extend a resource directly. Instead, more specific modules (e.g. IMW::Resources::Formats::Csv) include it and define delimited_options, which is what is actually passed to FasterCSV.
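The include-and-specialize pattern described above can be sketched as follows. This is a minimal illustration, not the library's code: DelimitedSketch, TsvSketch and ExampleResource are hypothetical names, and Ruby's standard csv library (which FasterCSV became in Ruby 1.9) stands in for FasterCSV.

```ruby
require 'csv'       # stand-in for FasterCSV (assumption: Ruby 1.9 or later)
require 'stringio'

# Hypothetical stand-in for the abstract module: it relies on whatever
# includes it to supply `io` and `delimited_options`.
module DelimitedSketch
  include Enumerable

  def each(&block)
    CSV.new(io, **delimited_options).each(&block)
  end
end

# Hypothetical concrete format: only the column separator differs.
module TsvSketch
  include DelimitedSketch

  def delimited_options
    { col_sep: "\t" }
  end
end

# Hypothetical resource that provides the IO object.
class ExampleResource
  include TsvSketch
  attr_reader :io

  def initialize(text)
    @io = StringIO.new(text)
  end
end

rows = ExampleResource.new("a\tb\n1\t2\n").to_a
```

Because the abstract module only defines each and mixes in Enumerable, methods such as to_a, map and first come for free on any resource that includes a concrete format.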

Instance Method Details

#delimited_options ⇒ Hash

Default options to be passed to FasterCSV; see its documentation for more information.

Returns:

  • (Hash)

# File 'lib/imw/formats/delimited.rb', line 19

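def delimited_options
  # use the names of any declared fields as headers, then merge in the
  # resource's FasterCSV-compatible options
  @delimited_options ||= {
    :headers => fields && fields.map { |field| field['name'] }
  }.merge(resource_options_compatible_with_faster_csv)
end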
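As a rough illustration of how the options hash above is assembled, the sketch below rebuilds it outside the module. The build_options helper, the field list and the resource options are all hypothetical, and Ruby's standard csv library stands in for FasterCSV.

```ruby
require 'csv'  # stand-in for FasterCSV (assumption: Ruby 1.9 or later)

# Hypothetical helper mirroring the hash built in delimited_options.
def build_options(fields, resource_options)
  { headers: fields && fields.map { |field| field['name'] } }.merge(resource_options)
end

fields = [{ 'name' => 'id' }, { 'name' => 'city' }]

with_fields    = build_options(fields, { col_sep: "\t" })
without_fields = build_options(nil,    { col_sep: "\t" })
overridden     = build_options(fields, { headers: false })

# With an array of header names, every line of input is treated as data
# and each row can be addressed by name.
first_row = CSV.parse("1\tParis\n", **with_fields).first.to_h
```

Since the resource options are merged last, a :headers key in them takes precedence over the names derived from the fields. When no fields are defined, the :headers entry is simply nil.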

#each(&block) ⇒ Object

Call block with each row in this delimited resource.



# File 'lib/imw/formats/delimited.rb', line 41

def each &block
  require 'fastercsv'
  FasterCSV.new(io, delimited_options).each(&block)
end
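What the block receives depends on the :headers option. The sketch below, which uses Ruby's standard csv library as a stand-in for FasterCSV and made-up sample data, shows that rows arrive as plain arrays without headers and as row objects addressable by name or position when headers are supplied.

```ruby
require 'csv'       # stand-in for FasterCSV (assumption: Ruby 1.9 or later)
require 'stringio'

text = "1,Paris\n2,Lyon\n"  # made-up sample data

# Without headers each row is a plain Array.
plain_rows = []
CSV.new(StringIO.new(text)).each { |row| plain_rows << row }

# With headers each row can be read by header name or by index.
cities_by_name  = []
cities_by_index = []
CSV.new(StringIO.new(text), headers: %w[id city]).each do |row|
  cities_by_name  << row['city']
  cities_by_index << row[1]
end
```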

#emit(data, options = {}) ⇒ Object Also known as: <<

Emit a single array or an array of arrays into this resource.

Parameters:

  • data (Array<Array>, Array)

    array or array of arrays to emit

  • options (Hash) (defaults to: {})

Options Hash (options):

  • :persist (true, false)

    Keep this resource’s IO object open after emitting



# File 'lib/imw/formats/delimited.rb', line 51

def emit data, options={}
  require 'fastercsv'
  # wrap a single row so that data is always an array of rows
  data = [data] unless data.first.is_a?(Array)
  data.each do |row|
    write(FasterCSV.generate_line(row, delimited_options))
  end
  self
end
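The single-row versus array-of-rows handling can be tried in isolation. The following is a sketch under stated assumptions: emit_sketch is a hypothetical free-standing function that writes to a StringIO instead of the resource, and Ruby's standard csv library stands in for FasterCSV.

```ruby
require 'csv'       # stand-in for FasterCSV (assumption: Ruby 1.9 or later)
require 'stringio'

# Hypothetical free-standing version of emit that writes to any IO.
def emit_sketch(io, data, options = {})
  # A flat array is treated as one row; an array of arrays as many rows.
  data = [data] unless data.first.is_a?(Array)
  data.each { |row| io.write(CSV.generate_line(row, **options)) }
  io
end

out = StringIO.new
emit_sketch(out, ['id', 'city'])                        # a single row
emit_sketch(out, [[1, 'Paris'], [2, 'New York, NY']])   # several rows
result = out.string
```

Each row is serialized on its own line, and values containing the separator are quoted, so result holds three lines with the last city wrapped in double quotes.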

#fields_in_first_line? ⇒ true, false

Do a heuristic check to determine whether or not the first row of this delimited data is a row of headers.

Returns:

  • (true, false)


# File 'lib/imw/formats/delimited.rb', line 65

def fields_in_first_line?
  # grab the header and up to 10 body rows
  require 'fastercsv'
  copy  = FasterCSV.new(io, resource_options_compatible_with_faster_csv.merge(:headers => false))
  header = (copy.shift || []) rescue []
  body   = 10.times.map { (copy.shift || []) rescue []}.flatten

  # guess how many elements in a row
  #size_guess = ((header.size + body.map(&:size).inject(0.0) { |e, s| s += e }).to_f / (1 + body.length).to_f).to_i
  
  # calculate the fraction of bytes that are [-A-Za-z_] (letters +
  # underscore + hyphen) for header and body and compute a
  # threshold determinant
  header_chars           = header.map(&:to_s).join
  header_schema_bytes    = header_chars.bytes.find_all { |byte| (byte >= 65 && byte <= 90) || (byte >= 97 && byte <= 122) || byte == 95 || byte == 45 }
  body_chars             = body.map(&:to_s).join
  body_schema_bytes      = body_chars.bytes.find_all { |byte| (byte >= 65 && byte <= 90) || (byte >= 97 && byte <= 122) || byte == 95 || byte == 45 }
  header_schema_fraction = header_schema_bytes.size.to_f / header_chars.size.to_f    rescue nil
  body_schema_fraction   = body_schema_bytes.size.to_f   / body_chars.size.to_f      rescue nil
  determinant            = (body_schema_fraction - header_schema_fraction).abs / 2.0 rescue nil

  # decide, setting the threshold at 0.05 based on some guesswork...
  determinant && determinant >= 0.05
end
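To make the heuristic concrete, here is a small worked sketch that applies the same arithmetic to rows already held in memory. The helper names and sample rows are hypothetical, and it counts characters rather than bytes, which is equivalent only for ASCII data.

```ruby
# Fraction of characters that are ASCII letters, underscores or hyphens.
# Returns nil for empty input so the caller can bail out.
def schema_fraction(cells)
  chars = cells.map(&:to_s).join
  return nil if chars.empty?
  chars.scan(/[A-Za-z_-]/).size.to_f / chars.size
end

# Hypothetical in-memory version of the check: the first row is taken to
# be a header when its letter fraction differs enough from the body's.
def header_row?(header, body_rows, threshold = 0.05)
  header_fraction = schema_fraction(header)
  body_fraction   = schema_fraction(body_rows.flatten)
  return false if header_fraction.nil? || body_fraction.nil?
  (body_fraction - header_fraction).abs / 2.0 >= threshold
end

# "idcity_name" is 11/11 letter-like; "1Paris2Lyon" is 9/11.
looks_like_header = header_row?(%w[id city_name], [%w[1 Paris], %w[2 Lyon]])

# "1Paris" is 5/6 letter-like; "2Lyon3Nice" is 8/10.
looks_like_data = header_row?(%w[1 Paris], [%w[2 Lyon], %w[3 Nice]])
```

In the first call the determinant is about 0.09, above the 0.05 threshold, so the row is judged to be a header. In the second it is about 0.017, so the first row is treated as ordinary data.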

#guess_fields! ⇒ Object

If the first line of this data appears to contain field names, use them to define this resource’s fields.

Will overwrite any fields already present for this resource.



# File 'lib/imw/formats/delimited.rb', line 95

def guess_fields!
  return unless fields_in_first_line?
  copy                        = FasterCSV.new(io, resource_options_compatible_with_faster_csv.merge(:headers => false))
  names                       = (copy.shift || []) rescue []
  self.fields                 = names.map { |n| { 'name' => n } }
  delimited_options[:headers] = names
end

#load {|Array| ... } ⇒ Array

Return the data in this delimited resource as an array of arrays.

Yield each outer array (row) if passed a block.

Yields:

  • (Array)

    each row of the data

Returns:

  • (Array)

    the full data matrix



# File 'lib/imw/formats/delimited.rb', line 32

def load &block
  require 'fastercsv'
  FasterCSV.parse(read, delimited_options, &block)
end
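The two calling styles can be sketched with Ruby's standard csv library standing in for FasterCSV. The sample text is made up, and no headers are configured here, so every row is a plain array.

```ruby
require 'csv'  # stand-in for FasterCSV (assumption: Ruby 1.9 or later)

text = "a,b\n1,2\n3,4\n"  # made-up sample data

# Without a block, the whole document comes back as an array of arrays.
matrix = CSV.parse(text)

# With a block, each row is yielded as it is parsed.
yielded = []
CSV.parse(text) { |row| yielded << row }
```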

#snippet ⇒ Array<Array>

Return a sample of up to the first 10 rows of this file.

Returns:

  • (Array<Array>)

# File 'lib/imw/formats/delimited.rb', line 106

def snippet
  require 'fastercsv'
  returning([]) do |rows|
    row_num = 1
    each do |row|
      break if row_num > 10
      rows << row.size.times.map { |index| row[index] }
      row_num += 1
    end
  end
end
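Because the parser is Enumerable, a similar sample can be taken with first. The sketch below is an alternative formulation rather than the method above: it uses Ruby's standard csv library as a stand-in for FasterCSV, generated sample data, and no headers, so rows are already plain arrays.

```ruby
require 'csv'       # stand-in for FasterCSV (assumption: Ruby 1.9 or later)
require 'stringio'

# 25 generated rows of made-up data.
io = StringIO.new((1..25).map { |i| "#{i},row#{i}\n" }.join)

# Enumerable#first stops reading once 10 rows have been collected.
sample = CSV.new(io).first(10)
```

A file with fewer than 10 rows yields only as many rows as it contains.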