Class: Importer::DataReader

Inherits:
Object
Defined in:
lib/iron/import/data_reader.rb

Overview

Base class for input reading: it deals with the raw file or stream and extracts raw values. It also provides the default data coercion/parsing used by the derived classes.

Direct Known Subclasses

CsvReader, CustomReader, ExcelReader, HtmlReader

Constructor Details

#initialize(importer, format) ⇒ DataReader

Returns a new instance of DataReader.



# File 'lib/iron/import/data_reader.rb', line 103

def initialize(importer, format)
  @importer = importer
  @format = format
  @supports = []
end

Instance Attribute Details

#format ⇒ Object (readonly)

Returns the format this reader handles, as set in the constructor.



# File 'lib/iron/import/data_reader.rb', line 9

def format
  @format
end

Class Method Details

.for_format(importer, format) ⇒ Object

Factory method that builds a reader from an explicit format selector (:csv, :xls, :xlsx or :html). Returns nil for any other format.



# File 'lib/iron/import/data_reader.rb', line 45

def self.for_format(importer, format)
  case format
  when :csv
    CsvReader.new(importer)
  when :xls
    verify_roo!
    XlsReader.new(importer)
  when :xlsx
    verify_roo!
    XlsxReader.new(importer)
  when :html
    verify_nokogiri!
    HtmlReader.new(importer)
  else
    nil
  end
end

.for_path(importer, path) ⇒ Object

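Selects a reader for the given path based on its file extension. Recognized extensions are csv, tsv, htm, html, xls and xlsx (case-insensitive); tsv is handled as :csv and htm as :html. Returns nil for anything else.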



# File 'lib/iron/import/data_reader.rb', line 64

def self.for_path(importer, path)
  format = path.to_s.extract(/\.(csv|tsv|html?|xlsx?)\z/i)
  if format
    format = format.downcase
    format = 'html' if format == 'htm'
    format = 'csv' if format == 'tsv'
    format = format.to_sym
    for_format(importer, format)
  else
    nil
  end
end
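
The extension mapping above relies on String#extract, which is not part of core Ruby. As a rough sketch only, the same mapping can be expressed with core String#[] and a capture-group index, assuming that String#extract returns the first capture group. The helper name format_for_path is hypothetical and is not part of this class.

```ruby
# Hypothetical helper (not part of DataReader): maps a path to a format symbol
# using only core Ruby. Assumes String#extract returns the first capture group.
def format_for_path(path)
  ext = path.to_s[/\.(csv|tsv|html?|xlsx?)\z/i, 1]
  return nil unless ext

  ext = ext.downcase
  ext = 'html' if ext == 'htm'  # .htm files are read as HTML
  ext = 'csv'  if ext == 'tsv'  # .tsv files are handed to the CSV reader
  ext.to_sym
end
```

Because the pattern is anchored with \z, only the final extension counts: a name such as data.csv.bak yields nil.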

.for_source(importer, source) ⇒ Object

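Implements automatic reader selection based on the import source: streams are routed through .for_stream, anything else through .for_path. If no reader can be found, an error is added to the importer and nil is returned.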



# File 'lib/iron/import/data_reader.rb', line 28

def self.for_source(importer, source)
  data = nil
  if is_stream?(source)
    data = DataReader::for_stream(importer, source)
    unless data
      importer.add_error("Unable to find format handler for stream")
    end
  else
    data = DataReader::for_path(importer, source)
    unless data
      importer.add_error("Unable to find format handler for file #{source}")
    end
  end
  data
end

.for_stream(importer, stream) ⇒ Object

Figure out which format to use based on a stream’s source file info



# File 'lib/iron/import/data_reader.rb', line 78

def self.for_stream(importer, stream)
  path = path_from_stream(stream)
  for_path(importer, path)
end

.is_stream?(source) ⇒ Boolean

Attempt to determine if the given source is a stream

Returns:

  • (Boolean)


# File 'lib/iron/import/data_reader.rb', line 84

def self.is_stream?(source)
  # For now, just assume anything that has a #read method is a stream, in
  # duck-type fashion
  source.respond_to?(:read)
end

.path_from_stream(stream) ⇒ Object

Try to find the original file name for the given stream, as in the case where a file is uploaded to Rails and we’re dealing with an ActionDispatch::Http::UploadedFile.



# File 'lib/iron/import/data_reader.rb', line 93

def self.path_from_stream(stream)
  if stream.respond_to?(:original_filename)
    stream.original_filename
  elsif stream.respond_to?(:path)
    stream.path
  else
    nil
  end
end
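
To illustrate the duck-typed checks in .is_stream? and .path_from_stream, here is a small standalone sketch. FakeUpload is a hypothetical stand-in for an uploaded-file object, and the two free functions copy the logic shown above so the example runs without the gem.

```ruby
require 'stringio'
require 'tempfile'

# Hypothetical stand-in for an upload object: readable, and it knows its
# original file name.
FakeUpload = Struct.new(:original_filename, :body) do
  def read
    body
  end
end

# Same test as .is_stream?: anything with a #read method counts as a stream
def stream?(source)
  source.respond_to?(:read)
end

# Same lookup order as .path_from_stream: original_filename, then path
def path_from_stream(stream)
  if stream.respond_to?(:original_filename)
    stream.original_filename
  elsif stream.respond_to?(:path)
    stream.path
  end
end

upload_name = path_from_stream(FakeUpload.new('orders.csv', "id\n1\n"))
memory_name = path_from_stream(StringIO.new("id\n1\n"))
```

An in-memory stream such as StringIO carries no file name, so the lookup gives nil; .for_stream then passes nil on to .for_path, which finds no extension and returns nil as well.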

.verify_nokogiri! ⇒ Object



# File 'lib/iron/import/data_reader.rb', line 19

def self.verify_nokogiri!
  if Gem::Specification.find_all_by_name('nokogiri', '>= 1.6.0').empty?
    raise "You are attempting to use the iron-import gem to import an HTML file.  Doing so requires installing the nokogiri gem, version 1.6.0 or later."
  else
    require 'nokogiri'
  end
end

.verify_roo! ⇒ Object



# File 'lib/iron/import/data_reader.rb', line 11

def self.verify_roo!
  if Gem::Specification.find_all_by_name('roo', '>= 1.13.0').empty?
    raise "You are attempting to use the iron-import gem to import an Excel file.  Doing so requires installing the roo gem, version 1.13.0 or later."
  else
    require 'roo'
  end
end
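
Both verify methods follow the same optional-dependency pattern: check that a suitable gem version is installed, raise a readable error if not, and only then require it. A generic sketch of that pattern might look like the following; the helper names optional_gem_available? and require_optional_gem! are hypothetical and do not exist in this class.

```ruby
require 'rubygems'

# Hypothetical helper: true if an installed gem satisfies the requirement
def optional_gem_available?(name, requirement)
  !Gem::Specification.find_all_by_name(name, requirement).empty?
end

# Hypothetical helper: raise a descriptive error when the gem is missing,
# otherwise load it lazily
def require_optional_gem!(name, requirement, purpose)
  unless optional_gem_available?(name, requirement)
    raise "#{purpose} requires installing the #{name} gem, version #{requirement}."
  end
  require name
end
```

Deferring the require this way means the optional gem is only loaded when a format that needs it is actually requested.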

Instance Method Details

#add_error(*args) ⇒ Object



# File 'lib/iron/import/data_reader.rb', line 303

def add_error(*args)
  @importer.add_error(*args)
end

#add_exception(*args) ⇒ Object



# File 'lib/iron/import/data_reader.rb', line 307

def add_exception(*args)
  @importer.add_exception(*args)
end

#init_source(mode, source) ⇒ Object

Override this method in derived classes to set up the given source in the given mode



# File 'lib/iron/import/data_reader.rb', line 205

def init_source(mode, source)
  raise "Unimplemented method #init_source in data reader #{self.class.name}"
end

#load(path_or_stream, scopes = nil, &block) ⇒ Object

Core data reader method. Takes a given input source (either a stream or a file path) and attempts to load it. Returns true if successful, false if not. If false, there will be one or more errors explaining what went wrong.

Passed scopes are interpreted by each derived class as makes sense, but are generally used to target searching in multi-block formats such as Excel spreadsheets (sheet name/index) or HTML documents (CSS selectors, XPath selectors). If scopes is nil, all possible blocks will be checked.

Each data block is read in as raw data from the source and passed to the given Ruby block as an array of arrays. If the Ruby block returns true, processing stops and no further data blocks are checked.



# File 'lib/iron/import/data_reader.rb', line 142

def load(path_or_stream, scopes = nil, &block)
  # Figure out what we've been passed, and handle it
  if self.class.is_stream?(path_or_stream)
    # We have a stream (open file, upload, whatever)
    if supports_stream?
      # Stream loader defined, run it
      load_each(:stream, path_or_stream, scopes, &block)
    else
      # Write to temp file, as some of our readers only read physical files, annoyingly
      file = Tempfile.new(['importer', ".#{format}"])
      file.binmode
      begin
        file.write path_or_stream.read
        file.close
        load_each(:file, file.path, scopes, &block)
      ensure
        file.close
        file.unlink
      end
    end
    
  elsif path_or_stream.is_a?(String)
    # Assume it's a path
    is_path = File.exist?(path_or_stream) rescue false
    if is_path
      if supports_file?
        # We're all set, load up the given path
        load_each(:file, path_or_stream, scopes, &block)
      else
        # No file handler, so open the file and run the stream processor
        file = File.open(path_or_stream, 'rb')
        load_each(:stream, file, scopes, &block)
      end
    else
      add_error("Unable to locate source file with path #{path_or_stream.slice(0,200)}")
    end
    
  else
    add_error("Unable to load data source - not a file path or stream: #{path_or_stream.inspect}")
  end
  
  # Return our status
  !@importer.has_errors?
end
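
The temp-file fallback in #load (used when a reader cannot consume streams directly) can be isolated as a small standalone sketch. The helper name with_tempfile_copy is hypothetical; the body follows the same write, close, use, unlink sequence shown above.

```ruby
require 'tempfile'
require 'stringio'

# Hypothetical helper: copy a stream into a temp file with the given extension,
# yield the file's path, and always clean up afterwards.
def with_tempfile_copy(stream, extension)
  file = Tempfile.new(['importer', ".#{extension}"])
  file.binmode
  begin
    file.write(stream.read)
    file.close                # flush to disk before handing out the path
    yield file.path
  ensure
    file.close                # harmless if the file is already closed
    file.unlink
  end
end

contents = with_tempfile_copy(StringIO.new("a,b\n1,2\n"), 'csv') do |path|
  File.read(path)
end
```

Keeping the extension on the temp file matters for readers that infer the format from the file name.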

#load_each(mode, source, scopes, &block) ⇒ Object

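Initializes the source in the given mode (:file or :stream), then loads its raw data. If the mode is :file and the file does not exist, an error is added and the method returns early.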



# File 'lib/iron/import/data_reader.rb', line 188

def load_each(mode, source, scopes, &block)
  # Handle some common error cases centrally
  if mode == :file && !File.exist?(source)
    add_error("File not found: #{source}")
    return
  end
  
  # Let our derived classes open the file, etc. as they need
  if init_source(mode, source)
    # Once the source is set, run through each defined sheet, pass it to
    # our sheet loader, and have the sheet parse it out.
    load_raw(scopes, &block)
  end
end

#load_raw(scopes, &block) ⇒ Object

Override this method in derived classes to take the given sheet definition, find that sheet in the input source, and read out the raw (unparsed) rows as an array of arrays. Return false if the sheet cannot be loaded.



# File 'lib/iron/import/data_reader.rb', line 212

def load_raw(scopes, &block)
  raise "Unimplemented method #load_raw in data reader #{self.class.name}"
end
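
The #init_source and #load_raw pair forms a template-method contract: the base class drives the flow, and derived classes fill in the two hooks. The sketch below shows that shape with stand-in classes so it runs without the gem. SketchReader and HashReader are hypothetical, and the true/false return convention of HashReader#load_raw is an assumption rather than documented behavior.

```ruby
# Stand-in for the base reader contract; NOT the iron-import class.
class SketchReader
  # Same shape as #load_each: set up the source, then read the raw blocks
  def load_each(mode, source, scopes, &block)
    load_raw(scopes, &block) if init_source(mode, source)
  end

  def init_source(mode, source)
    raise "Unimplemented method #init_source in data reader #{self.class.name}"
  end

  def load_raw(scopes, &block)
    raise "Unimplemented method #load_raw in data reader #{self.class.name}"
  end
end

# Hypothetical subclass whose source is a hash of named blocks of rows
class HashReader < SketchReader
  def init_source(mode, source)
    return false unless source.is_a?(Hash)
    @blocks = source
    true
  end

  # Yield each selected block as an array of arrays; stop as soon as the
  # caller's block returns true
  def load_raw(scopes)
    names = scopes || @blocks.keys
    names.each do |name|
      rows = @blocks[name]
      next unless rows
      return true if yield(rows)
    end
    false
  end
end

sheets = {
  'summary' => [['note'], ['nothing here']],
  'orders'  => [['id', 'total'], ['1', '9.50']],
  'archive' => [['id', 'total'], ['0', '1.00']]
}
seen = []
found = HashReader.new.load_each(:stream, sheets, nil) do |rows|
  seen << rows.first
  rows.first == ['id', 'total'] # returning true stops the scan
end
```

Here the scan stops at the orders block, so the archive block is never passed to the caller.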

#parse_value(val, type) ⇒ Object

Provides default value parsing/coercion for all derived data readers. It handles edge cases such as converting ‘5.00’ to 5 when in integer mode. If you find your inputs aren’t being parsed correctly, add a custom #parse block on your Column definition.



# File 'lib/iron/import/data_reader.rb', line 219

def parse_value(val, type)
  return nil if val.nil? || val.to_s.strip == ''
  
  case type
  when :raw then
    val
    
  when :string then
    if val.is_a?(Float)
      # Sometimes float values come in for "integer" columns from Excel,
      # so if the user asks for a string, strip off that ".0" if present
      val.to_s.gsub(/\.0+$/, '')
    else
      # Strip whitespace and we're good to go
      val.to_s.strip
    end
    
  when :integer, :int then 
    if val.class < Numeric
      # If numeric, verify that there's no decimal places to worry about
      if (val.to_f % 1.0 == 0.0)
        val.to_i
      else
        nil
      end
    else 
      # Convert to string, strip off trailing decimal zeros
      val = val.to_s.strip.gsub(/\.0*$/, '')
      if val.integer?
        val.to_i
      else
        nil
      end
    end
    
  when :float then
    if val.class < Numeric
      val.to_f
    else 
      # Clean up then verify it matches a valid float format & convert
      val = val.to_s.strip
      if val.match(/\A-?[0-9]+(?:\.[0-9]+)?\z/)
        val.to_f
      else
        nil
      end
    end
    
  when :cents then
    if val.is_a?(String)
      val = val.gsub(/\s*\$\s*/, '')
    end
    intval = parse_value(val, :integer)
    if !val.is_a?(Float) && intval
      intval * 100
    else
      floatval = parse_value(val, :float)
      if floatval
        (floatval * 100).round
      else
        nil
      end
    end
    
  when :date then
    # Pull out the date part of the string and convert
    date_str = val.to_s.extract(/[0-9]+[\-\/][0-9]+[\-\/][0-9]+/)
    date_str.to_date rescue nil
  
  when :bool then
    val_str = parse_value(val, :string).to_s.downcase
    if ['true','yes','y','t','1'].include?(val_str)
      return true
    elsif ['false','no','n','f','0'].include?(val_str)
      return false
    else
      nil
    end
  
  else
    raise "Unknown column type #{type.inspect} - unimplemented?"
  end
end
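
The integer, float and cents branches above can be traced with a standalone sketch. The function names to_integer, to_float and to_cents are hypothetical. String#integer? is not core Ruby, so the sketch assumes it accepts an optional minus sign followed by digits.

```ruby
# Hypothetical standalone version of the :integer branch
def to_integer(val)
  if val.is_a?(Numeric)
    # Only whole numbers convert; 5.5 is rejected rather than truncated
    (val.to_f % 1.0).zero? ? val.to_i : nil
  else
    str = val.to_s.strip.gsub(/\.0*$/, '') # "5.00" becomes "5"
    str.match?(/\A-?\d+\z/) ? str.to_i : nil # assumed meaning of String#integer?
  end
end

# Hypothetical standalone version of the :float branch
def to_float(val)
  return val.to_f if val.is_a?(Numeric)
  str = val.to_s.strip
  str.match?(/\A-?[0-9]+(?:\.[0-9]+)?\z/) ? str.to_f : nil
end

# Hypothetical standalone version of the :cents branch
def to_cents(val)
  val = val.gsub(/\s*\$\s*/, '') if val.is_a?(String) # drop dollar signs
  int = to_integer(val)
  return int * 100 if !val.is_a?(Float) && int
  float = to_float(val)
  float ? (float * 100).round : nil
end

examples = {
  int_from_string: to_integer('5.00'),
  int_from_float:  to_integer(5.5),
  cents_whole:     to_cents('$5'),
  cents_fraction:  to_cents('$12.34')
}
```

Rounding after multiplying by 100 absorbs floating-point noise, which is why the fractional path goes through round instead of to_i.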

#supports?(mode) ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/iron/import/data_reader.rb', line 109

def supports?(mode)
  @supports.include?(mode)
end

#supports_file! ⇒ Object



# File 'lib/iron/import/data_reader.rb', line 117

def supports_file!
  @supports << :file
end

#supports_file? ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/iron/import/data_reader.rb', line 121

def supports_file?
  supports?(:file)
end

#supports_stream! ⇒ Object



# File 'lib/iron/import/data_reader.rb', line 113

def supports_stream!
  @supports << :stream
end

#supports_stream? ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/iron/import/data_reader.rb', line 125

def supports_stream?
  supports?(:stream)
end
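
The support flags determine which path #load takes for a given input. As a rough illustration, the sketch below mirrors the flag methods in a stand-in class and adds a plan_for method that reports the mode #load would use and whether the input is converted first. ModeFlags and plan_for are hypothetical names, not part of the library.

```ruby
# Hypothetical mirror of the capability flags, independent of iron-import
class ModeFlags
  def initialize
    @supports = []
  end

  def supports_file!
    @supports << :file
  end

  def supports_stream!
    @supports << :stream
  end

  def supports?(mode)
    @supports.include?(mode)
  end

  # Returns [mode, how]: the mode #load would pass to #load_each, and whether
  # the input is used directly or converted first
  def plan_for(input_kind)
    case input_kind
    when :stream
      supports?(:stream) ? [:stream, :direct] : [:file, :via_tempfile]
    when :path
      supports?(:file) ? [:file, :direct] : [:stream, :via_open_file]
    end
  end
end

file_only = ModeFlags.new
file_only.supports_file!
plan = file_only.plan_for(:stream)
```

A reader that declares only file support therefore still accepts streams, at the cost of a temp-file copy.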