Class: SimpleXlsxReader::Document::Mapper

Inherits:
Struct
  • Object
show all
Defined in:
lib/simple_xlsx_reader.rb

Overview

For internal use; translates source xml to Sheet objects.

Constant Summary collapse

DATE_SYSTEM_1900 =
Date.new(1899, 12, 30)
DATE_SYSTEM_1904 =
Date.new(1904, 1, 1)
NumFmtMap =

Map of non-custom numFmtId to casting symbol

{
  0  => :string,         # General
  1  => :fixnum,         # 0
  2  => :float,          # 0.00
  3  => :fixnum,         # #,##0
  4  => :float,          # #,##0.00
  5  => :unsupported,    # $#,##0_);($#,##0)
  6  => :unsupported,    # $#,##0_);[Red]($#,##0)
  7  => :unsupported,    # $#,##0.00_);($#,##0.00)
  8  => :unsupported,    # $#,##0.00_);[Red]($#,##0.00)
  9  => :percentage,     # 0%
  10 => :percentage,     # 0.00%
  11 => :bignum,         # 0.00E+00
  12 => :unsupported,    # # ?/?
  13 => :unsupported,    # # ??/??
  14 => :date,           # mm-dd-yy
  15 => :date,           # d-mmm-yy
  16 => :date,           # d-mmm
  17 => :date,           # mmm-yy
  18 => :time,           # h:mm AM/PM
  19 => :time,           # h:mm:ss AM/PM
  20 => :time,           # h:mm
  21 => :time,           # h:mm:ss
  22 => :date_time,      # m/d/yy h:mm
  37 => :unsupported,    # #,##0 ;(#,##0)
  38 => :unsupported,    # #,##0 ;[Red](#,##0)
  39 => :unsupported,    # #,##0.00;(#,##0.00)
  40 => :unsupported,    # #,##0.00;[Red](#,##0.00)
  45 => :time,           # mm:ss
  46 => :time,           # [h]:mm:ss
  47 => :time,           # mmss.0
  48 => :bignum,         # ##0.0E+0
  49 => :unsupported     # @
}

Instance Attribute Summary collapse

Class Method Summary collapse

Instance Method Summary collapse

Instance Attribute Details

#xmlObject

Returns the value of attribute xml

Returns:

  • (Object)

    the current value of xml



103
104
105
# File 'lib/simple_xlsx_reader.rb', line 103

def xml
  @xml
end

Class Method Details

.cast(value, type, style, options = {}) ⇒ Object

The heart of typecasting. The ruby type is determined either explicitly from the cell xml or implicitly from the cell style, and this method expects that work to have been done already. This, then, takes the type we determined it to be and casts the cell value to that type.

types:

  • s: shared string (see #shared_string)

  • n: number (cast to a float)

  • b: boolean

  • str: string

  • inlineStr: string

  • ruby symbol: for when type has been determined by style

options:

  • shared_strings: needed for ‘s’ (shared string) type



322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
# File 'lib/simple_xlsx_reader.rb', line 322

def self.cast(value, type, style, options = {})
  return nil if value.nil? || value.empty?

  # Sometimes the type is dictated by the style alone
  if type.nil? ||
    (type == 'n' && [:date, :time, :date_time].include?(style))
    type = style
  end

  case type

  ##
  # There are few built-in types
  ##

  when 's' # shared string
    options[:shared_strings][value.to_i]
  when 'n' # number
    value.to_f
  when 'b'
    value.to_i == 1
  when 'str'
    value
  when 'inlineStr'
    value

  ##
  # Type can also be determined by a style,
  # detected earlier and cast here by its standardized symbol
  ##

  when :string, :unsupported
    value
  when :fixnum
    value.to_i
  when :float
    value.to_f
  when :percentage
    value.to_f / 100
  # the trickiest. note that  all these formats can vary on
  # whether they actually contain a date, time, or datetime.
  when :date, :time, :date_time
    value = value.to_f
    days_since_date_system_start = value.to_i
    fraction_of_24 = value - days_since_date_system_start

    # http://stackoverflow.com/questions/10559767/how-to-convert-ms-excel-date-from-float-to-date-format-in-ruby
    date = options.fetch(:base_date, DATE_SYSTEM_1900) + days_since_date_system_start

    if fraction_of_24 > 0 # there is a time associated
      seconds = (fraction_of_24 * 86400).round
      return Time.utc(date.year, date.month, date.day) + seconds
    else
      return date
    end
  when :bignum
    if defined?(BigDecimal)
      BigDecimal.new(value)
    else
      value.to_f
    end

  ##
  # Beats me
  ##

  else
    value
  end
end

Instance Method Details

#base_dateObject

Returns the base_date from which to calculate dates. Defaults to 1900 (minus two days due to excel quirk), but use 1904 if it’s set in the Workbook’s workbookPr. msdn.microsoft.com/en-us/library/ff530155(v=office.12).aspx



397
398
399
400
401
402
403
404
405
406
# File 'lib/simple_xlsx_reader.rb', line 397

def base_date
  @base_date ||=
    begin
      return DATE_SYSTEM_1900 if xml.workbook == nil
      xml.workbook.xpath("//workbook/workbookPr[@date1904]").each do |workbookPr|
        return DATE_SYSTEM_1904 if workbookPr["date1904"] =~ /true|1/i
      end
      DATE_SYSTEM_1900
    end
end

#column_letter_to_number(column_letter) ⇒ Object

formula fits an exponential factorial function of the form: ‘A’ = 1 ‘B’ = 2 ‘Z’ = 26 ‘AA’ = 26 * 1 + 1 ‘AZ’ = 26 * 1 + 26 ‘BA’ = 26 * 2 + 1 ‘ZA’ = 26 * 26 + 1 ‘ZZ’ = 26 * 26 + 26 ‘AAA’ = 26 * 26 * 1 + 26 * 1 + 1 ‘AAZ’ = 26 * 26 * 1 + 26 * 1 + 26 ‘ABA’ = 26 * 26 * 1 + 26 * 2 + 1 ‘BZA’ = 26 * 26 * 2 + 26 * 26 + 1



221
222
223
224
225
226
227
228
229
# File 'lib/simple_xlsx_reader.rb', line 221

def column_letter_to_number(column_letter)
  pow = column_letter.length - 1
  result = 0
  column_letter.each_byte do |b|
    result += 26**pow * (b - 64)
    pow -= 1
  end
  result
end

#custom_style_typesObject

Map of (numFmtId >= 164) (custom styles) to our best guess at the type ex. => :date_time



277
278
279
280
281
282
283
284
285
286
287
# File 'lib/simple_xlsx_reader.rb', line 277

def custom_style_types
  @custom_style_types ||=
    xml.styles.xpath('/styleSheet/numFmts/numFmt').
    inject({}) do |acc, xstyle|

    acc[xstyle.attributes['numFmtId'].value.to_i] =
      determine_custom_style_type(xstyle.attributes['formatCode'].value)

    acc
  end
end

#determine_custom_style_type(string) ⇒ Object

This is the least deterministic part of reading xlsx files. Due to custom styles, you can’t know for sure when a date is a date other than looking at its format and gessing. It’s not impossible to guess right, though.

stackoverflow.com/questions/4948998/determining-if-an-xlsx-cell-is-date-formatted-for-excel-2007-spreadsheets



295
296
297
298
299
300
301
302
303
# File 'lib/simple_xlsx_reader.rb', line 295

def determine_custom_style_type(string)
  return :float if string[0] == '_'
  return :float if string[0] == ' 0'

  # Looks for one of ymdhis outside of meta-stuff like [Red]
  return :date_time if string =~ /(^|\])[^\[]*[ymdhis]/i

  return :unsupported
end

#last_cell_label(xsheet) ⇒ Object

Returns the last column name, ex. ‘E’

Note that excel writes a ‘/worksheet/dimension’ node we can get the last cell from, but some libs (ex. simple_xlsx_writer) don’t record this. In that case, we assume the data is of uniform column length and check the column name of the last header row. Obviously this isn’t the most robust strategy, but it likely fits 99% of use cases considering it’s not a problem with actual excel docs.



191
192
193
194
195
196
197
198
199
200
# File 'lib/simple_xlsx_reader.rb', line 191

def last_cell_label(xsheet)
  dimension = xsheet.at_xpath('/worksheet/dimension')
  if dimension
    col = dimension.attributes['ref'].value.match(/:([A-Z]+[0-9]+)/)
    col ? col.captures.first : 'A1'
  else
    last = xsheet.at_xpath("/worksheet/sheetData/row[last()]/c[last()]")
    last ? last.attributes['r'].value.match(/([A-Z]+[0-9]+)/).captures.first : 'A1'
  end
end

#load_sheetsObject



107
108
109
110
111
# File 'lib/simple_xlsx_reader.rb', line 107

def load_sheets
  sheet_toc.each_with_index.map do |(sheet_name, _sheet_number), i|
    parse_sheet(sheet_name, xml.sheets[i])  # sheet_number is *not* the index into xml.sheets
  end
end

#num_fmt_id(xstyle) ⇒ Object

returns the numFmtId value if it’s available



252
253
254
255
256
257
258
# File 'lib/simple_xlsx_reader.rb', line 252

def num_fmt_id(xstyle)
  if xstyle.attributes['numFmtId']
    xstyle.attributes['numFmtId'].value
  else
    nil
  end
end

#parse_sheet(sheet_name, xsheet) ⇒ Object



125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
# File 'lib/simple_xlsx_reader.rb', line 125

def parse_sheet(sheet_name, xsheet)
  sheet = Sheet.new(sheet_name)
  sheet_width, sheet_height = *sheet_dimensions(xsheet)

  sheet.rows = Array.new(sheet_height) { Array.new(sheet_width) }
  xsheet.xpath("/worksheet/sheetData/row/c").each do |xcell|
    column, row = *xcell.attr('r').match(/([A-Z]+)([0-9]+)/).captures
    col_idx = column_letter_to_number(column) - 1
    row_idx = row.to_i - 1

    type  = xcell.attributes['t'] &&
            xcell.attributes['t'].value
    style = xcell.attributes['s'] &&
            style_types[xcell.attributes['s'].value.to_i]

    # This is the main performance bottleneck. Using just 'xcell.text'
    # would be ideal, and makes parsing super-fast. However, there's
    # other junk in the cell, formula references in particular,
    # so we really do have to look for specific value nodes.
    # Maybe there is a really clever way to use xcell.text and parse out
    # the correct value, but I can't think of one, or an alternative
    # strategy.
    #
    # And yes, this really is faster than using xcell.at_xpath(...),
    # by about 60%. Odd.
    xvalue = type == 'inlineStr' ?
      (xis = xcell.children.find {|c| c.name == 'is'}) && xis.children.find {|c| c.name == 't'} :
      xcell.children.find {|c| c.name == 'v'}

    cell = begin
      self.class.cast(xvalue && xvalue.text.strip, type, style,
                      :shared_strings => shared_strings,
                      :base_date => base_date)
    rescue => e
      if !SimpleXlsxReader.configuration.catch_cell_load_errors
        error = CellLoadError.new(
          "Row #{row_idx}, Col #{col_idx}: #{e.message}")
        error.set_backtrace(e.backtrace)
        raise error
      else
        sheet.load_errors[[row_idx, col_idx]] = e.message

        xcell.text.strip
      end
    end

    # This shouldn't be necessary, but just in case, we'll create
    # the row so we don't blow up. This means any null rows in between
    # will be null instead of [null, null, ...]
    sheet.rows[row_idx] ||= Array.new(sheet_width)

    sheet.rows[row_idx][col_idx] = cell
  end

  sheet
end

#shared_stringsObject

For performance reasons, excel uses an optional SpreadsheetML feature that puts all strings in a separate xml file, and then references them by their index in that file.

msdn.microsoft.com/en-us/library/office/gg278314.aspx



449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
# File 'lib/simple_xlsx_reader.rb', line 449

def shared_strings
  @shared_strings ||= begin
    if xml.shared_strings
      xml.shared_strings.xpath('/sst/si').map do |xsst|
        # a shared string can be a single value...
        sst = xsst.at_xpath('t/text()')
        sst = sst.text if sst
        # ... or a composite of seperately styled words/characters
        sst ||= xsst.xpath('r/t/text()').map(&:text).join
      end
    else
      []
    end
  end
end

#sheet_dimensions(xsheet) ⇒ Object

Returns dimensions (1-indexed)



203
204
205
206
# File 'lib/simple_xlsx_reader.rb', line 203

def sheet_dimensions(xsheet)
  column, row = *last_cell_label(xsheet).match(/([A-Z]+)([0-9]+)/).captures
  [column_letter_to_number(column), row.to_i]
end

#sheet_tocObject

Table of contents for the sheets, ex. => 0, …



114
115
116
117
118
119
120
121
122
123
# File 'lib/simple_xlsx_reader.rb', line 114

def sheet_toc
  xml.workbook.xpath('/workbook/sheets/sheet').
    inject({}) do |acc, sheet|

    acc[sheet.attributes['name'].value] =
      sheet.attributes['sheetId'].value.to_i - 1 # keep things 0-indexed

    acc
  end
end

#style_type_by_num_fmt_id(id) ⇒ Object

Finds the type we think a style is; For example, fmtId 14 is a date style, so this would return :date.

Note, custom styles usually (are supposed to?) have a numFmtId >= 164, but in practice can sometimes be simply out of the usual “Any Language” id range that goes up to 49. For example, I have seen a numFmtId of 59 specified as a date. In Thai, 59 is a number format, so this seems like a bad idea, but we try to be flexible and just go with it.



268
269
270
271
272
273
# File 'lib/simple_xlsx_reader.rb', line 268

def style_type_by_num_fmt_id(id)
  return nil if id.nil?

  id = id.to_i
  NumFmtMap[id] || custom_style_types[id]
end

#style_typesObject

Excel doesn’t record types for some cells, only its display style, so we have to back out the type from that style.

Some of these styles can be determined from a known set (see NumFmtMap), while others are ‘custom’ and we have to make a best guess.

This is the array of types corresponding to the styles a spreadsheet uses, and includes both the known style types and the custom styles.

Note that the xml sheet cells that use this don’t reference the numFmtId, but instead the array index of a style in the stored list of only the styles used in the spreadsheet (which can be either known or custom). Hence this style types array, rather than a map of numFmtId to type.



245
246
247
248
249
# File 'lib/simple_xlsx_reader.rb', line 245

def style_types
  @style_types ||=
      xml.styles.xpath('/styleSheet/cellXfs/xf').map {|xstyle|
        style_type_by_num_fmt_id(num_fmt_id(xstyle))}
end