Class: ODSExtractor::SAXHandler

Inherits:
Nokogiri::XML::SAX::Document
  • Object
Defined in:
lib/ods_extractor/sax_handler.rb

Overview

A note for readers who rarely use one: an XML SAX parser processes the document as a stream of events instead of building a DOM tree. You can build a DOM tree on top of a SAX parser, but not the other way around. SAX parsers are useful for very large documents, and the “content.xml” inside an ODS archive can be very large indeed. To use a SAX parser we write a handler, and in that handler we capture the elements we care about. The OASIS OpenDocument schema wraps each sheet in a “table:table” element, each row in a “table:table-row” element, and each cell in a “table:table-cell” element. Anything further down we can treat as text and capture as-is (there are wrapper tags for paragraphs, but they do not matter for our purposes).
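To make the callback order concrete, here is a small sketch that replays a hand-written event list into a toy handler. No XML parser is involved: the `EVENTS` list and the `NestingTracer` class are illustrative assumptions, not part of ODSExtractor, and the list only approximates what a SAX parser would report for a sheet with one row and two cells.

```ruby
# Hypothetical, hand-written event list: roughly what a SAX parser would
# report for a sheet "Sheet1" with one row holding the cells "a" and "b".
EVENTS = [
  [:start_element, "table:table", [["table:name", "Sheet1"]]],
  [:start_element, "table:table-row", []],
  [:start_element, "table:table-cell", []],
  [:start_element, "text:p", []],
  [:characters, "a"],
  [:end_element, "text:p"],
  [:end_element, "table:table-cell"],
  [:start_element, "table:table-cell", []],
  [:start_element, "text:p", []],
  [:characters, "b"],
  [:end_element, "text:p"],
  [:end_element, "table:table-cell"],
  [:end_element, "table:table-row"],
  [:end_element, "table:table"],
].freeze

# Toy handler (not the real SAXHandler): it only records element nesting.
class NestingTracer
  attr_reader :paths

  def initialize
    @stack = []
    @paths = []
  end

  def start_element(name, _attributes = [])
    @stack.push(name)
    @paths << @stack.join(" > ")
  end

  def end_element(_name)
    @stack.pop
  end

  def characters(_string)
    # Text is ignored here; the real handler buffers it per cell.
  end
end

tracer = NestingTracer.new
EVENTS.each { |method_name, *args| tracer.public_send(method_name, *args) }
puts tracer.paths.uniq
```

The handler never sees the whole document at once. It only reacts to one event at a time, which is why memory use stays flat regardless of sheet size.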

Constant Summary

MAX_CELLS_PER_ROW = 2**14
MAX_ROWS_PER_SHEET = 2**20

Instance Method Summary

Constructor Details

#initialize(output_handler) ⇒ SAXHandler

Returns a new instance of SAXHandler.



# File 'lib/ods_extractor/sax_handler.rb', line 14

def initialize(output_handler)
  @out = output_handler
end
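Judging only from the calls made in the methods on this page, the output handler needs to respond to `start_sheet(name)`, `write_row(row)` and `end_sheet`. A minimal sketch of such an object follows; the `ArrayOutput` class is hypothetical and simply keeps rows in memory, whereas a real handler would more likely write CSV.

```ruby
# Hypothetical output handler that keeps everything in memory.
# It implements the three methods that SAXHandler calls on @out.
class ArrayOutput
  attr_reader :sheets

  def initialize
    @sheets = {}
  end

  def start_sheet(name)
    @current = (@sheets[name] = [])
  end

  def write_row(row)
    # Repeated rows arrive as the same Array object, so store a copy.
    @current << row.dup
  end

  def end_sheet
    @current = nil
  end
end

out = ArrayOutput.new
out.start_sheet("Sheet1")
out.write_row(["a", "b"])
out.write_row(["a", "b"])
out.end_sheet
p out.sheets
```

An instance like `out` could then be passed to `ODSExtractor::SAXHandler.new(out)`.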

Instance Method Details

#characters(string) ⇒ Object



# File 'lib/ods_extractor/sax_handler.rb', line 53

def characters(string)
  # @charbuf is non-nil only while we are inside a "table:table-cell" element, which lets us
  # skip any character data that appears outside of cells for whatever reason
  @charbuf << string if @charbuf
end
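Appending with `<<` matters because a SAX parser may deliver the text of one node in several chunks. The following stand-alone sketch imitates the buffering with a local variable and a lambda; it is not the real class, and the chunk boundaries are made up for illustration.

```ruby
# Stand-alone imitation of the buffering done by #characters.
charbuf = nil
append = ->(chunk) { charbuf << chunk if charbuf }

append.call("\n  ")                  # outside a cell: silently ignored
charbuf = String.new(capacity: 512)  # what start_element does for a cell
append.call("Hello, ")               # one text node may arrive
append.call("world\n")               # as several chunks
cell_value = charbuf.strip           # what end_element stores in the row
charbuf = nil                        # leaving the cell disables buffering

p cell_value
```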

#end_element(name) ⇒ Object



# File 'lib/ods_extractor/sax_handler.rb', line 59

def end_element(name)
  case name
    when "table:table"
      @out.end_sheet
    when "table:table-row"
      @rows_output_so_far += @row_repeats_n_times
      @row_repeats_n_times.times do
        @out.write_row(@row)
      end
    when "table:table-cell"
      @cell_repeats_n_times.times { @row << @charbuf.strip } # Have to strip due to XML having sometimes-significant whitespace
      @charbuf = nil
  end
end

#start_element(name, attributes = []) ⇒ Object



# File 'lib/ods_extractor/sax_handler.rb', line 18

def start_element(name, attributes = [])
  case name
    when "table:table"
      sheet_name = attributes.to_h.fetch("table:name")
      @out.start_sheet(sheet_name)
      @rows_output_so_far = 0
    when "table:table-row"
      # Here be dragons: https://stackoverflow.com/a/2741709/153886
      # Both rows and cells are recorded _sparsely_ in the XML; the cell branch
      # below handles the same situation for cells.
      @row_repeats_n_times = attributes.to_h.fetch("table:number-rows-repeated", "1").to_i
      if @rows_output_so_far + @row_repeats_n_times >= MAX_ROWS_PER_SHEET
        # An ODS sheet contains at most 2**20 (1048576) rows. After the last
        # row with data, ODS will helpfully emit a row element that repeats
        # "that many" empty rows until the end of the sheet. These rows are useless
        # for us, but if we repeated them literally we would still output them to
        # the CSV. A repeat count that reaches the limit lets us detect this padding.
        @row_repeats_n_times = 0
      end
      # and prepare an empty row
      @row = []
    when "table:table-cell"
      @cell_repeats_n_times = attributes.to_h.fetch("table:number-columns-repeated", "1").to_i
      if @row.length + @cell_repeats_n_times >= MAX_CELLS_PER_ROW
        # The same applies to cells: an ODS row contains at most 2**14 (16384)
        # columns. After the last cell with data, ODS will helpfully emit a cell
        # element that repeats "that many" empty cells until the end of the row.
        # We can thus detect the trailing padding by a number-columns-repeated that
        # reaches the limit, and simply omit that cell: it is almost certainly empty
        @cell_repeats_n_times = 0
      end
      @charbuf = String.new(capacity: 512) # Create a string which is unlikely to be resized all the time
  end
end
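The repeat handling for cells can be tried in isolation. The sketch below uses a hypothetical helper, `effective_repeats`, that mirrors the comparison made in `start_element`; the `cells` list stands in for parsed `table:number-columns-repeated` values and is made up for illustration.

```ruby
MAX_CELLS_PER_ROW = 2**14 # same value as the constant documented on this page

# Hypothetical helper mirroring the check in start_element: a run of
# repeats that would reach the limit is treated as padding and dropped.
def effective_repeats(already_emitted, repeats, limit)
  already_emitted + repeats >= limit ? 0 : repeats
end

row = []
# [cell text, number-columns-repeated] pairs for one row: "a", three
# empty cells, "b", then empty padding up to the 16384-column limit.
cells = [["a", 1], ["", 3], ["b", 1], ["", 16_379]]
cells.each do |text, repeats|
  effective_repeats(row.length, repeats, MAX_CELLS_PER_ROW).times { row << text }
end

p row # => ["a", "", "", "", "b"]
```

The short run of three empty cells is expanded, while the final run is dropped because 5 + 16379 reaches the limit. One consequence of this check is that any run reaching the limit is dropped, whether or not it is actually empty.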