Class: IiifPrint::TextExtraction::HOCRReader

Inherits:
Object
Defined in:
lib/iiif_print/text_extraction/hocr_reader.rb

Overview

Class to obtain plain text and JSON word-coordinates from hOCR source

- Coordinates are in pixel units, unlike ALTO, whose measurement units may require scaling
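
To illustrate what pixel coordinates look like in hOCR, here is a minimal sketch. hOCR stores each element's bounding box in its `title` attribute as `bbox x0 y0 x1 y1`. The helper name `parse_bbox` and the returned hash layout are assumptions made for illustration; they are not taken from HOCRDocStream, whose internals are not shown on this page.

```ruby
# Hypothetical helper (not part of IiifPrint): pull the pixel bounding box
# out of an hOCR `title` attribute such as "bbox 10 20 110 45; x_wconf 93".
def parse_bbox(title)
  match = title.match(/bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)/)
  return nil unless match

  x0, y0, x1, y1 = match.captures.map(&:to_i)
  # Convert the two corner points to x, y, width, height, all in pixels.
  { x: x0, y: y0, width: x1 - x0, height: y1 - y0 }
end

box = parse_bbox('bbox 10 20 110 45; x_wconf 93')
```

For the sample title above, `box` holds x 10, y 20, width 100 and height 25. No unit conversion is needed, because the values are already pixels.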

Defined Under Namespace

Classes: HOCRDocStream

Constructor Details

#initialize(html) ⇒ HOCRReader

Construct with either a path to an hOCR file or hOCR/HTML source, and process the document.

Parameters:

  • html (String)

    either path to an hOCR file or the HTML source



# File 'lib/iiif_print/text_extraction/hocr_reader.rb', line 148

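def initialize(html)
  # Use the argument directly if it looks like markup; otherwise read it as a file path.
  @source = isxml?(html) ? html : File.read(html)
  # SAX handler that collects words and page dimensions while parsing.
  @doc_stream = HOCRDocStream.new
  parser = Nokogiri::HTML::SAX::Parser.new(doc_stream)
  parser.parse(@source)
end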

Instance Attribute Details

#doc_stream ⇒ Object

Returns the value of attribute doc_stream.



# File 'lib/iiif_print/text_extraction/hocr_reader.rb', line 11

def doc_stream
  @doc_stream
end

#source ⇒ Object

Returns the value of attribute source.



# File 'lib/iiif_print/text_extraction/hocr_reader.rb', line 11

def source
  @source
end

Instance Method Details

#isxml?(xml) ⇒ true, false

Determine whether the source parameter is a file path or XML/HTML markup

Parameters:

  • xml (String)

    either a path to an XML file or the XML source itself

Returns:

  • (true, false)

    true if the value appears to be XML/HTML markup rather than a path



# File 'lib/iiif_print/text_extraction/hocr_reader.rb', line 159

def isxml?(xml)
  xml.lstrip.start_with?('<')
end
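
The check is a simple heuristic: any string whose first non-whitespace character is `<` is treated as markup, and everything else is treated as a path. The sketch below copies that one-line test into a standalone method so its edge cases can be tried outside the class. The name `looks_like_markup?` is hypothetical and used only here.

```ruby
# Standalone copy of the one-line check, renamed to make clear that it is
# an illustration rather than the class's own method.
def looks_like_markup?(value)
  value.lstrip.start_with?('<')
end

looks_like_markup?('<html><body></body></html>')  # => true
looks_like_markup?("\n  <?xml version=\"1.0\"?>")  # => true, leading whitespace is ignored
looks_like_markup?('/tmp/page1.hocr')              # => false, treated as a path
looks_like_markup?('')                             # => false, treated as a path
```

Because the constructor passes anything that fails this test to `File.read`, an empty string raises `Errno::ENOENT`. A `nil` argument fails earlier with `NoMethodError`, since `lstrip` is called on it.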

#json ⇒ String

Output flattened word coordinates as JSON

Returns:

  • (String)

    JSON serialization of flattened word coordinates



# File 'lib/iiif_print/text_extraction/hocr_reader.rb', line 166

def json
  words = @doc_stream.words
  IiifPrint::TextExtraction::WordCoordsBuilder.json_coordinates_for(
    words: words,
    width: @doc_stream.width,
    height: @doc_stream.height
  )
end
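
The exact payload produced by `WordCoordsBuilder.json_coordinates_for` is not shown on this page. As a rough sketch of what "flattened" could mean, the example below assumes each word record carries `:word` and `:coordinates` keys and groups every occurrence of a word under a single key. The record layout, the `coords` key, and the method name `flatten_word_coordinates` are all assumptions for illustration, not the gem's confirmed output.

```ruby
require 'json'

# Hypothetical word records; in the class they would come from @doc_stream.words.
words = [
  { word: 'the',   coordinates: [10, 20, 30, 12] },
  { word: 'quick', coordinates: [45, 20, 52, 12] },
  { word: 'the',   coordinates: [10, 40, 30, 12] }
]

# Assumed "flattened" shape: one key per distinct word, mapping to every
# coordinate box where that word occurs, alongside the page dimensions.
def flatten_word_coordinates(words, width:, height:)
  coords = words.each_with_object({}) do |w, acc|
    (acc[w[:word]] ||= []) << w[:coordinates]
  end
  JSON.generate({ width: width, height: height, coords: coords })
end

payload = flatten_word_coordinates(words, width: 800, height: 1200)
```

Grouping by word text keeps lookups cheap for a client that highlights search hits: it can fetch all boxes for a term with one key access instead of scanning the full word list.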