Class: IiifPrint::TextExtraction::HOCRReader
- Inherits: Object
- Defined in: lib/iiif_print/text_extraction/hocr_reader.rb
Overview
Reads hOCR source and produces plain text and JSON word coordinates.
- Coordinates are in pixel (px) units, unlike ALTO, whose coordinates may need scaling.
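To make the pixel-unit point concrete: hOCR records each element's box in its `title` attribute as `bbox x0 y0 x1 y1`, in image pixels. The sketch below parses that property with plain Ruby. The helper name `bbox_from_title` and the returned x/y/width/height hash are illustrative assumptions, not part of IiifPrint.

```ruby
# Hypothetical helper for illustration; not part of IiifPrint.
# Extracts the pixel bounding box from an hOCR `title` attribute value.
def bbox_from_title(title)
  match = title.match(/bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)/)
  return nil unless match

  x0, y0, x1, y1 = match.captures.map(&:to_i)
  # Convert the corner pair into an origin plus size (an assumed convention).
  { x: x0, y: y0, width: x1 - x0, height: y1 - y0 }
end

box = bbox_from_title('bbox 10 20 110 60; x_wconf 93')
puts format('x=%d y=%d w=%d h=%d', box[:x], box[:y], box[:width], box[:height])
# prints "x=10 y=20 w=100 h=40"
```

Because the values are already pixels, no unit conversion is applied here.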
Defined Under Namespace
Classes: HOCRDocStream
Instance Attribute Summary
- #doc_stream ⇒ Object
  Returns the value of attribute doc_stream.
- #source ⇒ Object
  Returns the value of attribute source.
Instance Method Summary
- #initialize(html) ⇒ HOCRReader (constructor)
  Construct with either a file path or an HTML String.
- #isxml?(xml) ⇒ true, false
  Determine whether the source parameter is a path or XML/HTML markup.
- #json ⇒ String
  Output flattened word coordinates as JSON.
Constructor Details
#initialize(html) ⇒ HOCRReader
Construct with either a file path or an HTML String.
    # File 'lib/iiif_print/text_extraction/hocr_reader.rb', line 148
    def initialize(html)
      @source = isxml?(html) ? html : File.read(html)
      @doc_stream = HOCRDocStream.new
      parser = Nokogiri::HTML::SAX::Parser.new(doc_stream)
      parser.parse(@source)
    end
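The path-or-markup dispatch in the constructor can be tried in isolation. The sketch below reproduces only the first line of `initialize` using the standard library; `load_hocr_source` and `HOCR_SAMPLE` are hypothetical names introduced for illustration, and the Nokogiri SAX parsing step is deliberately left out.

```ruby
require 'tempfile'

# Hypothetical helper mirroring the constructor's first line; not part of IiifPrint.
def load_hocr_source(path_or_markup)
  markup = path_or_markup.lstrip.start_with?('<')
  markup ? path_or_markup : File.read(path_or_markup)
end

HOCR_SAMPLE = "<html><body><span class='ocrx_word' title='bbox 10 20 110 60'>Hello</span></body></html>"

# Markup passed directly is used as-is.
from_string = load_hocr_source(HOCR_SAMPLE)

# A path is read from disk; the block's return value is the file content.
from_file = Tempfile.create(['sample', '.hocr']) do |file|
  file.write(HOCR_SAMPLE)
  file.flush
  load_hocr_source(file.path)
end

puts from_string == from_file
# prints "true"
```

Either way the reader ends up holding the same markup in `source`, so callers do not need to read the file themselves.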
Instance Attribute Details
#doc_stream ⇒ Object
Returns the value of attribute doc_stream.
    # File 'lib/iiif_print/text_extraction/hocr_reader.rb', line 11
    def doc_stream
      @doc_stream
    end
#source ⇒ Object
Returns the value of attribute source.
    # File 'lib/iiif_print/text_extraction/hocr_reader.rb', line 11
    def source
      @source
    end
Instance Method Details
#isxml?(xml) ⇒ true, false
Determine whether the source parameter is a path or XML/HTML markup.
    # File 'lib/iiif_print/text_extraction/hocr_reader.rb', line 159
    def isxml?(xml)
      xml.lstrip.start_with?('<')
    end
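The check is a simple prefix heuristic, so it helps to see which inputs fall on each side. The snippet below is a standalone copy of the method body shown above, exercised with a few made-up inputs; the sample strings and the path are illustrative only.

```ruby
# Standalone copy of the check, for illustration outside the class.
def isxml?(xml)
  xml.lstrip.start_with?('<')
end

puts isxml?('<html></html>')              # prints "true"
puts isxml?("  \n<?xml version='1.0'?>")  # prints "true": leading whitespace is ignored
puts isxml?('/tmp/page.hocr')             # prints "false": treated as a path
# A leading byte-order mark is not whitespace to String#lstrip,
# so this markup would be treated as a path.
puts isxml?("\uFEFF<html></html>")        # prints "false"
```

Callers holding markup that may start with a byte-order mark might want to strip it before constructing the reader.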
#json ⇒ String
Output flattened word coordinates as JSON.
    # File 'lib/iiif_print/text_extraction/hocr_reader.rb', line 166
    def json
      words = @doc_stream.words
      IiifPrint::TextExtraction::WordCoordsBuilder.json_coordinates_for(
        words: words,
        width: @doc_stream.width,
        height: @doc_stream.height
      )
    end
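This page does not show what `WordCoordsBuilder.json_coordinates_for` returns, so the following is only a sketch of what "flattened" word coordinates could look like. It assumes each word is a hash with `:word` and `:coordinates` keys and that the payload groups boxes by word text alongside the page width and height. The function name `flattened_word_json` and both shapes are assumptions, not the library's confirmed API.

```ruby
require 'json'

# Hypothetical stand-in for the builder call; the input and output
# shapes are assumptions and are not confirmed by this page.
def flattened_word_json(words:, width:, height:)
  coords = Hash.new { |hash, key| hash[key] = [] }
  # Group every occurrence of a word under a single key.
  words.each { |w| coords[w[:word]] << w[:coordinates] }
  JSON.generate({ width: width, height: height, coords: coords })
end

words = [
  { word: 'Hello', coordinates: [10, 20, 100, 40] },
  { word: 'world', coordinates: [120, 20, 90, 40] },
  { word: 'Hello', coordinates: [10, 80, 100, 40] }
]

puts flattened_word_json(words: words, width: 800, height: 600)
```

Grouping by word text keeps lookups for a search term to a single hash access, at the cost of losing reading order.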