Class: NewspaperWorks::TextExtraction::HOCRReader
- Inherits: Object
  - Object
  - NewspaperWorks::TextExtraction::HOCRReader
- Defined in: lib/newspaper_works/text_extraction/hocr_reader.rb
Overview
Class to obtain plain text and JSON word-coordinates from hOCR source
- Coordinates in px units, unlike ALTO, which may have scaling concerns
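To make the pixel-unit point concrete, here is a minimal sketch of how a word's bounding box can be read from an hOCR `title` attribute, which the hOCR format writes as `bbox x0 y0 x1 y1` in pixels. The helper name `parse_bbox` and the returned hash shape are assumptions made for illustration; they are not taken from HOCRReader or HOCRDocStream, whose internals are not shown on this page.

```ruby
# Hypothetical helper, not part of NewspaperWorks: extract the pixel
# bounding box from an hOCR title attribute such as
# "bbox 36 92 96 116; x_wconf 95".
def parse_bbox(title)
  match = title.match(/bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)/)
  return nil unless match

  x0, y0, x1, y1 = match.captures.map(&:to_i)
  # Assumed output shape: top-left corner plus width and height, all in px.
  { x: x0, y: y0, width: x1 - x0, height: y1 - y0 }
end

box = parse_bbox('bbox 36 92 96 116; x_wconf 95')
missing = parse_bbox('x_wconf 95')
```

Because the values are already pixels, no unit conversion step is needed in this sketch; a title without a `bbox` property yields `nil`.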
Defined Under Namespace
Classes: HOCRDocStream
Instance Attribute Summary
- #doc_stream ⇒ Object
  Returns the value of attribute doc_stream.
- #source ⇒ Object
  Returns the value of attribute source.
Instance Method Summary
- #initialize(html) ⇒ HOCRReader (constructor)
  Construct with either a file path or an HTML String.
- #isxml?(xml) ⇒ true, false
  Determine whether the source parameter is XML/HTML markup (true) or a file path (false).
- #json ⇒ String
  Output flattened word coordinates as JSON.
Constructor Details
#initialize(html) ⇒ HOCRReader
Construct with either a file path or an HTML String.
    # File 'lib/newspaper_works/text_extraction/hocr_reader.rb', line 144
    def initialize(html)
      @source = isxml?(html) ? html : File.read(html)
      @doc_stream = HOCRDocStream.new
      parser = Nokogiri::HTML::SAX::Parser.new(doc_stream)
      parser.parse(@source)
    end
Instance Attribute Details
#doc_stream ⇒ Object
Returns the value of attribute doc_stream.
    # File 'lib/newspaper_works/text_extraction/hocr_reader.rb', line 11
    def doc_stream
      @doc_stream
    end
#source ⇒ Object
Returns the value of attribute source.
    # File 'lib/newspaper_works/text_extraction/hocr_reader.rb', line 11
    def source
      @source
    end
Instance Method Details
#isxml?(xml) ⇒ true, false
Determine whether the source parameter is XML/HTML markup or a file path. Returns true when the string, ignoring leading whitespace, starts with '<'.
    # File 'lib/newspaper_works/text_extraction/hocr_reader.rb', line 155
    def isxml?(xml)
      xml.lstrip.start_with?('<')
    end
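The constructor and isxml? together decide whether the argument is already markup or a path to read from disk. The standalone sketch below mirrors that branch without Nokogiri or the NewspaperWorks gem; `markup?` and `resolve_source` are hypothetical names used only for illustration, and the temporary file stands in for a real hOCR file.

```ruby
require 'tempfile'

# Same check as isxml?: markup starts with '<' once leading
# whitespace is dropped.
def markup?(source)
  source.lstrip.start_with?('<')
end

# Hypothetical helper mirroring the constructor's branch: return the
# string unchanged if it is markup, otherwise treat it as a file path.
def resolve_source(source)
  markup?(source) ? source : File.read(source)
end

inline = "  \n<html><body><p>hOCR</p></body></html>"
from_string = resolve_source(inline)

from_file = nil
Tempfile.create(['sample', '.hocr']) do |file|
  file.write('<html></html>')
  file.flush
  from_file = resolve_source(file.path)
end
```

One consequence of this heuristic is that any string beginning with '<' is treated as markup, so a file path that itself starts with that character would not be read from disk.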
#json ⇒ String
Output flattened word coordinates as JSON.
    # File 'lib/newspaper_works/text_extraction/hocr_reader.rb', line 162
    def json
      words = @doc_stream.words
      builder = NewspaperWorks::TextExtraction::WordCoordsBuilder.new(
        words,
        @doc_stream.width,
        @doc_stream.height
      )
      builder.to_json
    end