Class: NewspaperWorks::TextExtraction::HOCRReader

Inherits:
Object
Defined in:
lib/newspaper_works/text_extraction/hocr_reader.rb

Overview

Class that reads hOCR source and produces plain text and JSON word coordinates.

- Coordinates are in pixel units, unlike ALTO, where coordinates may need scaling
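
To illustrate the pixel-unit point: in hOCR, an element's bounding box is stored in its `title` attribute as `bbox x0 y0 x1 y1`, measured in image pixels. The sketch below is not part of this class; `parse_bbox` is a hypothetical helper name, and the returned hash shape is an assumption chosen for illustration.

```ruby
# Hypothetical helper (not part of HOCRReader): read the pixel bounding box
# out of an hOCR `title` attribute value such as "bbox 36 92 96 116; x_wconf 95".
def parse_bbox(title)
  match = title.match(/\bbbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)/)
  return nil unless match # no bbox property present

  x0, y0, x1, y1 = match.captures.map(&:to_i)
  # Convert the two corner points into origin plus size, all in pixels.
  { x: x0, y: y0, width: x1 - x0, height: y1 - y0 }
end

box = parse_bbox('bbox 36 92 96 116; x_wconf 95')
```

Because the values are already pixels, no unit conversion step is needed before they are used as image coordinates.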

Defined Under Namespace

Classes: HOCRDocStream

Constructor Details

#initialize(html) ⇒ HOCRReader

Construct with either a path to an hOCR file or an hOCR/HTML source string, then parse the document.

Parameters:

  • html (String)

    path to an hOCR file, or hOCR/HTML source



# File 'lib/newspaper_works/text_extraction/hocr_reader.rb', line 144

def initialize(html)
  @source = isxml?(html) ? html : File.read(html)
  @doc_stream = HOCRDocStream.new
  parser = Nokogiri::HTML::SAX::Parser.new(doc_stream)
  parser.parse(@source)
end

Instance Attribute Details

#doc_stream ⇒ Object

Returns the value of attribute doc_stream.



# File 'lib/newspaper_works/text_extraction/hocr_reader.rb', line 11

def doc_stream
  @doc_stream
end

#source ⇒ Object

Returns the value of attribute source.



# File 'lib/newspaper_works/text_extraction/hocr_reader.rb', line 11

def source
  @source
end

Instance Method Details

#isxml?(xml) ⇒ true, false

Determine whether the source parameter is a file path or XML/HTML markup

Parameters:

  • xml (String)

    either path to xml file or xml source

Returns:

  • (true, false)

    true if value appears to be XML/HTML, not path



# File 'lib/newspaper_works/text_extraction/hocr_reader.rb', line 155

def isxml?(xml)
  xml.lstrip.start_with?('<')
end
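
The check above is a simple heuristic: after leading whitespace is dropped, anything starting with `<` is treated as markup, and everything else is treated as a path. A standalone sketch of the same test follows; `markup?` is a hypothetical name used only for this illustration.

```ruby
# Hypothetical standalone copy of the heuristic, for illustration only.
def markup?(value)
  # Ignore leading whitespace, then look for an opening angle bracket.
  value.lstrip.start_with?('<')
end

from_markup = markup?("  \n<html><body></body></html>") # leading whitespace is ignored
from_path   = markup?('/tmp/page.hocr')                 # no leading '<', so treated as a path
from_empty  = markup?('')                               # empty input is also treated as a path
```

One consequence of this design is that a source string with any non-whitespace text before its first `<` would be classified as a path.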

#json ⇒ String

Output flattened word coordinates as JSON

Returns:

  • (String)

    JSON serialization of flattened word coordinates



# File 'lib/newspaper_works/text_extraction/hocr_reader.rb', line 162

def json
  words = @doc_stream.words
  builder = NewspaperWorks::TextExtraction::WordCoordsBuilder.new(
    words,
    @doc_stream.width,
    @doc_stream.height
  )
  builder.to_json
end
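
The exact output of `WordCoordsBuilder` is not shown in this section. As a rough sketch of what "flattened word coordinates" could look like, the example below groups pixel boxes by word text and serializes them alongside the page dimensions. The input hash keys (`:word`, `:coordinates`), the output keys, and the `flatten_words` name are all assumptions made for illustration, not the library's confirmed format.

```ruby
require 'json'

# Assumed input shape: one hash per word, with an [x, y, width, height] pixel box.
words = [
  { word: 'The',  coordinates: [36, 92, 60, 24] },
  { word: 'news', coordinates: [110, 92, 80, 24] },
  { word: 'The',  coordinates: [36, 130, 60, 24] }
]

# Hypothetical flattening step: repeated words share one key, and each key
# maps to the list of boxes where that word occurs on the page.
def flatten_words(words, width, height)
  coords = Hash.new { |hash, key| hash[key] = [] }
  words.each { |entry| coords[entry[:word]] << entry[:coordinates] }
  { width: width, height: height, coords: coords }
end

payload = JSON.generate(flatten_words(words, 2400, 3600))
```

Keying by word text makes it cheap to look up every occurrence of a search term on a page, at the cost of losing reading order.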