Class: Sqed::Parser::OcrParser

Inherits:
Sqed::Parser
Defined in:
lib/sqed/parser/ocr_parser.rb

Overview

Given a single image return all text in that image.

For reference

http://misteroleg.wordpress.com/2012/12/19/ocr-using-tesseract-and-imagemagick-as-pre-processing-task/
https://code.google.com/p/tesseract-ocr/wiki/FAQ
http://www.sk-spell.sk.cx/tesseract-ocr-parameters-in-302-version

“There is a minimum text size for reasonable accuracy. You have to consider resolution as well as point size. Accuracy drops off below 10pt x 300dpi, rapidly below 8pt x 300dpi. A quick check is to count the pixels of the x-height of your characters. (X-height is the height of the lower case x.) At 10pt x 300dpi x-heights are typically about 20 pixels, although this can vary dramatically from font to font. Below an x-height of 10 pixels, you have very little chance of accurate results, and below about 8 pixels, most of the text will be ‘noise removed’.”
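
The quick check in the quote can be approximated with a little arithmetic. The sketch below is illustrative only: `estimated_x_height_px` is a hypothetical helper, not part of Sqed, and the x-height ratio of 0.48 of the point size is an assumption chosen to match the quoted "about 20 pixels at 10pt x 300dpi"; real fonts vary considerably.

```ruby
# Hypothetical helper, not part of Sqed: estimate the x-height in pixels.
POINTS_PER_INCH = 72.0

# x_height_ratio is an assumed fraction of the point size (font dependent).
def estimated_x_height_px(point_size, dpi, x_height_ratio: 0.48)
  em_px = point_size / POINTS_PER_INCH * dpi # point size expressed in pixels
  (em_px * x_height_ratio).round
end

puts estimated_x_height_px(10, 300) # 20, the "reasonable accuracy" case
puts estimated_x_height_px(8, 300)  # 16
puts estimated_x_height_px(10, 150) # 10, at the edge of usable results
```

Under these assumptions, halving the scan resolution of 10pt text already brings the x-height down to the 10 pixel floor mentioned in the quote.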

Constant Summary

TYPE =
:text
SECTION_PARAMS =

Tesseract parameters, keyed by section type. The :default entry is merged with the entry for the requested section type, and the type-specific values take precedence.

{
  default: {
    psm: 3
  },
  annotated_specimen: {
    # was 45, significantly improves annotated_specimen for odonates
    edges_children_count_limit: 3000
  },
  identifier: {
    psm: 1,
    # tessedit_char_whitelist: '0123456789'
    #  edges_children_count_limit: 4000
  },
  curator_metadata: {
    psm: 3
  },
  labels: {
    psm: 3, # may need to be 6
  },
  determination_labels: {
    psm: 3
  },
  other_labels: {
    psm: 3
  },
  collecting_event_labels: {
    psm: 3
  }
}.freeze
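
The merge behaviour can be sketched in isolation. The example below uses a trimmed copy of the constant under the hypothetical name `SECTION_PARAMS_SKETCH`, and a hypothetical helper `params_for`. The `fetch` fallback to an empty hash for unknown section types is an assumption added here; the documented method indexes the constant directly.

```ruby
# Trimmed, illustrative copy of the constant (hypothetical name).
SECTION_PARAMS_SKETCH = {
  default: { psm: 3 },
  annotated_specimen: { edges_children_count_limit: 3000 },
  identifier: { psm: 1 }
}.freeze

# Hypothetical helper: defaults first, then type-specific overrides.
# Unknown section types fall back to the defaults (an added assumption).
def params_for(section_type)
  SECTION_PARAMS_SKETCH[:default].merge(SECTION_PARAMS_SKETCH.fetch(section_type, {}))
end

p params_for(:annotated_specimen) # psm: 3 plus edges_children_count_limit: 3000
p params_for(:identifier)         # psm overridden to 1
p params_for(:unknown)            # defaults only
```

Because `merge` returns a new hash, the frozen constant is never mutated.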

Instance Attribute Summary

Attributes inherited from Sqed::Parser

#extracted_text, #image

Instance Method Summary

Methods inherited from Sqed::Parser

#initialize

Constructor Details

This class inherits a constructor from Sqed::Parser

Instance Method Details

#get_text(section_type: :default) ⇒ String

TODO: very kludgy

Returns:

  • (String)

    the ocr text



# File 'lib/sqed/parser/ocr_parser.rb', line 112

def get_text(section_type: :default)
  img = image

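  # Resample to 300 dpi when the image is small (fewer than 144,000 pixels in total).
  # Note: a 4"x4" image at 300 dpi is 1200 x 1200 = 1,440,000 pixels.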
  if img.columns * img.rows < 144000
    img = img.resample(300)
  end

  params = SECTION_PARAMS[:default].dup
  params.merge!(SECTION_PARAMS[section_type])

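  # The Tempfile round trip below is a hacky kludge; it may be avoidable by passing `processor:` to `new`.
  # Note: this first pass reads the original `image`, not the possibly resampled `img`.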
  file = Tempfile.new('foo1', encoding: 'utf-8')
  begin
    file.write(image.to_blob.force_encoding('utf-8'))
    file.rewind
    @extracted_text = RTesseract.new(file.path, params).to_s&.strip
    file.close
  ensure
    file.close
    file.unlink   # deletes the temp file
  end

  if @extracted_text == ''
    file = Tempfile.new('foo2', encoding: 'utf-8')
    begin
      file.write(img.dup.white_threshold(245).to_blob.force_encoding('utf-8'))
      file.rewind
      @extracted_text = RTesseract.new(file.path, params).to_s&.strip
      file.close
    ensure
      file.close
      file.unlink
    end
  end

  if @extracted_text == ''
    file = Tempfile.new('foo3', encoding: 'utf-8')
    begin
      file.write(img.dup.quantize(256, Magick::GRAYColorspace).to_blob.force_encoding('utf-8'))
      file.rewind
      @extracted_text = RTesseract.new(file.path, params).to_s&.strip
      file.close
    ensure
      file.close
      file.unlink
    end
  end

  @extracted_text
end
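
The three passes above follow one pattern: run OCR, and if the result is empty, retry on a differently preprocessed image. A minimal sketch of that pattern, with the image handling and Tesseract call replaced by stand-in lambdas so that it runs without RMagick or RTesseract, might look like the following. `first_non_empty_text`, `fake_ocr`, and `steps` are hypothetical names, not part of Sqed.

```ruby
# Hypothetical helper: try each preprocessing step in order and return
# the first non-empty, stripped OCR result ('' if every pass fails).
def first_non_empty_text(source, preprocessors, ocr)
  preprocessors.each do |prepare|
    text = ocr.call(prepare.call(source)).to_s.strip
    return text unless text.empty?
  end
  ''
end

# Stand-ins for the real work: this fake OCR only "reads" thresholded input.
fake_ocr = ->(input) { input == 'thresholded' ? " Odonata \n" : '' }

steps = [
  ->(src)  { src },           # pass 1: image as given
  ->(_src) { 'thresholded' }, # pass 2: stands in for white_threshold(245)
  ->(_src) { 'grayscale' }    # pass 3: stands in for quantize(256, GRAYColorspace)
]

puts first_non_empty_text('raw', steps, fake_ocr) # Odonata
```

In this sketch the third step is never reached, because the second pass already yields text; this mirrors the `@extracted_text == ''` guards in the method.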