Class: Sqed::Parser::OcrParser

Inherits: Sqed::Parser

Object
  Sqed::Parser
    Sqed::Parser::OcrParser

Defined in: lib/sqed/parser/ocr_parser.rb

Overview

Given a single image return all text in that image.

For reference

http://misteroleg.wordpress.com/2012/12/19/ocr-using-tesseract-and-imagemagick-as-pre-processing-task/
https://code.google.com/p/tesseract-ocr/wiki/FAQ
http://www.sk-spell.sk.cx/tesseract-ocr-parameters-in-302-version

“There is a minimum text size for reasonable accuracy. You have to consider resolution as well as point size. Accuracy drops off below 10pt x 300dpi, rapidly below 8pt x 300dpi. A quick check is to count the pixels of the x-height of your characters. (X-height is the height of the lower case x.) At 10pt x 300dpi x-heights are typically about 20 pixels, although this can vary dramatically from font to font. Below an x-height of 10 pixels, you have very little chance of accurate results, and below about 8 pixels, most of the text will be ‘noise removed’.”
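The quick check described in the quote can be approximated in a few lines of Ruby. This is a rough sketch, not part of Sqed: `estimated_x_height_px` is a hypothetical helper, and the x-height ratio of 0.48 is an assumption chosen so that 10pt at 300dpi lands near the quoted 20 pixels; real fonts vary considerably.

```ruby
# One typographic point is 1/72 of an inch.
POINTS_PER_INCH = 72.0

# Assumption: x-height is roughly 48% of the nominal font size.
# This ratio differs from font to font; adjust it for your labels.
ASSUMED_X_HEIGHT_RATIO = 0.48

# Hypothetical helper: estimate the x-height in pixels for text of a
# given point size scanned at a given resolution.
def estimated_x_height_px(point_size, dpi, ratio: ASSUMED_X_HEIGHT_RATIO)
  (point_size * dpi / POINTS_PER_INCH * ratio).round
end

# Flag combinations that fall below the 10-pixel x-height mentioned above.
[[10, 300], [8, 300], [4, 300]].each do |point_size, dpi|
  px = estimated_x_height_px(point_size, dpi)
  verdict = px < 10 ? 'likely too small for OCR' : 'ok'
  puts "#{point_size}pt at #{dpi}dpi: about #{px}px x-height (#{verdict})"
end
```

In practice it is more reliable to measure the x-height directly in the scanned image; the estimate is only useful for deciding on a scan resolution up front.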

Constant Summary

TYPE = :text

Tesseract parameters, either default or specific to a section type. The `default` entry is merged with the entry for the requested section type, whose values take precedence.

SECTION_PARAMS =

{
  default: {
    psm: 3
  },
  annotated_specimen: {
    # was 45; raising it significantly improves annotated_specimen results for odonates
    edges_children_count_limit: 3000
  },
  identifier: {
    psm: 1,
    # tessedit_char_whitelist: '0123456789'
    #  edges_children_count_limit: 4000
  },
  curator_metadata: {
    psm: 3
  },
  labels: {
    psm: 3, # may need to be 6
  },
  determination_labels: {
    psm: 3
  },
  other_labels: {
    psm: 3
  },
  collecting_event_labels: {
    psm: 3
  }
}.freeze
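To illustrate how the defaults combine with a section type, here is a minimal sketch using a trimmed copy of the hash. `params_for` is a hypothetical helper, not a Sqed method. Unlike `#get_text`, which merges `SECTION_PARAMS[section_type]` directly, this sketch falls back to an empty hash for an unknown section type.

```ruby
# Trimmed copy of the constant, for illustration only.
SECTION_PARAMS = {
  default: { psm: 3 },
  annotated_specimen: { edges_children_count_limit: 3000 },
  identifier: { psm: 1 }
}.freeze

# Hypothetical helper: start from the defaults and let the
# section-specific values override them. Hash#merge returns a new
# hash, so the frozen constant is never modified.
def params_for(section_type)
  SECTION_PARAMS[:default].merge(SECTION_PARAMS.fetch(section_type, {}))
end

p params_for(:identifier)          # section value overrides the default psm
p params_for(:annotated_specimen)  # default psm plus the extra limit
p params_for(:unknown)             # falls back to the defaults alone
```

Section types that only repeat `psm: 3` therefore behave the same as `:default`; they exist as named places to tune each type later.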

Instance Attribute Summary

Attributes inherited from Sqed::Parser

#extracted_text, #image

Instance Method Summary

#get_text(section_type: :default) ⇒ String

TODO: very kludgy.

Methods inherited from Sqed::Parser

#initialize

Constructor Details

This class inherits a constructor from Sqed::Parser

Instance Method Details

#get_text(section_type: :default) ⇒ `String`

TODO: very kludgy.

Returns:

(String): the OCR text

# File 'lib/sqed/parser/ocr_parser.rb', line 112

def get_text(section_type: :default)
  img = image

  # Resample to 300dpi if the image is small (intended check: a 4"x4" image at less than 300dpi)
  if img.columns * img.rows < 144000
    img = img.resample(300)
  end

  params = SECTION_PARAMS[:default].dup
  params.merge!(SECTION_PARAMS[section_type])

  # It may be possible to avoid this temp-file kludge by passing `processor:` to `new`.
  # NOTE: this first pass writes the original `image`, not the resampled `img`.
  file = Tempfile.new('foo1', encoding: 'utf-8')
  begin
    file.write(image.to_blob.force_encoding('utf-8'))
    file.rewind
    @extracted_text = RTesseract.new(file.path, params).to_s&.strip
  ensure
    file.close
    file.unlink   # deletes the temp file
  end

  if @extracted_text == ''
    file = Tempfile.new('foo2', encoding: 'utf-8')
    begin
      file.write(img.dup.white_threshold(245).to_blob.force_encoding('utf-8'))
      file.rewind
      @extracted_text = RTesseract.new(file.path, params).to_s&.strip
    ensure
      file.close
      file.unlink
    end
  end

  if @extracted_text == ''
    file = Tempfile.new('foo3', encoding: 'utf-8')
    begin
      file.write(img.dup.quantize(256, Magick::GRAYColorspace).to_blob.force_encoding('utf-8'))
      file.rewind
      @extracted_text = RTesseract.new(file.path, params).to_s&.strip
    ensure
      file.close
      file.unlink
    end
  end

  @extracted_text
end
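The method above tries up to three passes (the original image, a white-thresholded copy, then a grayscale-quantized copy) and stops at the first one that yields text. The same control flow can be sketched without RMagick or Tesseract. `first_non_empty_text` and the lambdas below are hypothetical stand-ins for the OCR passes, not part of Sqed.

```ruby
# Hypothetical helper: call each attempt in order and return the first
# non-empty, stripped result. Returns '' if every attempt comes back empty.
def first_non_empty_text(attempts)
  attempts.each do |attempt|
    text = attempt.call.to_s.strip
    return text unless text.empty?
  end
  ''
end

# Stand-ins for the three OCR passes; `calls` records which ones ran.
calls = []
attempts = [
  -> { calls << :original;  '' },                 # plain image: nothing found
  -> { calls << :threshold; "  Odonata 123\n" },  # thresholded copy succeeds
  -> { calls << :grayscale; 'never reached' }     # skipped after a success
]

result = first_non_empty_text(attempts)
puts result
puts calls.inspect
```

Expressing the passes as a list keeps the temp-file handling in one place and makes it straightforward to add or reorder preprocessing steps.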