Class: Treat::Workers::Formatters::Readers::Image

Inherits:
Object
Defined in:
lib/treat/workers/formatters/readers/image.rb

Overview

This class is a wrapper for the Google Ocropus optical character recognition (OCR) engine.

“OCRopus(tm) is a state-of-the-art document analysis and OCR system, featuring pluggable layout analysis, pluggable character recognition, statistical natural language modeling, and multi-lingual capabilities.”

Original paper: Google Ocropus Engine: Breuel, Thomas M. The Ocropus Open Source OCR System. DFKI and U. Kaiserslautern, Germany.

Class Method Details

.create_temp_dir(&block) ⇒ Object

Creates a temporary directory under Treat.paths.tmp, passes its path to the block, and deletes the directory once the block finishes.

# File 'lib/treat/workers/formatters/readers/image.rb', line 42

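def self.create_temp_dir(&block)
  # Make sure Treat's base temp directory exists.
  unless FileTest.directory?(Treat.paths.tmp)
    FileUtils.mkdir(Treat.paths.tmp)
  end
  # Pick a randomly named subdirectory for this run.
  dname = File.join(Treat.paths.tmp, Random.rand(10000000).to_s)
  Dir.mkdir(dname)
  block.call(dname)
ensure
  # dname is still nil if an error was raised before it was assigned.
  FileUtils.rm_rf(dname) if dname
end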
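
The same create-then-clean-up pattern can also be sketched with Ruby's standard Dir.mktmpdir, which picks a unique name and removes the directory when the block exits, including when the block raises. This is only an illustrative alternative, not how Treat does it; the helper name with_scratch_dir and the 'treat-ocr' prefix are hypothetical.

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical helper, not part of Treat: yields a fresh scratch directory
# and removes it (with its contents) when the block returns or raises.
def with_scratch_dir(prefix = 'treat-ocr')
  Dir.mktmpdir(prefix) do |dir|
    yield dir
  end
end

inside = nil
result = with_scratch_dir do |dir|
  inside = dir                                   # remember the path for later
  File.write(File.join(dir, 'page.txt'), 'hello')
  File.read(File.join(dir, 'page.txt'))          # block value is returned
end
```

Because Dir.mktmpdir generates the name itself, this variant avoids the small chance of two runs drawing the same random number.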

.read(document, options = {}) ⇒ Object

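Reads an image file by running it through the Google Ocropus command-line tools, then loads the resulting hOCR output as HTML. Returns the document.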

Options:

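  • (Boolean) :silent => whether to silence Ocropus. Note that the source shown here does not read this option; it checks Treat.core.verbosity.silence instead.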

# File 'lib/treat/workers/formatters/readers/image.rb', line 20

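def self.read(document, options = {})

  # Runs the Ocropus pipeline in a scratch directory, then re-reads
  # the resulting hOCR file as HTML.
  read = lambda do |doc|
    self.create_temp_dir do |tmp|
      # 1. Binarize the input image.
      `ocropus-nlbin -o #{tmp}/out #{doc.file}`
      # 2. Segment the page into text lines.
      `ocropus-gpageseg #{tmp}/out/????.bin.png --minscale 2`
      # 3. Recognize the text of each line.
      `ocropus-rpred #{tmp}/out/????/??????.bin.png`
      # 4. Assemble the results into a single hOCR file.
      `ocropus-hocr #{tmp}/out/????.bin.png -o #{tmp}/book.html`
      doc.set :file, "#{tmp}/book.html"
      doc.set :format, :html

      doc.read(:html)
    end
  end

  # Silencing follows the global verbosity setting.
  if Treat.core.verbosity.silence
    silence_stdout { read.call(document) }
  else
    read.call(document)
  end

  document
end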
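
The reader interpolates doc.file into the shell command as-is, so a path containing spaces or shell metacharacters would be split or misread by the shell. A minimal sketch of one way to guard against that, assuming the same four commands and flags as above: the helper name ocropus_commands is hypothetical, and it only builds the command strings without running them. The glob patterns are left unescaped on purpose so the shell still expands them.

```ruby
require 'shellwords'

# Hypothetical helper, not part of Treat: returns the four command lines
# in the order the reader runs them, without executing anything.
def ocropus_commands(input_file, tmp_dir)
  file = Shellwords.escape(input_file)  # protect spaces and metacharacters
  tmp  = Shellwords.escape(tmp_dir)
  [
    "ocropus-nlbin -o #{tmp}/out #{file}",
    "ocropus-gpageseg #{tmp}/out/????.bin.png --minscale 2",
    "ocropus-rpred #{tmp}/out/????/??????.bin.png",
    "ocropus-hocr #{tmp}/out/????.bin.png -o #{tmp}/book.html"
  ]
end

commands = ocropus_commands('my scan.png', '/tmp/treat-ocr')
puts commands
```

Keeping command construction separate from execution also makes the pipeline easy to inspect or log before anything is run.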