OCRSDK

A Ruby wrapper for Abbyy's OCRSDK API.

Terminology

Abbyy uses the terms "Image" and "Document" in a way that might be confusing at first, so let's get them straight from the beginning:

  • Image - a single input file that results in a single output. It may be a multi-page PDF document as well as a JPEG image.
  • Document - a collection of files that results in a single output, e.g. a collection of scanned pages in TIFF format.

Installation

gem install ocrsdk

Configuration

OCRSDK.setup do |config|
  config.application_id = '99bottlesofbeer'
  config.password = '98bottlesofbeer'

  # How long to wait between polling requests, in seconds
  config.default_poll_time = 3 # default

  # How many times to retry before treating a request as failed
  config.number_or_retries = 3 # default
  # How long to wait between retries
  config.retry_wait_time   = 3 # default
end
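
Hard-coding credentials as above is fine for a demo, but you may prefer to read them from the environment. A minimal sketch: the helper name `ocrsdk_credentials` and the variable names `OCRSDK_APPLICATION_ID` and `OCRSDK_PASSWORD` are assumptions made for illustration, not something the gem defines.

```ruby
# Hypothetical helper: collects OCRSDK credentials from an env-like hash.
# The variable names are illustrative assumptions, not gem conventions.
# fetch raises KeyError when a variable is missing, so a misconfigured
# environment fails early instead of sending empty credentials.
def ocrsdk_credentials(env = ENV)
  {
    application_id: env.fetch('OCRSDK_APPLICATION_ID'),
    password:       env.fetch('OCRSDK_PASSWORD')
  }
end

# Intended use with the setup block shown above (requires the gem):
#
#   creds = ocrsdk_credentials
#   OCRSDK.setup do |config|
#     config.application_id = creds[:application_id]
#     config.password       = creds[:password]
#   end
```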

Usage

There are two basic workflows: synchronous and asynchronous. The synchronous one is simpler, but recognition may take a significant amount of time. If you want to keep using the same thread in the meantime (or tell the user that processing has started), use the asynchronous version, which exists for each function.

Simple

image = OCRSDK::Image.new '~/why_cats_paint.pdf'

# sync
image.as_text_sync([:english]) # => "because they can"
image.as_pdf_sync([:english], '~/why_cats_paint_ocred.pdf') # => 31337 (bytes written)

# async
promise = image.as_text([:english])
puts "Your document will be ready in #{promise.estimate_completion}"
promise.wait.result # => "because they can" (a byte-string; you might need .force_encoding("utf-8"))
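
The note about `.force_encoding("utf-8")` is worth a quick illustration. The snippet below uses plain Ruby strings rather than a real promise result, and it assumes the recognized text really is UTF-8 encoded; `raw` is only a stand-in for what `promise.wait.result` might return.

```ruby
# Stand-in for a byte-string result: "caf\xC3\xA9" is "café" as UTF-8 bytes.
raw = "caf\xC3\xA9".b

raw.encoding         # binary (ASCII-8BIT)
raw.length           # counts bytes

# force_encoding relabels the same bytes; it does not transcode them.
text = raw.dup.force_encoding('utf-8')

text.encoding        # UTF-8
text.valid_encoding? # true only if the bytes really are valid UTF-8
text.length          # counts characters
```

If `valid_encoding?` returns false, the bytes are not UTF-8 and relabelling them will not help; in that case the output is in a different encoding.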

Advanced

# PDF has the same methods as Image, plus a few format-specific ones
pdf = OCRSDK::PDF.new '~/why_cats_paint.pdf'
unless pdf.recognizeable?
  return puts "Your document is already recognized"
end

promise = pdf.as_pdf([:english]) # takes a PDF made of images, returns a recognized PDF
puts "Your document will be ready in #{promise.estimate_completion}"

while promise.processing?
  begin
    promise.update
    sleep 5
  rescue OCRSDK::NotEnoughCredits
    return puts "You need to purchase more credits for your account"
  end
end

if promise.completed?
  # File.open does not expand "~", so resolve the path explicitly
  File.open(File.expand_path('~/why_cats_paint_ocred.pdf'), 'wb') { |f| f.write promise.result }
else
  puts "Processing failed"
end
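
The polling loop above never gives up if processing stalls. One way to bound it is a small helper with a timeout. This is only a sketch: `wait_for_result` and `FakePromise` are hypothetical names that are not part of the gem, and the helper relies solely on the promise methods used in the example above (`processing?`, `update`, `completed?` and `result`).

```ruby
# Hypothetical helper: polls a promise-like object until it stops processing
# or the timeout (in seconds) elapses. Returns the result on completion and
# nil on failure or timeout.
def wait_for_result(promise, interval: 5, timeout: 300)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout

  while promise.processing?
    # Give up once the deadline has passed
    return nil if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline

    promise.update
    sleep interval if promise.processing?
  end

  promise.completed? ? promise.result : nil
end

# Stand-in promise for trying the helper without network access.
# It reports completion after a fixed number of update calls.
class FakePromise
  attr_reader :result

  def initialize(updates_needed, result)
    @updates_left = updates_needed
    @result = result
  end

  def update
    @updates_left -= 1 if @updates_left > 0
    self
  end

  def processing?
    @updates_left > 0
  end

  def completed?
    @updates_left.zero?
  end
end

puts wait_for_result(FakePromise.new(3, 'because they can'), interval: 0)
```

Returning nil for both failure and timeout keeps the sketch short; real code would probably want to distinguish the two, and would still need to rescue OCRSDK::NotEnoughCredits as the loop above does.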

Testing

To make tests deterministic and fast, you might want to mock all network interactions. For this purpose OCRSDK provides a Mock module built on top of the WebMock gem. It can be used with any testing environment, including RSpec, Capybara and Cucumber. To mock all of the gem's requests, call one of these functions before running tests:

  • OCRSDK::Mock.success - the image is submitted correctly, the promise returns the :completed status, and you can retrieve a result;
  • OCRSDK::Mock.in_progress - the image is submitted correctly, but the promise returns the :in_progress status for every request, so you won't be able to get the result;
  • OCRSDK::Mock.not_enough_credits - submitting an image raises an OCRSDK::NotEnoughCredits error.

For example, you can mock OCRSDK in a controller test like this:

# spec_helper.rb
require 'ocrsdk/mock'

# some_controller_spec.rb

require 'spec_helper'

describe SomeController do
  before { OCRSDK::Mock.success }

  describe "POST recognize" do
    # ...
  end
end

Copyright

Copyright © 2012 Andrey Korzhuev. See LICENSE for details.