Class: OCRSDK::PDF

Inherits:
Image
Defined in:
lib/ocrsdk/pdf.rb

Constant Summary

Constants included from Verifiers::Profile

Verifiers::Profile::PROFILES

Constants included from Verifiers::Format

Verifiers::Format::INPUT_FORMATS, Verifiers::Format::OUTPUT_FORMATS

Constants included from Verifiers::Language

Verifiers::Language::LANGUAGES

Instance Method Summary

Methods inherited from Image

#as_pdf, #as_pdf_sync, #as_text, #as_text_sync, #as_xml, #as_xml_sync, #initialize

Methods included from Verifiers::Profile

#profile_to_s, #supported_profile?

Methods included from Verifiers::Format

#format_to_s, #supported_input_format?, #supported_output_format?

Methods included from Verifiers::Language

#language_to_s, #language_to_sym, #languages_to_s, #supported_language?

Methods inherited from AbstractEntity

#initialize

Constructor Details

This class inherits a constructor from OCRSDK::Image

Instance Method Details

#recognizeable? ⇒ Boolean

There is no firm rule for which PDFs need recognition and which do not. Currently a document is considered to need recognition if

images * 20 > length of text

or if its text contains fewer than 10 distinct characters, which suggests a "searchable" document whose text layer was recognized incorrectly.

The assumption is that a scanned document may still carry a little text, such as a title, page numbers or credits, alongside its images.

To allow for a title page, the first page is skipped whenever the document has more than one page. This should not affect documents that do not need recognition.

Returns:

  • (Boolean)


# File 'lib/ocrsdk/pdf.rb', line 15

def recognizeable?
  reader = PDF::Reader.new @image_path

  images = 0
  text   = 0
  chars  = Set.new
  start = reader.pages.length > 1 ? 1 : 0
  reader.pages[start..-1].each do |page|
    text   += page.text.length
    chars  += page.text.split('').map(&:ord).uniq
    images += page.xobjects.map {|k, v| v.hash[:Subtype]}.count(:Image)
  end

  # Also flag documents with very few distinct characters: this catches
  # "searchable" PDFs whose text layer was recognized incorrectly.
  images * 20 > text || chars.length < 10
rescue PDF::Reader::MalformedPDFError, PDF::Reader::UnsupportedFeatureError
  false
end
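
The rule above can be tried without a PDF file. The following is a minimal sketch, not part of OCRSDK: `needs_recognition?` is a hypothetical helper, and it assumes each page has already been reduced to a Hash with a `:text` string and an `:images` count. It applies the same thresholds as `#recognizeable?` (20 characters per image, fewer than 10 distinct characters).

```ruby
require 'set'

# Hypothetical helper (not part of OCRSDK): applies the same rule as
# PDF#recognizeable? to page data that has already been extracted.
# Each page is a Hash with :text (String) and :images (Integer).
def needs_recognition?(pages)
  # Skip the first page of multi-page documents (possible title page).
  start = pages.length > 1 ? 1 : 0

  text_length = 0
  image_count = 0
  distinct    = Set.new

  pages[start..-1].each do |page|
    text_length += page[:text].length
    image_count += page[:images]
    distinct.merge(page[:text].chars)
  end

  # Mostly images, or suspiciously few distinct characters.
  image_count * 20 > text_length || distinct.length < 10
end

# A scan: one image per page and only a page label as text.
scanned = [{ text: 'Title', images: 0 }, { text: 'Page 2', images: 1 }]

# A digital document: one image, but plenty of varied text.
sentence = 'The quick brown fox jumps over the lazy dog. '
digital  = [{ text: 'Title', images: 0 }, { text: sentence * 3, images: 1 }]

puts needs_recognition?(scanned)
puts needs_recognition?(digital)
```

Under these assumptions the scanned example is flagged because 1 * 20 exceeds its 6 characters of text, while the digital example is not, since it has 135 characters and far more than 10 distinct ones.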