Class: Docsplit::TextExtractor

Inherits: Object

Defined in: lib/docsplit/text_extractor.rb

Overview

Delegates to pdftotext and tesseract to extract text from PDF documents. The `--ocr` and `--no-ocr` flags force or forbid OCR extraction. By default, the heuristic works like this:

* Check for the presence of fonts in the PDF. If no fonts are detected,
  OCR is used automatically.
* Extract the text of each page with **pdftotext**. If a page has fewer
  than 100 bytes of text (a scanned image page, or a page that contains
  only a filename and a page number), add it to the list of
  `@pages_to_ocr`.
* Re-OCR each page in the `@pages_to_ocr` list at the end.
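
The heuristic above can be sketched as a small, side-effect-free function. This is only an illustration under stated assumptions: `plan_extraction`, its keyword arguments, and the `page_texts` hash (page number mapped to the text pdftotext returned) are hypothetical names, not part of Docsplit.

```ruby
# Hypothetical sketch of the default heuristic; not Docsplit's own code.
# page_texts: Hash of page number => text extracted directly from that page.
# Returns [:ocr_all, pages] or [:direct, pages_to_re_ocr].
def plan_extraction(has_fonts:, page_texts:, force_ocr: false, forbid_ocr: false, min_bytes: 100)
  # Forcing OCR wins over everything else.
  return [:ocr_all, page_texts.keys] if force_ocr
  # Forbidding OCR means direct extraction only, with no re-OCR pass.
  return [:direct, []] if forbid_ocr
  # No fonts detected: treat the whole document as scanned images.
  return [:ocr_all, page_texts.keys] unless has_fonts

  # Otherwise extract directly and queue pages with too little text.
  sparse = page_texts.select { |_page, text| text.bytesize < min_bytes }.keys
  [:direct, sparse]
end
```

For example, a three-page document whose second page contains only `"p. 2"` would yield `[:direct, [2]]`, meaning that only page 2 is queued for OCR.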

Constant Summary

NO_TEXT_DETECTED =

/---------\n\Z/

OCR_FLAGS =

'-density 400x400 -colorspace GRAY'.freeze

MEMORY_ARGS =

'-limit memory 256MiB -limit map 512MiB'.freeze

MIN_TEXT_PER_PAGE =

100 # in bytes

Instance Method Summary

#contains_text?(pdf) ⇒ Boolean

Does a PDF have any text embedded?
#extract(pdfs, opts) ⇒ Object

Extract text from a list of PDFs.
#extract_from_ocr(pdf, pages) ⇒ Object

Extract a page range worth of text from a PDF via OCR.
#extract_from_pdf(pdf, pages) ⇒ Object

Extract a page range worth of text from a PDF, directly.
#initialize ⇒ TextExtractor constructor

A new instance of TextExtractor.

Constructor Details

#initialize ⇒ `TextExtractor`

Returns a new instance of TextExtractor.


# File 'lib/docsplit/text_extractor.rb', line 22

def initialize
  @pages_to_ocr = []
end

Instance Method Details

#contains_text?(pdf) ⇒ `Boolean`

Does a PDF have any text embedded?

Returns:

(Boolean)

# File 'lib/docsplit/text_extractor.rb', line 45

def contains_text?(pdf)
  fonts = `pdffonts #{ESCAPE[pdf]} 2>&1`
  !fonts.match(NO_TEXT_DETECTED)
end
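
To see why the pattern works, here is a minimal sketch that applies the same regular expression to hand-written strings instead of shelling out to `pdffonts`. The sample output is illustrative only (its column layout is an assumption, not a captured run), and `fonts_listed?` is a hypothetical stand-in for `contains_text?`.

```ruby
# Same pattern as NO_TEXT_DETECTED: a run of dashes, a newline, then end of string.
# If the output ends right after the dashed header line, no font rows were printed.
def fonts_listed?(pdffonts_output, no_text_pattern = /---------\n\Z/)
  !pdffonts_output.match(no_text_pattern)
end

# Illustrative output only; not captured from a real pdffonts run.
header = "name                 type         emb sub uni object ID\n" \
         "-------------------- ------------ --- --- --- ---------\n"
font_row = "ABCDEF+Helvetica     Type 1C      yes yes no       12  0\n"

puts fonts_listed?(header)             # prints false (header only, no fonts)
puts fonts_listed?(header + font_row)  # prints true (a font row follows the dashes)
```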

#extract(pdfs, opts) ⇒ `Object`

Extract text from a list of PDFs.

# File 'lib/docsplit/text_extractor.rb', line 27

def extract(pdfs, opts)
  extract_options opts
  FileUtils.mkdir_p @output unless File.exist?(@output)
  [pdfs].flatten.each do |pdf|
    @pdf_name = File.basename(pdf, File.extname(pdf))
    pages = @pages == 'all' ? 1..Docsplit.extract_length(pdf) : @pages
    if @force_ocr || (!@forbid_ocr && !contains_text?(pdf))
      extract_from_ocr(pdf, pages)
    else
      extract_from_pdf(pdf, pages)
      if !@forbid_ocr && DEPENDENCIES[:tesseract] && !@pages_to_ocr.empty?
        extract_from_ocr(pdf, @pages_to_ocr)
      end
    end
  end
end
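
The input handling at the top of `extract` can be tried in isolation. The sketch below is a hypothetical helper (`normalize_inputs` is not a Docsplit method) that mirrors the `[pdfs].flatten` idiom and the `'all'` page shortcut; `page_count` stands in for the value `Docsplit.extract_length(pdf)` would return.

```ruby
# Hypothetical helper illustrating the normalization done in #extract.
def normalize_inputs(pdfs, pages, page_count)
  # Wrapping and flattening lets callers pass a single path or an array of paths.
  list = [pdfs].flatten
  # 'all' expands to every page; anything else (a list, a range, or nil) passes through.
  range = pages == 'all' ? (1..page_count) : pages
  [list, range]
end
```

Passing a single path with `'all'` and a page count of 3 returns that path wrapped in an array together with the range `1..3`. A `nil` page value passes through unchanged, which is the case where `extract_from_pdf` and `extract_from_ocr` fall back to processing the whole document at once.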

#extract_from_ocr(pdf, pages) ⇒ `Object`

Extract a page range worth of text from a PDF via OCR.

# File 'lib/docsplit/text_extractor.rb', line 57

def extract_from_ocr(pdf, pages)
  tempdir = Dir.mktmpdir
  base_path = File.join(@output, @pdf_name)
  escaped_pdf = ESCAPE[pdf]
  psm = @detect_orientation ? '-psm 1' : ''
  if pages
    pages.each do |page|
      tiff = "#{tempdir}/#{@pdf_name}_#{page}.tif"
      escaped_tiff = ESCAPE[tiff]
      file = "#{base_path}_#{page}"
      run "MAGICK_TMPDIR=#{tempdir} OMP_NUM_THREADS=2 gm convert -despeckle +adjoin #{MEMORY_ARGS} #{OCR_FLAGS} #{escaped_pdf}[#{page - 1}] #{escaped_tiff} 2>&1"
      run "tesseract #{escaped_tiff} #{ESCAPE[file]} -l #{@language} #{psm} 2>&1"
      clean_text(file + '.txt') if @clean_ocr
      FileUtils.remove_entry_secure tiff
    end
  else
    tiff = "#{tempdir}/#{@pdf_name}.tif"
    escaped_tiff = ESCAPE[tiff]
    run "MAGICK_TMPDIR=#{tempdir} OMP_NUM_THREADS=2 gm convert -despeckle #{MEMORY_ARGS} #{OCR_FLAGS} #{escaped_pdf} #{escaped_tiff} 2>&1"
    # psm is '-psm 1' when orientation detection is enabled; otherwise no -psm flag is passed
    run "tesseract #{escaped_tiff} #{base_path} -l #{@language} #{psm} 2>&1"
    clean_text(base_path + '.txt') if @clean_ocr
  end
ensure
  FileUtils.remove_entry_secure tempdir if File.exist?(tempdir)
end
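
The per-page branch builds two shell commands: one to rasterize a single page to TIFF, and one to OCR that TIFF. The sketch below only constructs the command strings so they can be inspected without GraphicsMagick or Tesseract installed. It is a simplified illustration: `ocr_commands` is a hypothetical helper, `Shellwords.escape` is assumed to behave like `ESCAPE`, and the environment variables and memory-limit arguments are omitted for brevity.

```ruby
require 'shellwords'

# Hypothetical helper: builds (but does not run) the commands for one page.
# Shellwords.escape is an assumed stand-in for Docsplit's ESCAPE.
def ocr_commands(pdf, page, tempdir:, output:, language: 'eng')
  name = File.basename(pdf, File.extname(pdf))
  tiff = File.join(tempdir, "#{name}_#{page}.tif")
  out  = File.join(output, "#{name}_#{page}")

  # Pages are 1-based for the caller, but the [n] frame selector is 0-based,
  # hence page - 1.
  convert = "gm convert -despeckle +adjoin -density 400x400 -colorspace GRAY " \
            "#{Shellwords.escape(pdf)}[#{page - 1}] #{Shellwords.escape(tiff)}"
  # tesseract appends .txt to the output base name itself.
  ocr = "tesseract #{Shellwords.escape(tiff)} #{Shellwords.escape(out)} -l #{language}"
  [convert, ocr]
end
```

Calling `ocr_commands('/docs/my report.pdf', 3, tempdir: '/tmp/x', output: 'out')` returns a convert command that selects frame `[2]` of the escaped PDF path and a tesseract command that writes to the base name `out/my\ report_3`.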

#extract_from_pdf(pdf, pages) ⇒ `Object`

Extract a page range worth of text from a PDF, directly.

# File 'lib/docsplit/text_extractor.rb', line 51

def extract_from_pdf(pdf, pages)
  return extract_full(pdf) unless pages
  pages.each { |page| extract_page(pdf, page) }
end
