Module: Docsplit

Extended by: TransparentPDFs

Defined in: lib/docsplit.rb,
lib/docsplit/version.rb,
lib/docsplit/command_line.rb,
lib/docsplit/text_cleaner.rb,
lib/docsplit/pdf_extractor.rb,
lib/docsplit/info_extractor.rb,
lib/docsplit/page_extractor.rb,
lib/docsplit/text_extractor.rb,
lib/docsplit/image_extractor.rb,
lib/docsplit/transparent_pdfs.rb

Overview

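The Docsplit module is the public entry point for extraction. Each class method delegates to the matching extractor class under this namespace (PageExtractor, TextExtractor, ImageExtractor, InfoExtractor, PdfExtractor, TextCleaner), and the PDF-based methods first pass their input through ensure_pdfs.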

Defined Under Namespace

Modules: TransparentPDFs

Classes: CommandLine, ExtractionFailed, ImageExtractor, InfoExtractor, PageExtractor, PdfExtractor, TextCleaner, TextExtractor

Constant Summary

ESCAPE =

->(x) { Shellwords.shellescape(x) }

ROOT =

File.expand_path(File.dirname(__FILE__) + '/..')

ESCAPED_ROOT =

ESCAPE[ROOT]

METADATA_KEYS =

[:author, :date, :creator, :keywords, :producer, :subject, :title, :length].freeze

GM_FORMATS =

['image/gif', 'image/jpeg', 'image/png', 'image/x-ms-bmp', 'image/svg+xml', 'image/tiff', 'image/x-portable-bitmap', 'application/postscript', 'image/x-portable-pixmap'].freeze

DEPENDENCIES =

{ java: false, gm: false, pdftotext: false, pdftk: false, pdftailor: false, tesseract: false, osd: false }

VERSION =

'0.7.9'.freeze
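
The ESCAPE constant above is a lambda wrapping Shellwords.shellescape, and ESCAPED_ROOT applies it with the `[]` call syntax. The following is a minimal sketch of that idiom using only the standard library; the local names `escape`, `path`, and `escaped` and the sample file name are made up for illustration and are not part of Docsplit.

```ruby
require 'shellwords'

# Same shape as the ESCAPE constant: a lambda wrapping Shellwords.shellescape.
escape = ->(x) { Shellwords.shellescape(x) }

# Hypothetical file name containing spaces and an apostrophe.
path = "annual report's draft.pdf"

# Lambdas can be invoked with [] as well as .call, which is why ESCAPE[ROOT] works.
escaped = escape[path]
puts escaped

# Round trip: splitting the escaped string the way a shell would
# yields the original path as a single argument.
puts Shellwords.split(escaped) == [path]
```

Escaping a path this way keeps it a single argument when it is interpolated into a shell command string, even if it contains spaces or quotes.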

Class Method Summary

.clean_text(text) ⇒ Object

Utility method to strip garbage characters from OCR’d text.

.extract_images(pdfs, opts = {}) ⇒ Object

Use the ExtractImages Java class to rasterize each page of a PDF into an image.

.extract_info(pdfs, opts = {}) ⇒ Object

.extract_pages(pdfs, opts = {}) ⇒ Object

Use the ExtractPages Java class to burst a PDF into single pages.

.extract_pdf(docs, opts = {}) ⇒ Object

Use JODConverter to convert the documents to PDFs.

.extract_text(pdfs, opts = {}) ⇒ Object

Use the ExtractText Java class to write out all embedded text.

Methods included from TransparentPDFs

ensure_pdfs, is_pdf?
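
Because Docsplit is extended by TransparentPDFs, the helper methods listed above are callable from Docsplit's own class methods without a receiver. The sketch below illustrates that Ruby mechanism only; `PdfHelpers`, `Splitter`, `pdf_path?`, and `describe` are hypothetical stand-ins, and the extension check is an assumption rather than the way is_pdf? is actually implemented.

```ruby
# Hypothetical helper module standing in for TransparentPDFs.
module PdfHelpers
  # Assumption for this sketch: a path counts as a PDF if its
  # extension is .pdf, compared case-insensitively.
  def pdf_path?(path)
    File.extname(path).casecmp?('.pdf')
  end
end

# Hypothetical module standing in for Docsplit.
module Splitter
  # extend adds PdfHelpers' methods as singleton methods of Splitter,
  # so they can be called bare inside `def self.` methods.
  extend PdfHelpers

  def self.describe(path)
    pdf_path?(path) ? "#{path}: already a PDF" : "#{path}: needs conversion"
  end
end

puts Splitter.describe('report.PDF')
puts Splitter.describe('slides.ppt')
```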

Class Method Details

.clean_text(text) ⇒ `Object`

Utility method to strip garbage characters from OCR’d text.

# File 'lib/docsplit.rb', line 83

def self.clean_text(text)
  TextCleaner.new.clean(text)
end

.extract_images(pdfs, opts = {}) ⇒ `Object`

Use the ExtractImages Java class to rasterize each page of a PDF into an image.

# File 'lib/docsplit.rb', line 54

def self.extract_images(pdfs, opts = {})
  pdfs = ensure_pdfs(pdfs)
  opts[:pages] = normalize_value(opts[:pages]) if opts[:pages]
  ImageExtractor.new.extract(pdfs, opts)
end
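
Note that the method above assigns to `opts[:pages]`, so the hash passed in by the caller is modified in place. The sketch below demonstrates that effect and one way to avoid it. `normalize_value` here is a hypothetical stand-in (the real method's behavior is not shown on this page), and `extract_images_sketch` copies only the option-handling line.

```ruby
# Hypothetical stand-in: the real normalize_value is not documented here.
def normalize_value(value)
  value.to_s
end

# Mirrors only the opts handling of extract_images; no extraction happens.
def extract_images_sketch(_pdfs, opts = {})
  opts[:pages] = normalize_value(opts[:pages]) if opts[:pages]
  opts
end

# Passing a hash directly: the caller's hash is changed.
shared = { pages: 1..3 }
extract_images_sketch('a.pdf', shared)
puts shared[:pages].class

# Passing a shallow copy: the caller's hash keeps its original value.
fresh = { pages: 1..3 }
extract_images_sketch('a.pdf', fresh.dup)
puts fresh[:pages].class
```

Callers that reuse one options hash across several calls may want to pass `opts.dup` for this reason.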

.extract_info(pdfs, opts = {}) ⇒ `Object`

Extract info for the given PDFs by delegating to InfoExtractor#extract_all.

# File 'lib/docsplit.rb', line 77

def self.extract_info(pdfs, opts = {})
  pdfs = ensure_pdfs(pdfs)
  InfoExtractor.new.extract_all(pdfs, opts)
end

.extract_pages(pdfs, opts = {}) ⇒ `Object`

Use the ExtractPages Java class to burst a PDF into single pages.

# File 'lib/docsplit.rb', line 42

def self.extract_pages(pdfs, opts = {})
  pdfs = ensure_pdfs(pdfs)
  PageExtractor.new.extract(pdfs, opts)
end

.extract_pdf(docs, opts = {}) ⇒ `Object`

Use JODConverter to convert the documents to PDFs. If a document is in an image format, use GraphicsMagick to produce the PDF instead.

# File 'lib/docsplit.rb', line 62

def self.extract_pdf(docs, opts = {})
  PdfExtractor.new.extract(docs, opts)
end

.extract_text(pdfs, opts = {}) ⇒ `Object`

Use the ExtractText Java class to write out all embedded text.

# File 'lib/docsplit.rb', line 48

def self.extract_text(pdfs, opts = {})
  pdfs = ensure_pdfs(pdfs)
  TextExtractor.new.extract(pdfs, opts)
end
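
The class methods documented above might be combined as in the following sketch. It is illustrative only: `split_document`, the file name, and the output directory are hypothetical, and the `:output`, `:size`, and `:format` option keys are assumptions, since this page does not list the options each extractor accepts. The helper skips all work when the gem or the input file is missing.

```ruby
begin
  require 'docsplit'
rescue LoadError
  warn 'docsplit gem not available; extraction will be skipped'
end

# Hypothetical helper: run three of the PDF extractors on one file.
# Returns false (and does nothing) when the gem or the input is missing.
def split_document(pdf, out_dir)
  return false unless defined?(Docsplit) && File.exist?(pdf)

  # Option keys below are assumptions; check the extractor classes.
  Docsplit.extract_pages(pdf, output: File.join(out_dir, 'pages'))
  Docsplit.extract_text(pdf, output: File.join(out_dir, 'text'))
  Docsplit.extract_images(pdf, size: '700x', format: :png,
                               output: File.join(out_dir, 'images'))
  true
end

# 'example.pdf' and 'out' are placeholder names.
puts split_document('example.pdf', 'out')
```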
