Module: Docsplit
- Extended by:
- TransparentPDFs
- Defined in:
- lib/docsplit.rb,
lib/docsplit/version.rb,
lib/docsplit/command_line.rb,
lib/docsplit/text_cleaner.rb,
lib/docsplit/pdf_extractor.rb,
lib/docsplit/info_extractor.rb,
lib/docsplit/page_extractor.rb,
lib/docsplit/text_extractor.rb,
lib/docsplit/image_extractor.rb,
lib/docsplit/transparent_pdfs.rb
Overview
The Docsplit module delegates to the Java PDF extractors.
Defined Under Namespace
Modules: TransparentPDFs
Classes: CommandLine, ExtractionFailed, ImageExtractor, InfoExtractor, PageExtractor, PdfExtractor, TextCleaner, TextExtractor
Constant Summary
- ESCAPE =
->(x) { Shellwords.shellescape(x) }
- ROOT =
File.expand_path(File.dirname(__FILE__) + '/..')
- ESCAPED_ROOT =
ESCAPE[ROOT]
- METADATA_KEYS =
[:author, :date, :creator, :keywords, :producer, :subject, :title, :length].freeze
- GM_FORMATS =
['image/gif', 'image/jpeg', 'image/png', 'image/x-ms-bmp', 'image/svg+xml', 'image/tiff', 'image/x-portable-bitmap', 'application/postscript', 'image/x-portable-pixmap'].freeze
- DEPENDENCIES =
{ java: false, gm: false, pdftotext: false, pdftk: false, pdftailor: false, tesseract: false, osd: false }
- VERSION =
'0.7.9'.freeze
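Two idioms in the constants above can be tried in isolation: invoking a lambda with `[]` (as in `ESCAPE[ROOT]`) and freezing a constant array. The following is a minimal sketch that assumes only Ruby's standard `shellwords` library. The `ConstantsSketch` module and the sample file name are illustrative and not part of Docsplit.

```ruby
require 'shellwords'

# Illustrative stand-in module; it mirrors how ESCAPE and METADATA_KEYS
# are declared above but is not Docsplit itself.
module ConstantsSketch
  # A lambda can be invoked with [] as well as .call,
  # which is why an expression like ESCAPE[ROOT] works.
  ESCAPE = ->(x) { Shellwords.shellescape(x) }

  METADATA_KEYS = [:author, :date, :creator, :keywords,
                   :producer, :subject, :title, :length].freeze
end

# Spaces and parentheses are backslash-escaped for safe use in a shell command.
puts ConstantsSketch::ESCAPE['annual report (final).pdf']

# A frozen array rejects in-place modification. On Ruby 2.5+ the error is
# FrozenError, which is a subclass of RuntimeError.
begin
  ConstantsSketch::METADATA_KEYS << :pages
rescue RuntimeError => e
  puts "cannot modify frozen constant: #{e.class}"
end
```

Escaping paths before interpolating them into a command string matters here because the module shells out to external tools.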
Class Method Summary
- .clean_text(text) ⇒ Object
Utility method to clean OCR’d text with garbage characters.
- .extract_images(pdfs, opts = {}) ⇒ Object
Use the ExtractImages Java class to rasterize each page of a PDF into an image.
- .extract_info(pdfs, opts = {}) ⇒ Object
- .extract_pages(pdfs, opts = {}) ⇒ Object
Use the ExtractPages Java class to burst a PDF into single pages.
- .extract_pdf(docs, opts = {}) ⇒ Object
Use JODConverter to extract the documents as PDFs.
- .extract_text(pdfs, opts = {}) ⇒ Object
Use the ExtractText Java class to write out all embedded text.
Methods included from TransparentPDFs
Class Method Details
.clean_text(text) ⇒ Object
Utility method to clean OCR’d text with garbage characters.
    # File 'lib/docsplit.rb', line 83
    def self.clean_text(text)
      TextCleaner.new.clean(text)
    end
.extract_images(pdfs, opts = {}) ⇒ Object
Use the ExtractImages Java class to rasterize each page of a PDF into an image.
    # File 'lib/docsplit.rb', line 54
    def self.extract_images(pdfs, opts = {})
      pdfs = ensure_pdfs(pdfs)
      opts[:pages] = normalize_value(opts[:pages]) if opts[:pages]
      ImageExtractor.new.extract(pdfs, opts)
    end
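The body of `normalize_value` is not shown on this page. As a hedged illustration of why `opts[:pages]` might be normalized before delegation, the sketch below flattens a Range or Array page selection into a comma-separated string that could be passed on a command line. The helper name `normalize_pages_sketch` and its behavior are assumptions, not Docsplit's actual implementation.

```ruby
# Hypothetical helper: turn a page selection into a comma-separated string.
def normalize_pages_sketch(value)
  case value
  when Range then value.to_a.join(',')
  when Array then value.map { |v| normalize_pages_sketch(v) }.join(',')
  else            value.to_s
  end
end

opts = { pages: [1, 3..5, 9] }
# Same guard as in extract_images: only touch the option when it is present.
opts[:pages] = normalize_pages_sketch(opts[:pages]) if opts[:pages]
puts opts[:pages]
```

Normalizing once at the module boundary means the extractor only ever has to handle one representation of the page list.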
.extract_info(pdfs, opts = {}) ⇒ Object
    # File 'lib/docsplit.rb', line 77
    def self.extract_info(pdfs, opts = {})
      pdfs = ensure_pdfs(pdfs)
      InfoExtractor.new.extract_all(pdfs, opts)
    end
.extract_pages(pdfs, opts = {}) ⇒ Object
Use the ExtractPages Java class to burst a PDF into single pages.
    # File 'lib/docsplit.rb', line 42
    def self.extract_pages(pdfs, opts = {})
      pdfs = ensure_pdfs(pdfs)
      PageExtractor.new.extract(pdfs, opts)
    end
.extract_pdf(docs, opts = {}) ⇒ Object
Use JODConverter to extract the documents as PDFs. If the document is in an image format, use GraphicsMagick to extract the PDF.
    # File 'lib/docsplit.rb', line 62
    def self.extract_pdf(docs, opts = {})
      PdfExtractor.new.extract(docs, opts)
    end
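The choice between the two converters described above can be pictured as a lookup against the `GM_FORMATS` list. The sketch below is only an assumption about the shape of that decision. The name `converter_for_sketch` and the returned symbols are hypothetical, and how Docsplit actually detects a document's MIME type is not shown on this page.

```ruby
# MIME types copied from the GM_FORMATS constant documented above.
GM_FORMATS_SKETCH = ['image/gif', 'image/jpeg', 'image/png', 'image/x-ms-bmp',
                     'image/svg+xml', 'image/tiff', 'image/x-portable-bitmap',
                     'application/postscript', 'image/x-portable-pixmap'].freeze

# Hypothetical dispatcher: listed image formats go to GraphicsMagick,
# everything else goes to the office-document converter.
def converter_for_sketch(mime_type)
  GM_FORMATS_SKETCH.include?(mime_type) ? :graphicsmagick : :jodconverter
end

puts converter_for_sketch('image/png')
puts converter_for_sketch('application/msword')
```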
.extract_text(pdfs, opts = {}) ⇒ Object
Use the ExtractText Java class to write out all embedded text.
    # File 'lib/docsplit.rb', line 48
    def self.extract_text(pdfs, opts = {})
      pdfs = ensure_pdfs(pdfs)
      TextExtractor.new.extract(pdfs, opts)
    end
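Most of the class methods above share one shape: pass the input through `ensure_pdfs`, then hand it to a freshly created extractor. The following is a minimal sketch of that pattern using stand-ins only. `FacadeSketch`, `RecordingExtractorSketch`, and the extension check inside the stand-in `ensure_pdfs` are all assumptions made for illustration. The real `ensure_pdfs` is not shown on this page and may behave differently.

```ruby
# Stand-in for an extractor class such as PageExtractor.
# It only records what it was given.
class RecordingExtractorSketch
  def extract(pdfs, opts)
    { pdfs: pdfs, opts: opts }
  end
end

module FacadeSketch
  # Hypothetical stand-in for ensure_pdfs: accept one path or many and
  # reject anything that does not end in .pdf. Returns the list of paths.
  def self.ensure_pdfs(docs)
    Array(docs).flatten.each do |doc|
      unless File.extname(doc).casecmp('.pdf').zero?
        raise ArgumentError, "not a PDF: #{doc}"
      end
    end
  end

  # Same shape as the methods above: normalize the input, then delegate.
  def self.extract_pages(pdfs, opts = {})
    pdfs = ensure_pdfs(pdfs)
    RecordingExtractorSketch.new.extract(pdfs, opts)
  end
end

result = FacadeSketch.extract_pages('report.pdf')
puts result[:pdfs].inspect
```

Keeping the module methods this thin leaves each extractor class responsible for its own work, while callers only need the module-level entry points.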