Class: Docsplit::TextExtractor
- Inherits:
- Object
  - Object
    - Docsplit::TextExtractor
- Defined in:
- lib/docsplit/text_extractor.rb
Overview
Delegates to **pdftotext** and **tesseract** to extract text from PDF documents. The `--ocr` and `--no-ocr` flags can be used to force or forbid OCR extraction, but by default the heuristic works like this:
* Check for the presence of fonts in the PDF. If no fonts are detected,
OCR is used automatically.
* Extract the text of each page with **pdftotext**. If the page has fewer
than 100 bytes of text (a scanned image page, or a page that just
contains a filename and a page number), add it to the list of
`@pages_to_ocr`.
* Re-OCR each page in the `@pages_to_ocr` list at the end.
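The per-page part of this heuristic can be sketched as follows. This is a minimal illustration, not the library's code: `pages_needing_ocr` and `page_texts` are hypothetical names, and the sample strings stand in for whatever pdftotext would return for each page.

```ruby
# Illustrative sketch only: `pages_needing_ocr` and `page_texts` are
# hypothetical names, not part of Docsplit.
MIN_TEXT_PER_PAGE = 100 # in bytes, same value as the documented constant

# Given a hash of page number => text extracted directly from the PDF,
# return the page numbers whose text is too short to trust.
def pages_needing_ocr(page_texts)
  page_texts.select { |_page, text| text.bytesize < MIN_TEXT_PER_PAGE }.keys
end

page_texts = {
  1 => 'A' * 500,       # a normal page of text
  2 => '',              # a scanned image page
  3 => 'report.pdf  3'  # only a filename and a page number
}

p pages_needing_ocr(page_texts)
```

Pages 2 and 3 fall under the threshold, so in this sketch they are the ones that would be queued for OCR.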
Constant Summary
- NO_TEXT_DETECTED =
/---------\n\Z/
- OCR_FLAGS =
'-density 400x400 -colorspace GRAY'.freeze
- MEMORY_ARGS =
'-limit memory 256MiB -limit map 512MiB'.freeze
- MIN_TEXT_PER_PAGE =
100 # in bytes
Instance Method Summary
-
#contains_text?(pdf) ⇒ Boolean
Does a PDF have any text embedded?
-
#extract(pdfs, opts) ⇒ Object
Extract text from a list of PDFs.
-
#extract_from_ocr(pdf, pages) ⇒ Object
Extract a page range worth of text from a PDF via OCR.
-
#extract_from_pdf(pdf, pages) ⇒ Object
Extract a page range worth of text from a PDF, directly.
-
#initialize ⇒ TextExtractor
constructor
A new instance of TextExtractor.
Constructor Details
#initialize ⇒ TextExtractor
Returns a new instance of TextExtractor.
    # File 'lib/docsplit/text_extractor.rb', line 22
    def initialize
      @pages_to_ocr = []
    end
Instance Method Details
#contains_text?(pdf) ⇒ Boolean
Does a PDF have any text embedded?
    # File 'lib/docsplit/text_extractor.rb', line 45
    def contains_text?(pdf)
      fonts = `pdffonts #{ESCAPE[pdf]} 2>&1`
      !fonts.match(NO_TEXT_DETECTED)
    end
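The check relies on the shape of the `pdffonts` output: when no fonts are found, the output is only a header row followed by a row of dashes, so it ends in dashes and a newline. The sketch below shows the regular expression against two hand-written approximations of that output. The sample strings are assumptions rather than captured output, and `text_detected?` is a hypothetical stand-in that takes the output string instead of a PDF path.

```ruby
NO_TEXT_DETECTED = /---------\n\Z/

# Hand-written approximations of pdffonts output (assumed, not captured).
no_fonts = <<~OUT
  name                                 type              encoding         emb sub uni object ID
  ------------------------------------ ----------------- ---------------- --- --- --- ---------
OUT

# The same header with one font row appended after the dashes.
with_fonts = no_fonts +
  "ABCDEF+Helvetica                     Type 1C           WinAnsi          yes yes no       8  0\n"

# Hypothetical helper: same test as contains_text?, minus the shell call.
def text_detected?(pdffonts_output)
  !pdffonts_output.match(NO_TEXT_DETECTED)
end

p text_detected?(no_fonts)
p text_detected?(with_fonts)
```

Because the pattern is anchored with `\Z`, a dashed row only counts as "no text" when nothing follows it.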
#extract(pdfs, opts) ⇒ Object
Extract text from a list of PDFs.
    # File 'lib/docsplit/text_extractor.rb', line 27
    def extract(pdfs, opts)
      opts
      FileUtils.mkdir_p @output unless File.exist?(@output)
      [pdfs].flatten.each do |pdf|
        @pdf_name = File.basename(pdf, File.extname(pdf))
        pages = @pages == 'all' ? 1..Docsplit.extract_length(pdf) : @pages
        if @force_ocr || (!@forbid_ocr && !contains_text?(pdf))
          extract_from_ocr(pdf, pages)
        else
          extract_from_pdf(pdf, pages)
          if !@forbid_ocr && DEPENDENCIES[:tesseract] && !@pages_to_ocr.empty?
            extract_from_ocr(pdf, @pages_to_ocr)
          end
        end
      end
    end
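The branch at the heart of `extract` can be isolated as a small truth table. The helper below is a hypothetical sketch, not part of Docsplit: `extraction_path` and its keyword arguments are illustrative names standing in for `@force_ocr`, `@forbid_ocr` and the result of `contains_text?`.

```ruby
# Hypothetical helper mirroring the condition used in `extract`.
# Returns :ocr when the whole document would go through OCR,
# and :pdftotext when text would be read directly first.
def extraction_path(force_ocr:, forbid_ocr:, has_fonts:)
  if force_ocr || (!forbid_ocr && !has_fonts)
    :ocr
  else
    :pdftotext
  end
end

# Print the outcome for every combination of the three inputs.
[true, false].product([true, false], [true, false]).each do |force, forbid, fonts|
  path = extraction_path(force_ocr: force, forbid_ocr: forbid, has_fonts: fonts)
  puts "force=#{force} forbid=#{forbid} fonts=#{fonts} -> #{path}"
end
```

Two consequences follow from the condition as written: forcing OCR wins even if OCR is also forbidden, and a font-less PDF with OCR forbidden still takes the direct path. On the direct path, the pages collected in `@pages_to_ocr` get a second OCR pass only when OCR is not forbidden and tesseract is available.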
#extract_from_ocr(pdf, pages) ⇒ Object
Extract a page range worth of text from a PDF via OCR.
    # File 'lib/docsplit/text_extractor.rb', line 57
    def extract_from_ocr(pdf, pages)
      tempdir = Dir.mktmpdir
      base_path = File.join(@output, @pdf_name)
      escaped_pdf = ESCAPE[pdf]
      psm = @detect_orientation ? '-psm 1' : ''
      if pages
        pages.each do |page|
          tiff = "#{tempdir}/#{@pdf_name}_#{page}.tif"
          escaped_tiff = ESCAPE[tiff]
          file = "#{base_path}_#{page}"
          run "MAGICK_TMPDIR=#{tempdir} OMP_NUM_THREADS=2 gm convert -despeckle +adjoin #{MEMORY_ARGS} #{OCR_FLAGS} #{escaped_pdf}[#{page - 1}] #{escaped_tiff} 2>&1"
          run "tesseract #{escaped_tiff} #{ESCAPE[file]} -l #{@language} #{psm} 2>&1"
          clean_text(file + '.txt') if @clean_ocr
          FileUtils.remove_entry_secure tiff
        end
      else
        tiff = "#{tempdir}/#{@pdf_name}.tif"
        escaped_tiff = ESCAPE[tiff]
        run "MAGICK_TMPDIR=#{tempdir} OMP_NUM_THREADS=2 gm convert -despeckle #{MEMORY_ARGS} #{OCR_FLAGS} #{escaped_pdf} #{escaped_tiff} 2>&1"
        # if the user says don't do orientation detection or the plugin is not installed, set psm to 0
        run "tesseract #{escaped_tiff} #{base_path} -l #{@language} #{psm} 2>&1"
        clean_text(base_path + '.txt') if @clean_ocr
      end
    ensure
      FileUtils.remove_entry_secure tempdir if File.exist?(tempdir)
    end
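One detail in the per-page branch is easy to miss: page numbers are 1-based, while the frame index passed to `gm convert` in square brackets is 0-based, hence `page - 1`. The sketch below builds only the conversion command string so this can be inspected without running GraphicsMagick. `convert_command` is a hypothetical helper, and using `Shellwords.escape` in place of `ESCAPE` is an assumption about how the escaping is done.

```ruby
require 'shellwords'

OCR_FLAGS   = '-density 400x400 -colorspace GRAY'.freeze
MEMORY_ARGS = '-limit memory 256MiB -limit map 512MiB'.freeze

# Hypothetical helper: builds the command string for one page.
# Assumes shell escaping via Shellwords; nothing is executed here.
def convert_command(pdf, page, tiff)
  "gm convert -despeckle +adjoin #{MEMORY_ARGS} #{OCR_FLAGS} " \
    "#{Shellwords.escape(pdf)}[#{page - 1}] #{Shellwords.escape(tiff)}"
end

puts convert_command('my report.pdf', 1, '/tmp/my report_1.tif')
```

In this sketch, page 1 of the PDF is addressed as frame `[0]`, and the spaces in both paths are backslash-escaped before they reach the shell.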
#extract_from_pdf(pdf, pages) ⇒ Object
Extract a page range worth of text from a PDF, directly.
    # File 'lib/docsplit/text_extractor.rb', line 51
    def extract_from_pdf(pdf, pages)
      return extract_full(pdf) unless pages
      pages.each { |page| extract_page(pdf, page) }
    end