Class: Slaw::Extract::Extractor
- Inherits: Object
  - Object
    - Slaw::Extract::Extractor
- Includes: Logging
- Defined in: lib/slaw/extract/extractor.rb
Overview
Routines for extracting and cleaning up content from other formats, such as PDF.
You may need to set the location of the `pdftotext` binary.
On Mac OS X, use `brew install xpdf` or download it from www.foolabs.com/xpdf/download.html
On Heroku, you’ll need to jump through some hoops; see theprogrammingbutler.com/blog/archives/2011/07/28/running-pdftotext-on-heroku/
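One way to find a value for the `pdftotext` location is to search the directories on `PATH`. The following is a minimal sketch using only the Ruby standard library; `find_executable` is a hypothetical helper for illustration and is not part of Slaw.

```ruby
# Hypothetical helper (not part of Slaw): search a PATH-style string
# for an executable file with the given name.
def find_executable(name, path = ENV.fetch("PATH", ""))
  path.split(File::PATH_SEPARATOR).each do |dir|
    next if dir.empty?
    candidate = File.join(dir, name)
    return candidate if File.file?(candidate) && File.executable?(candidate)
  end
  nil
end

pdftotext = find_executable("pdftotext")
if pdftotext
  puts "Found pdftotext at #{pdftotext}"
  # With Slaw loaded, the result could be passed on:
  #   Slaw::Extract::Extractor.pdftotext_path = pdftotext
else
  puts "pdftotext not found on PATH"
end
```

If the binary lives outside `PATH`, its full path can be assigned directly instead.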
Constant Summary
- @@pdftotext_path = "pdftotext"
Instance Attribute Summary
- #cleanser ⇒ Object
  Object with text cleaning helpers.
Class Method Summary
- .pdftotext_path ⇒ Object
  Get location of the pdftotext executable for all instances.
- .pdftotext_path=(val) ⇒ Object
  Set location of the pdftotext executable for all instances.
Instance Method Summary
- #cleanup(text) ⇒ Object
  Run general once-off cleanup of extracted text.
- #extract_from_file(filename) ⇒ String
  Extract text from a file and run cleanup on it.
- #extract_from_pdf(filename) ⇒ String
  Extract text from a PDF.
- #extract_from_text(filename) ⇒ Object
- #extract_via_tika(filename) ⇒ Object
  Extract text from filename by sending it to Apache Tika (tika.apache.org/).
- #get_mimetype(filename) ⇒ Object
- #initialize ⇒ Extractor (constructor)
  A new instance of Extractor.
- #pdf_to_text_cmd(filename) ⇒ Array<String>
  Build a command for the external PDF-to-text utility.
- #remove_pdf_password(filename) ⇒ Object
Methods included from Logging
Constructor Details
Instance Attribute Details
#cleanser ⇒ Object
Object with text cleaning helpers.
# File 'lib/slaw/extract/extractor.rb', line 21
def cleanser
  @cleanser
end
Class Method Details
.pdftotext_path ⇒ Object
Get location of the pdftotext executable for all instances.
# File 'lib/slaw/extract/extractor.rb', line 131
def self.pdftotext_path
  @@pdftotext_path
end
.pdftotext_path=(val) ⇒ Object
Set location of the pdftotext executable for all instances.
# File 'lib/slaw/extract/extractor.rb', line 136
def self.pdftotext_path=(val)
  @@pdftotext_path = val
end
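Because the path is held in a class variable, setting it once affects every instance. The sketch below illustrates this with `DemoExtractor`, a stand-in class that mirrors the accessor pattern; it is not the real Slaw class, and the path used is only an example.

```ruby
# Stand-in class mirroring the class-level accessor pattern;
# not the real Slaw::Extract::Extractor.
class DemoExtractor
  @@pdftotext_path = "pdftotext"

  def self.pdftotext_path
    @@pdftotext_path
  end

  def self.pdftotext_path=(val)
    @@pdftotext_path = val
  end

  # Same argument layout as pdf_to_text_cmd in the documented class.
  def pdf_to_text_cmd(filename)
    [DemoExtractor.pdftotext_path, "-enc", "UTF-8", filename, "-"]
  end
end

a = DemoExtractor.new
b = DemoExtractor.new

# Example path only; adjust to wherever the binary is installed.
DemoExtractor.pdftotext_path = "/usr/local/bin/pdftotext"

# Both instances build their command with the new class-wide path.
p a.pdf_to_text_cmd("act.pdf")
p b.pdf_to_text_cmd("act.pdf")
```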
Instance Method Details
#cleanup(text) ⇒ Object
Run general once-off cleanup of extracted text.
# File 'lib/slaw/extract/extractor.rb', line 103
def cleanup(text)
  text = @cleanser.cleanup(text)
  text = @cleanser.remove_empty_lines(text)
  text = @cleanser.reformat(text)

  text
end
#extract_from_file(filename) ⇒ String
Extract text from a file and run cleanup on it.
# File 'lib/slaw/extract/extractor.rb', line 32
def extract_from_file(filename)
  mimetype = get_mimetype(filename)

  case mimetype && mimetype.type
  when 'application/pdf'
    extract_from_pdf(filename)
  when 'text/plain', nil
    extract_from_text(filename)
  else
    text = extract_via_tika(filename)
    if text.empty? or text.nil?
      raise ArgumentError.new("Unsupported file type #{mimetype || 'unknown'}")
    end
    text
  end
end
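The dispatch above can be illustrated in isolation. In the sketch below, `dispatch` and `fallback` are hypothetical stand-ins (the fallback plays the role of `extract_via_tika`). Note one difference from the listed source: the sketch tests `nil?` before `empty?`, since calling `empty?` on `nil` raises `NoMethodError` rather than the intended `ArgumentError`, should the fallback ever return `nil`.

```ruby
# Hypothetical stand-in for the MIME-type dispatch in extract_from_file.
# `fallback` is a callable that may return a String, "" or nil.
def dispatch(mimetype, fallback)
  case mimetype
  when "application/pdf"
    :pdf
  when "text/plain", nil
    :text
  else
    text = fallback.call
    # Check nil first: nil does not respond to empty?.
    if text.nil? || text.empty?
      raise ArgumentError, "Unsupported file type #{mimetype || 'unknown'}"
    end
    text
  end
end

p dispatch("application/pdf", -> { "" })
p dispatch(nil, -> { "" })
p dispatch("application/msword", -> { "converted text" })

begin
  dispatch("image/png", -> { nil })
rescue ArgumentError => e
  puts "Rejected: #{e.message}"
end
```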
#extract_from_pdf(filename) ⇒ String
Extract text from a PDF
# File 'lib/slaw/extract/extractor.rb', line 54
def extract_from_pdf(filename)
  retried = false

  while true
    cmd = pdf_to_text_cmd(filename)
    logger.info("Executing: #{cmd}")
    stdout, status = Open3.capture2(*cmd)

    case status.exitstatus
    when 0
      return cleanup(stdout)
    when 3
      return nil if retried
      retried = true
      self.remove_pdf_password(filename)
    else
      return nil
    end
  end
end
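The loop above retries exactly once when the command exits with status 3, which Xpdf's `pdftotext` documents as an error related to PDF permissions. The retry-once pattern can be sketched on its own; `run_with_one_retry`, `build_cmd` and `repair` are hypothetical names, and a Ruby subprocess stands in for `pdftotext` so the example runs without it installed.

```ruby
require "open3"
require "rbconfig"

# Hypothetical helper: run the command returned by build_cmd, and if it
# exits with status 3, call repair and try one more time.
def run_with_one_retry(build_cmd, repair)
  retried = false
  loop do
    stdout, status = Open3.capture2(*build_cmd.call)
    case status.exitstatus
    when 0
      return stdout
    when 3
      return nil if retried
      retried = true
      repair.call
    else
      return nil
    end
  end
end

# Simulated tool: exits with status 3 until "repaired", then prints text.
repaired = false
build_cmd = lambda do
  script = repaired ? 'print "text"' : "exit 3"
  [RbConfig.ruby, "-e", script]
end
repair = -> { repaired = true }

p run_with_one_retry(build_cmd, repair)
```

Rebuilding the command on each pass matters in the original too, because the repair step replaces the file in place and the second attempt must read the new contents.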
#extract_from_text(filename) ⇒ Object
# File 'lib/slaw/extract/extractor.rb', line 84
def extract_from_text(filename)
  cleanup(File.read(filename))
end
#extract_via_tika(filename) ⇒ Object
Extract text from filename by sending it to Apache Tika (tika.apache.org/).
# File 'lib/slaw/extract/extractor.rb', line 90
def extract_via_tika(filename)
  # the Yomu gem falls over when trying to write large amounts of data
  # the JVM stdin, so we manually call java ourselves, relying on yomu
  # to supply the gem
  require 'slaw/extract/yomu_patch'

  logger.info("Using Tika to get text from #{filename}. You need a JVM installed for this.")
  text = Yomu.text_from_file(filename)
  logger.info("Tika returned #{text.length} bytes")
  text
end
#get_mimetype(filename) ⇒ Object
# File 'lib/slaw/extract/extractor.rb', line 125
def get_mimetype(filename)
  File.open(filename) { |f| MimeMagic.by_magic(f) } \
    || MimeMagic.by_path(filename)
end
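The method above prefers detection from file contents and falls back to the file name only when the contents give no answer. The sketch below shows the same "content first, then extension" idea using only the standard library. The lookup tables and the `sniff_by_magic`, `sniff_by_path` and `guess_mimetype` helpers are hypothetical and far smaller than what MimeMagic provides.

```ruby
require "tempfile"

# Hypothetical, deliberately tiny lookup tables for illustration.
MAGIC = { "%PDF-" => "application/pdf" }.freeze
EXTENSIONS = { ".pdf" => "application/pdf", ".txt" => "text/plain" }.freeze

# Look at the first five bytes of the stream.
def sniff_by_magic(io)
  MAGIC[io.read(5).to_s]
end

# Fall back to the file extension.
def sniff_by_path(filename)
  EXTENSIONS[File.extname(filename).downcase]
end

def guess_mimetype(filename)
  File.open(filename, "rb") { |f| sniff_by_magic(f) } || sniff_by_path(filename)
end

# A PDF saved with a misleading .txt extension: the contents win.
Tempfile.create(["doc", ".txt"]) do |f|
  f.write("%PDF-1.4\n")
  f.flush
  p guess_mimetype(f.path)
end

# Plain text with no recognisable signature: the extension decides.
Tempfile.create(["notes", ".txt"]) do |f|
  f.write("hello\n")
  f.flush
  p guess_mimetype(f.path)
end
```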
#pdf_to_text_cmd(filename) ⇒ Array<String>
Build a command for the external PDF-to-text utility.
# File 'lib/slaw/extract/extractor.rb', line 80
def pdf_to_text_cmd(filename)
  [Extractor.pdftotext_path, "-enc", "UTF-8", filename, "-"]
end
#remove_pdf_password(filename) ⇒ Object
# File 'lib/slaw/extract/extractor.rb', line 111
def remove_pdf_password(filename)
  file = Tempfile.new('steno')
  begin
    logger.info("Trying to remove password from #{filename}")
    cmd = "gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=#{file.path} -c .setpdfwrite -f #{filename}".split(" ")
    logger.info("Executing: #{cmd}")
    Open3.capture2(*cmd)
    FileUtils.move(file.path, filename)
  ensure
    file.close
    file.unlink
  end
end
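One property of the listing above is worth noting: the command is built as a single string and then split on spaces, so a filename containing a space would be broken into several arguments. A possible alternative, sketched here under the assumption that the same Ghostscript flags are wanted, is to build the argument array directly. `gs_remove_password_cmd` is a hypothetical helper and the paths are examples only.

```ruby
# Hypothetical helper: build the Ghostscript argument list as an array so
# that paths containing spaces remain single arguments. The flags are the
# ones used in the listing above.
def gs_remove_password_cmd(input, output)
  [
    "gs", "-q", "-dNOPAUSE", "-dBATCH", "-sDEVICE=pdfwrite",
    "-sOutputFile=#{output}", "-c", ".setpdfwrite", "-f", input
  ]
end

cmd = gs_remove_password_cmd("my act.pdf", "/tmp/out.pdf")
p cmd.last

# For comparison, interpolating and then splitting breaks the name apart.
p "gs -f my act.pdf".split(" ").last(2)
```

An array built this way can be passed to `Open3.capture2(*cmd)` exactly as in the original, since no shell is involved when the command is given as separate arguments.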