Class: Slaw::Extract::Extractor
- Inherits: Object
  - Object
  - Slaw::Extract::Extractor
- Includes:
- Logging
- Defined in:
- lib/slaw/extract/extractor.rb
Overview
Routines for extracting and cleaning up content from other formats, such as PDF.
You may need to set the location of the 'pdftotext' binary.
On Mac OS X, use 'brew install xpdf' or download it from www.foolabs.com/xpdf/download.html
On Heroku, you’ll need to jump through some hoops; see theprogrammingbutler.com/blog/archives/2011/07/28/running-pdftotext-on-heroku/
Constant Summary
- @@pdftotext_path = "pdftotext"
Class Method Summary
- .pdftotext_path ⇒ Object
  Get location of the pdftotext executable for all instances.
- .pdftotext_path=(val) ⇒ Object
  Set location of the pdftotext executable for all instances.
Instance Method Summary
- #extract_from_file(filename) ⇒ String
  Extract text from a file.
- #extract_from_html(filename) ⇒ Object
- #extract_from_pdf(filename) ⇒ String
  Extract text from a PDF.
- #extract_from_text(filename) ⇒ Object
- #extract_via_tika(filename) ⇒ Object
  Extract text from filename by sending it to Apache Tika (tika.apache.org/).
- #get_mimetype(filename) ⇒ Object
- #html_to_text(html) ⇒ Object
- #pdf_to_text_cmd(filename) ⇒ Array<String>
  Build a command for the external PDF-to-text utility.
- #remove_pdf_password(filename) ⇒ Object
Methods included from Logging
Class Method Details
.pdftotext_path ⇒ Object
Get location of the pdftotext executable for all instances.
# File 'lib/slaw/extract/extractor.rb', line 131
def self.pdftotext_path
  @@pdftotext_path
end
.pdftotext_path=(val) ⇒ Object
Set location of the pdftotext executable for all instances.
# File 'lib/slaw/extract/extractor.rb', line 136
def self.pdftotext_path=(val)
  @@pdftotext_path = val
end
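Because the path lives in a class variable, one assignment is seen by every extractor, including instances created before the change. The sketch below illustrates this with `DemoExtractor`, a hypothetical stand-in that copies only the two accessors and the command builder from this page; the path `/usr/local/bin/pdftotext` is an example value, not a default of the library.

```ruby
# Hypothetical stand-in for Slaw::Extract::Extractor, reduced to the
# class-level path setting and the command builder shown on this page.
class DemoExtractor
  @@pdftotext_path = "pdftotext"

  def self.pdftotext_path
    @@pdftotext_path
  end

  def self.pdftotext_path=(val)
    @@pdftotext_path = val
  end

  def pdf_to_text_cmd(filename)
    [DemoExtractor.pdftotext_path, "-enc", "UTF-8", filename, "-"]
  end
end

first = DemoExtractor.new
# Example path only (an assumption); point this at your own binary.
DemoExtractor.pdftotext_path = "/usr/local/bin/pdftotext"
second = DemoExtractor.new

# Both instances now build commands with the new path.
puts first.pdf_to_text_cmd("act.pdf").inspect
puts second.pdf_to_text_cmd("act.pdf").inspect
```

With the real class, the equivalent assignment would be made on `Slaw::Extract::Extractor.pdftotext_path=` once, at start-up, before any extraction runs.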
Instance Method Details
#extract_from_file(filename) ⇒ String
Extract text from a file.
# File 'lib/slaw/extract/extractor.rb', line 25
def extract_from_file(filename)
  mimetype = get_mimetype(filename)

  case mimetype && mimetype.type
  when 'application/pdf'
    extract_from_pdf(filename)
  when 'text/html', nil
    extract_from_html(filename)
  when 'text/plain', nil
    extract_from_text(filename)
  else
    text = extract_via_tika(filename)
    if text.empty? or text.nil?
      raise ArgumentError.new("Unsupported file type #{mimetype || 'unknown'}")
    end
    text
  end
end
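The dispatch above can be traced in isolation. In the sketch below, `branch_for` is a hypothetical helper that mirrors the `case` structure and returns a symbol naming the branch instead of calling an extractor. Ruby's `case` picks the first `when` clause that matches, so a `nil` type (no MIME type detected) is handled by the HTML branch, and the `nil` listed on the plain-text clause is never reached.

```ruby
# Hypothetical helper: names the branch that a given MIME type string
# (or nil, when no type was detected) would select.
def branch_for(type)
  case type
  when 'application/pdf'
    :pdf
  when 'text/html', nil
    :html
  when 'text/plain', nil # this nil is shadowed by the clause above
    :text
  else
    :tika
  end
end

['application/pdf', 'text/html', 'text/plain', nil, 'application/msword'].each do |type|
  puts "#{type.inspect} -> #{branch_for(type)}"
end
```

Any type outside the three named ones, such as `application/msword` here, falls through to the Tika branch.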
#extract_from_html(filename) ⇒ Object
# File 'lib/slaw/extract/extractor.rb', line 83
def extract_from_html(filename)
  html_to_text(File.read(filename))
end
#extract_from_pdf(filename) ⇒ String
Extract text from a PDF.
# File 'lib/slaw/extract/extractor.rb', line 49
def extract_from_pdf(filename)
  retried = false

  while true
    cmd = pdf_to_text_cmd(filename)
    logger.info("Executing: #{cmd}")
    stdout, status = Open3.capture2(*cmd)

    case status.exitstatus
    when 0
      return stdout
    when 3
      return nil if retried
      retried = true
      self.remove_pdf_password(filename)
    else
      return nil
    end
  end
end
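The control flow of this method is a retry-once loop: exit status 0 returns the text, exit status 3 (which the pdftotext man page lists as an error related to PDF permissions) triggers a single attempt to unlock the file, and anything else gives up with `nil`. The sketch below reproduces only that flow. `extract_with_one_retry` and its two callable parameters are hypothetical; they replace the real `Open3.capture2` call and `remove_pdf_password` so the loop can run without pdftotext installed.

```ruby
# Hypothetical helper: `run` is any callable returning [output, exit_status];
# `unlock` is any callable that tries to strip the password.
def extract_with_one_retry(run, unlock)
  retried = false

  loop do
    output, exit_status = run.call

    case exit_status
    when 0
      return output            # success
    when 3
      return nil if retried    # already tried unlocking once, give up
      retried = true
      unlock.call
    else
      return nil               # any other failure is not retried
    end
  end
end

# Fake runner: reports status 3 once, then succeeds.
outcomes = [["", 3], ["Extracted text", 0]]
unlock_calls = 0
result = extract_with_one_retry(-> { outcomes.shift }, -> { unlock_calls += 1 })
puts result.inspect
puts unlock_calls
```

Note that a caller cannot tell from the `nil` return value whether the file was still locked after the retry or failed for another reason.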
#extract_from_text(filename) ⇒ Object
# File 'lib/slaw/extract/extractor.rb', line 79
def extract_from_text(filename)
  File.read(filename)
end
#extract_via_tika(filename) ⇒ Object
Extract text from filename by sending it to Apache Tika (tika.apache.org/).
# File 'lib/slaw/extract/extractor.rb', line 89
def extract_via_tika(filename)
  # the Yomu gem falls over when trying to write large amounts of data to
  # the JVM stdin, so we manually call java ourselves, relying on yomu
  # to supply the gem
  require 'slaw/extract/yomu_patch'
  logger.info("Using Tika to get text from #{filename}. You need a JVM installed for this.")

  html = Yomu.text_from_file(filename)

  logger.info("Tika returned #{html.length} bytes")
  # transform html into text
  html_to_text(html)
end
#get_mimetype(filename) ⇒ Object
# File 'lib/slaw/extract/extractor.rb', line 125
def get_mimetype(filename)
  File.open(filename) { |f| MimeMagic.by_magic(f) } \
    || MimeMagic.by_path(filename)
end
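The method tries content-based detection first and falls back to the file name only when the content gives no answer. The sketch below shows the same "sniff, then fall back" pattern using only the standard library. It does not use MimeMagic: the two lookup tables and the helper names (`sniff_by_content`, `sniff_by_name`, `guess_mimetype`) are hypothetical and far smaller than a real MIME database.

```ruby
require 'tempfile'

# Hypothetical, much-reduced lookup tables for illustration only.
MAGIC_PREFIXES = { "%PDF-" => "application/pdf" }.freeze
EXTENSION_TYPES = {
  ".html" => "text/html",
  ".txt"  => "text/plain",
  ".pdf"  => "application/pdf"
}.freeze

# Look at the first bytes of the stream; nil when nothing matches.
def sniff_by_content(io)
  head = io.read(8) || ""
  match = MAGIC_PREFIXES.find { |prefix, _type| head.start_with?(prefix) }
  match && match.last
end

# Fall back to the file extension; nil when it is unknown.
def sniff_by_name(filename)
  EXTENSION_TYPES[File.extname(filename).downcase]
end

def guess_mimetype(filename)
  File.open(filename, "rb") { |f| sniff_by_content(f) } || sniff_by_name(filename)
end

# A PDF saved with a misleading .txt extension: the content wins.
Tempfile.create(["mislabelled", ".txt"]) do |f|
  f.write("%PDF-1.4\n")
  f.flush
  puts guess_mimetype(f.path)
end
```

When both steps return `nil`, the caller receives `nil`, which is the case `extract_from_file` guards against with `mimetype && mimetype.type`.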
#html_to_text(html) ⇒ Object
# File 'lib/slaw/extract/extractor.rb', line 102
def html_to_text(html)
  here = File.dirname(__FILE__)
  xslt = Nokogiri::XSLT(File.open(File.join([here, 'html_to_akn_text.xsl'])))

  text = xslt.transform(Nokogiri::HTML(html)).to_s
  # remove XML encoding at top
  text.sub(/^<\?xml [^>]*>/, '')
end
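The final substitution removes the XML declaration that the serialised transform result may start with. The sketch below isolates that one step; `strip_xml_declaration` is a hypothetical wrapper around the same regular expression, and the sample strings are made up for illustration.

```ruby
# Hypothetical wrapper around the substitution used at the end of html_to_text.
def strip_xml_declaration(text)
  text.sub(/^<\?xml [^>]*>/, '')
end

with_declaration = %(<?xml version="1.0" encoding="UTF-8"?>\nPART 1\nInterpretation\n)
without_declaration = "PART 1\nInterpretation\n"

puts strip_xml_declaration(with_declaration).inspect
puts strip_xml_declaration(without_declaration).inspect
```

The pattern consumes only the declaration itself, so the line break that followed it stays at the start of the result; text without a declaration passes through unchanged.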
#pdf_to_text_cmd(filename) ⇒ Array<String>
Build a command for the external PDF-to-text utility.
# File 'lib/slaw/extract/extractor.rb', line 75
def pdf_to_text_cmd(filename)
  [Extractor.pdftotext_path, "-enc", "UTF-8", filename, "-"]
end
#remove_pdf_password(filename) ⇒ Object
# File 'lib/slaw/extract/extractor.rb', line 111
def remove_pdf_password(filename)
  file = Tempfile.new('steno')
  begin
    logger.info("Trying to remove password from #{filename}")
    cmd = "gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=#{file.path} -c .setpdfwrite -f #{filename}".split(" ")
    logger.info("Executing: #{cmd}")
    Open3.capture2(*cmd)
    FileUtils.move(file.path, filename)
  ensure
    file.close
    file.unlink
  end
end
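The listing above builds the Ghostscript command as one string and then splits it on spaces, which means a file name containing a space is broken into several arguments. The sketch below shows one possible alternative, assuming the same flags as the listing: `gs_unlock_cmd` is a hypothetical helper that builds the argument array directly, the way `pdf_to_text_cmd` does. It only constructs the command and does not run Ghostscript.

```ruby
# Hypothetical helper: the same Ghostscript flags as the listing, built as
# an array so that each path stays a single argument.
def gs_unlock_cmd(input_path, output_path)
  [
    "gs", "-q", "-dNOPAUSE", "-dBATCH", "-sDEVICE=pdfwrite",
    "-sOutputFile=#{output_path}",
    "-c", ".setpdfwrite",
    "-f", input_path
  ]
end

path = "My Act.pdf" # example file name containing a space

# String interpolation followed by split(" ") cuts the name in two.
puts "gs -q -f #{path}".split(" ").inspect

# The array form keeps it whole.
puts gs_unlock_cmd(path, "/tmp/unlocked.pdf").inspect
```

An array built this way can be passed to `Open3.capture2(*cmd)` exactly as the split string is in the listing.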