Class: Slaw::Extract::Extractor

Inherits:
Object
Includes:
Logging
Defined in:
lib/slaw/extract/extractor.rb

Overview

Routines for extracting and cleaning up content from other formats, such as PDF.

You may need to set the location of the `pdftotext` binary.

On Mac OS X, use `brew install xpdf` or download it from www.foolabs.com/xpdf/download.html

On Heroku, you’ll need to jump through some hoops; see theprogrammingbutler.com/blog/archives/2011/07/28/running-pdftotext-on-heroku/

Constant Summary

@@pdftotext_path = "pdftotext"

Methods included from Logging

#logger

Constructor Details

#initialize ⇒ Extractor

Returns a new instance of Extractor.



# File 'lib/slaw/extract/extractor.rb', line 23

def initialize
  @cleanser = Slaw::Parse::Cleanser.new
end

Instance Attribute Details

#cleanser ⇒ Object

Object with text cleaning helpers



# File 'lib/slaw/extract/extractor.rb', line 21

def cleanser
  @cleanser
end

Class Method Details

.pdftotext_path ⇒ Object

Get location of the pdftotext executable for all instances.



# File 'lib/slaw/extract/extractor.rb', line 131

def self.pdftotext_path
  @@pdftotext_path
end

.pdftotext_path=(val) ⇒ Object

Set location of the pdftotext executable for all instances.



# File 'lib/slaw/extract/extractor.rb', line 136

def self.pdftotext_path=(val)
  @@pdftotext_path = val
end
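
A minimal sketch of how the path might be configured at startup. The `find_executable` helper is hypothetical (it is not part of Slaw); it simply searches the directories on `PATH`. The assignment to `Slaw::Extract::Extractor.pdftotext_path` is guarded so the snippet also runs where Slaw is not loaded.

```ruby
# Hypothetical helper, not part of Slaw: return the full path of the first
# executable called +name+ found in the PATH-style string +path_env+, or nil.
def find_executable(name, path_env = ENV.fetch("PATH", ""))
  path_env.split(File::PATH_SEPARATOR).each do |dir|
    candidate = File.join(dir, name)
    return candidate if File.file?(candidate) && File.executable?(candidate)
  end
  nil
end

path = find_executable("pdftotext")

# Only touch the real class when Slaw has been loaded.
if path && defined?(Slaw::Extract::Extractor)
  Slaw::Extract::Extractor.pdftotext_path = path
end
```

Because the setting is stored in a class variable, it applies to every Extractor instance, so it is probably best set once during application boot.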

Instance Method Details

#cleanup(text) ⇒ Object

Run general once-off cleanup of extracted text.



# File 'lib/slaw/extract/extractor.rb', line 103

def cleanup(text)
  text = @cleanser.cleanup(text)
  text = @cleanser.remove_empty_lines(text)
  text = @cleanser.reformat(text)

  text
end

#extract_from_file(filename) ⇒ String

Extract text from a file and run cleanup on it.

Parameters:

  • filename (String)

    filename to extract from

Returns:

  • (String)

    extracted text



# File 'lib/slaw/extract/extractor.rb', line 32

def extract_from_file(filename)
  mimetype = get_mimetype(filename)

  case mimetype && mimetype.type
  when 'application/pdf'
    extract_from_pdf(filename)
  when 'text/plain', nil
    extract_from_text(filename)
  else
    text = extract_via_tika(filename)
    if text.nil? || text.empty?
      raise ArgumentError.new("Unsupported file type #{mimetype || 'unknown'}")
    end
    text
  end
end
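
The dispatch above can be sketched in isolation. `choose_extractor` is a hypothetical stand-in that returns a symbol instead of calling the real extraction methods; it only illustrates that a missing MIME type (`nil`) is treated as plain text and that anything unrecognised falls through to Tika.

```ruby
# Hypothetical stand-in for the case statement in extract_from_file.
# +mime_type+ is a MIME type string, or nil when detection failed.
def choose_extractor(mime_type)
  case mime_type
  when "application/pdf" then :pdf
  when "text/plain", nil then :text   # unknown type is assumed to be text
  else :tika                          # everything else is handed to Tika
  end
end

choose_extractor("application/pdf")    # => :pdf
choose_extractor(nil)                  # => :text
choose_extractor("application/msword") # => :tika
```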

#extract_from_pdf(filename) ⇒ String

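Extract text from a PDF using pdftotext. If pdftotext exits with status 3 (a PDF permissions error), the password is removed and extraction is retried once.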

Parameters:

  • filename (String)

    filename to extract from

Returns:

  • (String)

    extracted text, or nil if extraction fails



# File 'lib/slaw/extract/extractor.rb', line 54

def extract_from_pdf(filename)
  retried = false

  while true
    cmd = pdf_to_text_cmd(filename)
    logger.info("Executing: #{cmd}")
    stdout, status = Open3.capture2(*cmd)

    case status.exitstatus
    when 0
      return cleanup(stdout)
    when 3
      return nil if retried
      retried = true
      self.remove_pdf_password(filename)
    else
      return nil
    end
  end
end
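
The retry-once control flow can be sketched without pdftotext installed. Here `extract_with_retry`, `run`, and `repair` are hypothetical names: `run` is any callable returning `[output, exit_code]`, and `repair` stands in for `remove_pdf_password`.

```ruby
# Hypothetical sketch of the loop in extract_from_pdf.
# run:    callable returning [output, exit_code]
# repair: callable invoked at most once, after the first exit code 3
def extract_with_retry(run, repair)
  retried = false

  loop do
    output, code = run.call

    case code
    when 0
      return output            # success
    when 3
      return nil if retried    # already repaired once; give up
      retried = true
      repair.call
    else
      return nil               # any other failure is not retried
    end
  end
end

# Simulate one permissions failure followed by success.
codes = [3, 0]
extract_with_retry(-> { ["text", codes.shift] }, -> { }) # => "text"
```

Note that failures are reported by returning `nil` rather than raising, so callers need to check the result.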

#extract_from_text(filename) ⇒ Object



# File 'lib/slaw/extract/extractor.rb', line 84

def extract_from_text(filename)
  cleanup(File.read(filename))
end

#extract_via_tika(filename) ⇒ Object

Extract text from filename by sending it to Apache Tika (tika.apache.org/).



# File 'lib/slaw/extract/extractor.rb', line 90

def extract_via_tika(filename)
  # The Yomu gem falls over when trying to write large amounts of data to
  # the JVM's stdin, so we call java ourselves, relying on Yomu
  # to supply the gem.
  require 'slaw/extract/yomu_patch'
  logger.info("Using Tika to get text from #{filename}. You need a JVM installed for this.")

  text = Yomu.text_from_file(filename)
  logger.info("Tika returned #{text.length} bytes")
  text
end

#get_mimetype(filename) ⇒ Object



# File 'lib/slaw/extract/extractor.rb', line 125

def get_mimetype(filename)
  File.open(filename) { |f| MimeMagic.by_magic(f) } \
    || MimeMagic.by_path(filename)
end
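
The "content first, filename second" strategy can be illustrated with a much simplified stand-in. `sniff_type` is hypothetical and recognises only the `%PDF-` signature and two extensions; MimeMagic consults a far larger database, so this is a sketch of the idea rather than a replacement.

```ruby
# Hypothetical, heavily simplified stand-in for get_mimetype.
# Checks the leading bytes first, then falls back to the file extension.
def sniff_type(filename)
  header = File.open(filename, "rb") { |f| f.read(5) }
  return "application/pdf" if header == "%PDF-"

  case File.extname(filename).downcase
  when ".pdf" then "application/pdf"
  when ".txt" then "text/plain"
  end # returns nil when neither check matches
end
```

Checking the content first means a PDF saved with the wrong extension is still routed to the PDF extractor.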

#pdf_to_text_cmd(filename) ⇒ Array<String>

Build a command for the external PDF-to-text utility.

Parameters:

  • filename (String)

    the pdf file

Returns:

  • (Array<String>)

    command and params to execute



# File 'lib/slaw/extract/extractor.rb', line 80

def pdf_to_text_cmd(filename)
  [Extractor.pdftotext_path, "-enc", "UTF-8", filename, "-"]
end
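
Returning the command as an array matters because `Open3.capture2(*cmd)` then passes each element to the program as a separate argument, with no shell in between. The sketch below demonstrates this with the running Ruby interpreter in place of pdftotext; `run_argv` is a hypothetical helper.

```ruby
require "open3"
require "rbconfig"

# Hypothetical helper: run an argv array and return [stdout, exit_code].
def run_argv(argv)
  stdout, status = Open3.capture2(*argv)
  [stdout, status.exitstatus]
end

# The filename contains a space but still arrives as a single argument.
out, code = run_argv([RbConfig.ruby, "-e", "print ARGV[0]", "my file.pdf"])
# out  => "my file.pdf"
# code => 0
```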

#remove_pdf_password(filename) ⇒ Object



# File 'lib/slaw/extract/extractor.rb', line 111

def remove_pdf_password(filename)
  file = Tempfile.new('steno')
  begin
    logger.info("Trying to remove password from #{filename}")
    # Build the command as an array so that paths containing spaces stay intact.
    cmd = ["gs", "-q", "-dNOPAUSE", "-dBATCH", "-sDEVICE=pdfwrite",
           "-sOutputFile=#{file.path}", "-c", ".setpdfwrite", "-f", filename]
    logger.info("Executing: #{cmd}")
    Open3.capture2(*cmd)
    FileUtils.move(file.path, filename)
  ensure
    file.close
    file.unlink
  end
end
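
The method above replaces the original file whether or not Ghostscript succeeded. One possible variation, sketched here with a hypothetical `rewrite_in_place` helper, only overwrites the original when the block reports success. It is an illustration of the pattern, not Slaw's behaviour.

```ruby
require "tempfile"
require "fileutils"

# Hypothetical helper: yield a temporary path to the block; if the block
# returns a truthy value, copy the temporary file over +filename+.
# Returns true when the original was replaced, false otherwise.
def rewrite_in_place(filename)
  tmp = Tempfile.new("rewrite")
  begin
    ok = yield(tmp.path)
    FileUtils.cp(tmp.path, filename) if ok
    ok ? true : false
  ensure
    # Always clean up the temporary file, even if the block raised.
    tmp.close
    tmp.unlink
  end
end
```

With this helper, a caller could run the external command inside the block and return `status.success?`, leaving the original PDF untouched on failure.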