Class: Slaw::Extract::Extractor

Inherits: Object

Includes: Logging

Defined in: lib/slaw/extract/extractor.rb

Overview

Routines for extracting and cleaning up content from other formats, such as PDF.

You may need to set the location of the `pdftotext` binary.

On Mac OS X, use `brew install xpdf` or download from www.foolabs.com/xpdf/download.html

On Heroku, you'll need to do some hoop jumping; see theprogrammingbutler.com/blog/archives/2011/07/28/running-pdftotext-on-heroku/

Constant Summary

@@pdftotext_path = "pdftotext"

Class Method Summary

.pdftotext_path ⇒ Object

Get location of the pdftotext executable for all instances.
.pdftotext_path=(val) ⇒ Object

Set location of the pdftotext executable for all instances.

Instance Method Summary

#extract_from_file(filename) ⇒ String

Extract text from a file.
#extract_from_html(filename) ⇒ Object
#extract_from_pdf(filename) ⇒ String

Extract text from a PDF.
#extract_from_text(filename) ⇒ Object
#extract_via_tika(filename) ⇒ Object

Extract text from filename by sending it to Apache Tika (tika.apache.org).
#get_mimetype(filename) ⇒ Object
#html_to_text(html) ⇒ Object
#pdf_to_text_cmd(filename) ⇒ Array<String>

Build a command for the external PDF-to-text utility.
#remove_pdf_password(filename) ⇒ Object

Methods included from Logging

#logger

Class Method Details

.pdftotext_path ⇒ `Object`

Get location of the pdftotext executable for all instances.



# File 'lib/slaw/extract/extractor.rb', line 131

def self.pdftotext_path
  @@pdftotext_path
end

.pdftotext_path=(val) ⇒ `Object`

Set location of the pdftotext executable for all instances.



# File 'lib/slaw/extract/extractor.rb', line 136

def self.pdftotext_path=(val)
  @@pdftotext_path = val
end
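
The two accessors above share one class-level setting, so changing the path affects every extractor instance. The following is a minimal sketch of that behaviour, not the library itself: `SketchExtractor` is a hypothetical stand-in that copies the documented accessors and `pdf_to_text_cmd` so the example runs without the slaw gem, and the install path is an assumed location.

```ruby
# Hypothetical stand-in mirroring the documented accessors; in a real
# project you would use Slaw::Extract::Extractor instead.
class SketchExtractor
  @@pdftotext_path = "pdftotext"

  def self.pdftotext_path
    @@pdftotext_path
  end

  def self.pdftotext_path=(val)
    @@pdftotext_path = val
  end

  # Same shape as the documented #pdf_to_text_cmd.
  def pdf_to_text_cmd(filename)
    [SketchExtractor.pdftotext_path, "-enc", "UTF-8", filename, "-"]
  end
end

# By default the binary is looked up on the PATH.
default_cmd = SketchExtractor.new.pdf_to_text_cmd("act.pdf")

# Point all instances at an explicit binary (assumed location).
SketchExtractor.pdftotext_path = "/opt/xpdf/bin/pdftotext"
custom_cmd = SketchExtractor.new.pdf_to_text_cmd("act.pdf")
```

Because the value lives in a class variable, it only needs to be set once, for example in an application initializer.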

Instance Method Details

#extract_from_file(filename) ⇒ `String`

Extract text from a file.

Parameters:

filename (String) —

filename to extract from

Returns:

(String) —

extracted text

# File 'lib/slaw/extract/extractor.rb', line 25

def extract_from_file(filename)
  mimetype = get_mimetype(filename)

  case mimetype && mimetype.type
  when 'application/pdf'
    extract_from_pdf(filename)
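  when 'text/html', nil
    # an undetected (nil) mimetype is treated as HTML
    extract_from_html(filename)
  when 'text/plain'
    extract_from_text(filename)
  else
    text = extract_via_tika(filename)
    # check nil first: calling empty? on nil would raise NoMethodError
    if text.nil? || text.empty?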
      raise ArgumentError.new("Unsupported file type #{mimetype || 'unknown'}")
    end
    text
  end
end
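
The dispatch in `extract_from_file` relies on Ruby's `case` taking the first matching `when` branch. The sketch below isolates that routing so it can be checked without any files; `route_for` and the returned symbols are hypothetical names used only for illustration.

```ruby
# Hypothetical helper: maps a detected mimetype string (or nil) to the
# extraction route that the method above would take.
def route_for(type)
  case type
  when 'application/pdf'
    :pdf
  when 'text/html', nil
    # nil (no type detected) lands here, because this is the first
    # branch that lists it
    :html
  when 'text/plain'
    :text
  else
    # anything else is handed to Tika
    :tika
  end
end

route_for(nil)                  # undetected type
route_for('application/msword') # unknown type, falls through to Tika
```

In practice this means a file whose type cannot be detected is parsed as HTML rather than read as plain text.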

#extract_from_html(filename) ⇒ `Object`



# File 'lib/slaw/extract/extractor.rb', line 83

def extract_from_html(filename)
  html_to_text(File.read(filename))
end

#extract_from_pdf(filename) ⇒ `String`

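Extract text from a PDF. Returns nil if the command exits with a non-zero status, including when exit status 3 recurs after one attempt to remove the PDF password.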

Parameters:

filename (String) —

filename to extract from

Returns:

(String) —

extracted text

# File 'lib/slaw/extract/extractor.rb', line 49

def extract_from_pdf(filename)
  retried = false

  while true
    cmd = pdf_to_text_cmd(filename)
    logger.info("Executing: #{cmd}")
    stdout, status = Open3.capture2(*cmd)

    case status.exitstatus
    when 0
      return stdout
    when 3
      return nil if retried
      retried = true
      self.remove_pdf_password(filename)
    else
      return nil
    end
  end
end
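
The retry logic above can be read as: run the command, and if it exits with status 3, try once to unlock the file and run again. Here is a minimal sketch of that control flow with the external command replaced by lambdas so it runs anywhere; `extract_with_retry`, `run` and `unlock` are hypothetical names, and the reading of status 3 as a permissions problem is an assumption based on the pdftotext exit codes.

```ruby
# run:    callable returning [stdout, exit_code]
# unlock: callable that tries to make the file readable
def extract_with_retry(run, unlock)
  retried = false

  loop do
    stdout, code = run.call

    case code
    when 0
      return stdout          # success
    when 3
      return nil if retried  # already tried unlocking once, give up
      retried = true
      unlock.call
    else
      return nil             # any other failure is not retried
    end
  end
end

# Simulate a file that fails once with status 3 and then succeeds.
codes = [3, 0]
unlock_calls = 0
text = extract_with_retry(-> { ["some text", codes.shift] },
                          -> { unlock_calls += 1 })
```

The `retried` flag bounds the loop to at most two runs of the command, so a file that can never be unlocked yields nil instead of looping forever.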

#extract_from_text(filename) ⇒ `Object`



# File 'lib/slaw/extract/extractor.rb', line 79

def extract_from_text(filename)
  File.read(filename)
end

#extract_via_tika(filename) ⇒ `Object`

Extract text from filename by sending it to Apache Tika (tika.apache.org).

# File 'lib/slaw/extract/extractor.rb', line 89

def extract_via_tika(filename)
  # the Yomu gem falls over when trying to write large amounts of data
  # to the JVM's stdin, so we manually call java ourselves, relying on yomu
  # to supply the gem
  require 'slaw/extract/yomu_patch'
  logger.info("Using Tika to get text from #{filename}. You need a JVM installed for this.")

  html = Yomu.text_from_file(filename)
  logger.info("Tika returned #{html.length} bytes")
  # transform html into text
  html_to_text(html)
end

#get_mimetype(filename) ⇒ `Object`

# File 'lib/slaw/extract/extractor.rb', line 125

def get_mimetype(filename)
  File.open(filename) { |f| MimeMagic.by_magic(f) } \
    || MimeMagic.by_path(filename)
end
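
`get_mimetype` tries content-based detection first and falls back to the file name only when that yields nothing. The sketch below illustrates the same "magic bytes, then extension" idea in plain Ruby. It is a deliberately tiny stand-in, not MimeMagic: `sniff_type` is a hypothetical helper that recognises only the `%PDF-` signature and a few extensions.

```ruby
require 'tempfile'

# Hypothetical, simplified detector: content first, then file extension.
def sniff_type(path)
  # Read the first bytes in binary mode; read returns nil for an empty file.
  magic = File.open(path, 'rb') { |f| f.read(5) }
  return 'application/pdf' if magic == '%PDF-'

  # Fall back to the extension; returns nil when nothing matches.
  case File.extname(path).downcase
  when '.html', '.htm' then 'text/html'
  when '.txt'          then 'text/plain'
  end
end

# A PDF with a misleading extension is still detected by its content.
file = Tempfile.new(['act', '.txt'])
file.write("%PDF-1.4\n")
file.flush
detected = sniff_type(file.path)
file.close
file.unlink
```

Checking content before the name means a wrongly named upload is still routed to the right extractor.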

#html_to_text(html) ⇒ `Object`

# File 'lib/slaw/extract/extractor.rb', line 102

def html_to_text(html)
  here = File.dirname(__FILE__)
  xslt = Nokogiri::XSLT(File.open(File.join([here, 'html_to_akn_text.xsl'])))

  text = xslt.transform(Nokogiri::HTML(html)).to_s
  # remove XML encoding at top
  text.sub(/^<\?xml [^>]*>/, '')
end
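
The final step of `html_to_text` strips the XML declaration that the serialised result may start with. The following sketch applies the same regular expression to a literal string so its effect can be seen without Nokogiri; `strip_xml_declaration` is a hypothetical helper name.

```ruby
# Remove a leading XML declaration such as <?xml version="1.0"?>.
# Only the first match is removed, because sub (not gsub) is used.
def strip_xml_declaration(text)
  text.sub(/^<\?xml [^>]*>/, '')
end

sample = %(<?xml version="1.0"?>\nBODY\n)
stripped = strip_xml_declaration(sample)
# The newline that followed the declaration is kept, so the result
# begins with a blank line.
```

Input without a declaration passes through unchanged, since `sub` returns a copy of the original string when nothing matches.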

#pdf_to_text_cmd(filename) ⇒ `Array<String>`

Build a command for the external PDF-to-text utility.

Parameters:

filename (String) —

the pdf file

Returns:

(Array<String>) —

command and params to execute



# File 'lib/slaw/extract/extractor.rb', line 75

def pdf_to_text_cmd(filename)
  [Extractor.pdftotext_path, "-enc", "UTF-8", filename, "-"]
end
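
Returning the command as an array matters because `Open3.capture2(*cmd)` then runs the program directly, without a shell, and each element arrives as exactly one argument. The sketch below shows this with the current Ruby interpreter standing in for `pdftotext`, so it runs on a machine where pdftotext is not installed; `run_argv` is a hypothetical helper.

```ruby
require 'open3'
require 'rbconfig'

# Run an argv-style command and return [stdout, exit status].
def run_argv(cmd)
  stdout, status = Open3.capture2(*cmd)
  [stdout, status.exitstatus]
end

# The child process prints its first argument. The file name contains a
# space but still arrives as a single argument, with no quoting needed.
cmd = [RbConfig.ruby, "-e", "print ARGV[0]", "my act.pdf"]
out, code = run_argv(cmd)
```

The same exit status is what `extract_from_pdf` inspects to decide between returning the text, retrying, or giving up.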

#remove_pdf_password(filename) ⇒ `Object`

# File 'lib/slaw/extract/extractor.rb', line 111

def remove_pdf_password(filename)
  file = Tempfile.new('steno')
  begin
    logger.info("Trying to remove password from #{filename}")
    # build the argument list as an array so that paths containing spaces stay intact
    cmd = %w(gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite) +
          ["-sOutputFile=#{file.path}", "-c", ".setpdfwrite", "-f", filename]
    logger.info("Executing: #{cmd}")
    Open3.capture2(*cmd)
    FileUtils.move(file.path, filename)
  ensure
    file.close
    file.unlink
  end
end
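
`remove_pdf_password` follows a write-to-a-temporary-file-then-replace pattern, with `ensure` guaranteeing cleanup even if the external command raises. A minimal sketch of that pattern follows, with the Ghostscript call replaced by a block that transforms the file contents; `replace_contents` is a hypothetical helper, and it copies the temporary file over the target where the method above moves it.

```ruby
require 'tempfile'
require 'fileutils'

# Rewrite filename with the result of the block, going through a
# temporary file so the original is only replaced once the new
# contents have been written successfully.
def replace_contents(filename)
  tmp = Tempfile.new('sketch')
  begin
    tmp.write(yield(File.read(filename)))
    tmp.flush
    FileUtils.cp(tmp.path, filename)
  ensure
    # Runs on success and on error, so no temporary file is left behind.
    tmp.close
    tmp.unlink
  end
end

target = Tempfile.new('target')
target.write("locked")
target.flush
replace_contents(target.path) { |contents| contents.upcase }
```

If the block raises, the copy is never reached and the original file is left untouched.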
