Class: RDig::ContentExtractors::WordContentExtractor

Inherits:
ContentExtractor
Includes:
ExternalAppHelper
Defined in:
lib/rdig/content_extractors/doc.rb

Overview

Extracts text from Microsoft Word documents.

Requires the wvHtml utility (on Debian and related distributions, run ‘apt-get install wv’).

Instance Method Summary

Methods included from ExternalAppHelper

#as_file, #can_do

Methods inherited from ContentExtractor

#can_do, extractor_instances, extractors, inherited, process

Constructor Details

#initialize(config) ⇒ WordContentExtractor

Returns a new instance of WordContentExtractor.



# File 'lib/rdig/content_extractors/doc.rb', line 11

def initialize(config)
  super(config)
  @wvhtml = 'wvHtml'
  @pattern = /^application\/msword/
  # html extractor for parsing wvHtml output
  @html_extractor = HpricotContentExtractor.new(OpenStruct.new(
      :hpricot => OpenStruct.new(
        :content_tag_selector => 'body',
        :title_tag_selector   => 'title'
      )))
  # Probe for wvHtml by matching a string expected in its help output.
  # TODO: better: if $?.exitstatus == 127 (not found)
  @available = %x{#{@wvhtml} -h 2>&1} =~ /Dom Lachowicz/
end
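
The TODO in the constructor suggests checking the exit status rather than matching the help text. A minimal sketch of that idea follows, assuming the command is run through a POSIX shell (which reports status 127 for a command it cannot find). `external_app_available?` is a hypothetical helper, not part of RDig.

```ruby
require 'shellwords'

# Hypothetical helper (not part of RDig): reports whether +command+
# can be found by the shell.
def external_app_available?(command)
  # The 2>&1 redirect makes Ruby run the command via the shell, so a
  # missing binary yields exit status 127 instead of raising Errno::ENOENT.
  %x{#{Shellwords.escape(command)} -h 2>&1}
  $?.exitstatus != 127
end
```

With such a helper the constructor could set `@available = external_app_available?(@wvhtml)`. This only tells whether some command of that name exists, not that it is the expected wvHtml, which is the trade-off against the help-text match.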

Instance Method Details

#process(content) ⇒ Object



# File 'lib/rdig/content_extractors/doc.rb', line 25

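def process(content)
  result = {}
  # as_file (from ExternalAppHelper) yields a file object for the content
  as_file(content) do |file|
    # capture the HTML that wvHtml writes to stdout and hand it to the
    # HTML extractor configured in the constructor
    result = @html_extractor.process(%x{#{@wvhtml} --charset=UTF-8 '#{file.path}' -})
  end
  result || {}
end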
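
The command in `#process` is built by string interpolation and relies on single quotes around the file path, which would break if the path itself contained a single quote. One possible alternative, sketched here as an assumption rather than as RDig's approach, is to pass the arguments as an array so that no shell is involved. `capture_stdout` is a hypothetical helper, and `cat` stands in for wvHtml so the example runs without wv installed.

```ruby
require 'tempfile'

# Hypothetical helper (not part of RDig): runs argv without a shell and
# returns whatever the command writes to stdout.
def capture_stdout(*argv)
  IO.popen(argv, &:read)
end

# Stand-in demonstration with `cat`; with wv installed the call would be
# capture_stdout('wvHtml', '--charset=UTF-8', file.path, '-').
Tempfile.create(["it's", '.doc']) do |file|
  file.write('hello')
  file.flush
  puts capture_stdout('cat', file.path)
end
```

Because each argument reaches the program unchanged, the apostrophe in the temporary file name needs no escaping.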