Class: RDig::ContentExtractors::WordContentExtractor
- Inherits:
- ContentExtractor
- Hierarchy: Object > ContentExtractor > RDig::ContentExtractors::WordContentExtractor
- Includes:
- ExternalAppHelper
- Defined in:
- lib/rdig/content_extractors/doc.rb
Overview
Extracts text from Microsoft Word documents.
Requires the wvHtml utility (on Debian and its derivatives, install it with ‘apt-get install wv’).
Instance Method Summary
- #initialize(config) ⇒ WordContentExtractor
constructor
A new instance of WordContentExtractor.
- #process(content) ⇒ Object
Methods included from ExternalAppHelper
Methods inherited from ContentExtractor
#can_do, extractor_instances, extractors, inherited, process
Constructor Details
#initialize(config) ⇒ WordContentExtractor
Returns a new instance of WordContentExtractor.
# File 'lib/rdig/content_extractors/doc.rb', line 11

def initialize(config)
  super(config)
  @wvhtml = 'wvHtml'
  @pattern = /^application\/msword/
  # html extractor for parsing wvHtml output
  @html_extractor = HpricotContentExtractor.new(OpenStruct.new(
      :hpricot => OpenStruct.new(
        :content_tag_selector => 'body',
        :title_tag_selector => 'title'
        )))
  # TODO: better: if $?.exitstatus == 127 (not found)
  @available = %x{#{@wvhtml} -h 2>&1} =~ /Dom Lachowicz/
end
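The TODO in the constructor suggests checking the exit status instead of matching the help text. A minimal sketch of that idea follows, assuming a POSIX shell, where exit status 127 means "command not found". The helper name `command_available?` is hypothetical and not part of RDig.

```ruby
require 'shellwords'

# Hypothetical helper, not part of RDig: reports whether the shell can find
# the given command. Assumes a POSIX shell, which exits with status 127 when
# a command is not found.
def command_available?(command)
  # The redirections force execution through the shell, so a missing binary
  # produces exit status 127 rather than a Ruby-level failure.
  system("#{Shellwords.escape(command)} -h >/dev/null 2>&1")
  # Only 127 is treated as "missing"; many tools exit non-zero on -h even
  # though they are installed.
  $?.exitstatus != 127
end
```

Under these assumptions, `@available = command_available?(@wvhtml)` would no longer depend on the author's name appearing in the help output.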
Instance Method Details
#process(content) ⇒ Object
# File 'lib/rdig/content_extractors/doc.rb', line 25

def process(content)
  result = {}
  as_file(content) do |file|
    result = @html_extractor.process(%x{#{@wvhtml} --charset=UTF-8 '#{file.path}' -})
  end
  return result || {}
end
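The pattern used by #process (write the content to a file, run an external converter on it, collect the converter's standard output) can be sketched in isolation. The names `with_temp_file` and `convert_with` below are hypothetical. `with_temp_file` only approximates what `as_file` from ExternalAppHelper appears to do, since its implementation is not shown here.

```ruby
require 'tempfile'
require 'open3'

# Hypothetical stand-in for ExternalAppHelper#as_file (assumed behavior):
# writes the content to a temporary file, yields the file, and removes it
# once the block returns.
def with_temp_file(content)
  Tempfile.create('rdig-sketch') do |file|
    file.binmode
    file.write(content)
    file.flush
    yield file
  end
end

# Runs an external converter on the content and returns its standard output,
# or nil if the converter exits with a non-zero status.
# The command is passed as an argument list, so the file path never goes
# through shell quoting.
def convert_with(command, content, trailing_args = [])
  with_temp_file(content) do |file|
    output, status = Open3.capture2(*command, file.path, *trailing_args)
    status.success? ? output : nil
  end
end
```

With wvHtml installed, the call mirroring the command line in #process would presumably be `convert_with(['wvHtml', '--charset=UTF-8'], doc_bytes, ['-'])`, with the resulting HTML then handed to the HTML extractor.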