Class: ChupaText::Extractor

Inherits:
Object
Includes:
Loggable
Defined in:
lib/chupa-text/extractor.rb

Instance Method Summary

Constructor Details

#initialize ⇒ Extractor

Returns a new instance of Extractor.



# File 'lib/chupa-text/extractor.rb', line 24

def initialize
  @decomposers = []
end

Instance Method Details

#add_decomposer(decomposer) ⇒ Object



# File 'lib/chupa-text/extractor.rb', line 43

def add_decomposer(decomposer)
  @decomposers << decomposer
end

#apply_configuration(configuration) ⇒ void

This method returns an undefined value.

Sets up the extractor according to the configuration by adding every decomposer that the configuration enables.

Parameters:

  • configuration (Configuration)

    The configuration to be applied.



# File 'lib/chupa-text/extractor.rb', line 35

def apply_configuration(configuration)
  decomposers = Decomposers.create(Decomposer.registry,
                                   configuration.decomposer)
  decomposers.each do |decomposer|
    add_decomposer(decomposer)
  end
end
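
The registration step above can be illustrated with a small standalone sketch. It does not use chupa-text itself: SketchExtractor, the three placeholder decomposer classes, SKETCH_REGISTRY, and the name-list form of the configuration are all hypothetical stand-ins, and the real Decomposers.create may resolve enabled decomposers differently.

```ruby
# Standalone sketch; none of these classes come from chupa-text.
class SketchExtractor
  attr_reader :decomposers

  def initialize
    @decomposers = []
  end

  def add_decomposer(decomposer)
    @decomposers << decomposer
  end

  # registry: Hash of name => decomposer class (assumed shape).
  # enabled_names: Array of names the configuration switches on (assumed shape).
  def apply_configuration(registry, enabled_names)
    enabled_names.each do |name|
      decomposer_class = registry[name]
      # This sketch silently skips names that are not registered.
      next if decomposer_class.nil?
      add_decomposer(decomposer_class.new)
    end
  end
end

# Placeholder decomposers; they only need to be instantiable here.
class TarSketchDecomposer; end
class GzipSketchDecomposer; end
class PdfSketchDecomposer; end

SKETCH_REGISTRY = {
  "tar" => TarSketchDecomposer,
  "gzip" => GzipSketchDecomposer,
  "pdf" => PdfSketchDecomposer,
}.freeze

extractor = SketchExtractor.new
extractor.apply_configuration(SKETCH_REGISTRY, ["tar", "pdf"])
puts extractor.decomposers.map { |decomposer| decomposer.class.name }.inspect
# prints ["TarSketchDecomposer", "PdfSketchDecomposer"]
```

Only the enabled names produce decomposer instances, and they are registered in the order the configuration lists them.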

#extract(input) {|text_data| ... } ⇒ void

This method returns an undefined value.

Extracts text from input. Each extracted text is passed to the given block.

Parameters:

  • input (Data, String)

    The input to extract text from. If input is a String, it is treated as the local file path or URI of the input data.

Yields:

  • (text_data)

    Gives extracted text data to the block. The block may be called zero or more times.

Yield Parameters:

  • text_data (Data)

    The extracted text data. The text itself is available as text_data.body.



# File 'lib/chupa-text/extractor.rb', line 60

def extract(input)
  targets = [ensure_data(input)]
  until targets.empty?
    target = targets.shift
    debug do
      "#{log_tag}[extract][target] <#{target.uri}>:<#{target.mime_type}>"
    end
    decomposer = find_decomposer(target)
    if decomposer.nil?
      if target.text_plain?
        debug {"#{log_tag}[extract][text-plain]"}
        yield(target)
        next
      else
        debug {"#{log_tag}[extract][decomposer] not found"}
        yield(target) if target.text?
        next
      end
    end
    debug {"#{log_tag}[extract][decomposer] #{decomposer.class}"}
    decomposer.decompose(target) do |decomposed|
      debug do
        "#{log_tag}[extract][decomposed] " +
          "#{decomposer.class}: " +
          "<#{target.uri}>: " +
          "<#{target.mime_type}> -> <#{decomposed.mime_type}>"
      end
      targets.push(decomposed)
    end
  end
end
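
The loop above treats targets as a work queue: decomposed data is pushed to the back and examined again until nothing can be broken down further. The following standalone sketch imitates that behavior under stated assumptions. SketchData, ArchiveSketchDecomposer, and QueueSketchExtractor are hypothetical stand-ins, and the target? predicate is an assumed way of matching a decomposer to data, since find_decomposer is not shown on this page.

```ruby
# Stand-in for the data objects passed through the extractor (hypothetical).
SketchData = Struct.new(:uri, :mime_type, :body, :children) do
  def text?
    mime_type.start_with?("text/")
  end
end

# Hypothetical decomposer that unpacks a made-up archive type into its entries.
class ArchiveSketchDecomposer
  def target?(data)
    data.mime_type == "application/x-sketch-archive"
  end

  def decompose(data)
    data.children.each { |child| yield(child) }
  end
end

class QueueSketchExtractor
  def initialize
    @decomposers = []
  end

  def add_decomposer(decomposer)
    @decomposers << decomposer
  end

  def extract(input)
    targets = [input]
    until targets.empty?
      target = targets.shift
      decomposer = @decomposers.find { |candidate| candidate.target?(target) }
      if decomposer.nil?
        # Nothing can break this down further: yield it if it is text, else drop it.
        yield(target) if target.text?
        next
      end
      # Decomposed pieces go to the back of the queue and are examined later.
      decomposer.decompose(target) { |decomposed| targets.push(decomposed) }
    end
  end
end

inner = SketchData.new("inner.arc", "application/x-sketch-archive", nil,
                       [SketchData.new("b.txt", "text/plain", "beta", [])])
outer = SketchData.new("outer.arc", "application/x-sketch-archive", nil,
                       [SketchData.new("a.txt", "text/plain", "alpha", []),
                        inner,
                        SketchData.new("logo.png", "image/png", nil, [])])

extractor = QueueSketchExtractor.new
extractor.add_decomposer(ArchiveSketchDecomposer.new)
bodies = []
extractor.extract(outer) { |text_data| bodies << text_data.body }
puts bodies.inspect
# prints ["alpha", "beta"]
```

The nested archive is handled without recursion because its entries re-enter the queue, and the image is dropped because no decomposer accepts it and it is not text, so the block runs only for the two text entries.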