Class: TermExtractor

Inherits:
Object
Defined in:
lib/term-extractor.rb,
lib/term-extractor/nlp.rb

Overview

A class for extracting useful snippets of text (terms) from a document.

Defined Under Namespace

Classes: NLP, TermContext

Constructor Details

#initialize(models = File.dirname(__FILE__) + "/../models") {|_self| ... } ⇒ TermExtractor

Returns a new instance of TermExtractor.

Yields:

  • (_self)

Yield Parameters:

  • _self (TermExtractor)

    the object that the method was called on



# File 'lib/term-extractor.rb', line 20

def initialize(models = File.dirname(__FILE__) + "/../models")
  @nlp = NLP.new(models)

  # Empirically, terms longer than about 5 words seem to be either
  # too specific to be useful or very noisy.
  @max_term_length = 4 


  self.remove_urls = true
  self.remove_paths = true

  yield self if block_given?
end
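
The constructor yields the new instance to an optional block, so settings can be overridden at construction time. The sketch below illustrates that pattern with a hypothetical stand-in class, `ExtractorSketch`, which skips loading the NLP models so that it runs on its own. It assumes the attributes have writers (the constructor's use of `self.remove_urls = true` suggests so, but a writer for `max_term_length` is an assumption).

```ruby
# Hypothetical stand-in for TermExtractor: same defaults and the same
# "yield self" hook, but without NLP.new(models), so it runs in isolation.
class ExtractorSketch
  # Assumption: all three settings are readable and writable.
  attr_accessor :max_term_length, :remove_urls, :remove_paths

  def initialize
    @max_term_length = 4
    self.remove_urls = true
    self.remove_paths = true

    # Hand the half-configured object to the caller, if a block was given.
    yield self if block_given?
  end
end

# Without a block, the defaults stay in place.
default_extractor = ExtractorSketch.new

# With a block, the caller overrides settings before the object is returned.
custom_extractor = ExtractorSketch.new do |te|
  te.max_term_length = 3
  te.remove_urls = false
end
```

With the real class, the equivalent call would presumably look like `TermExtractor.new { |te| te.remove_urls = false }`, provided the model files are available at the default path.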

Instance Attribute Details

#max_term_length ⇒ Object

Returns the value of attribute max_term_length.



# File 'lib/term-extractor.rb', line 18

def max_term_length
  @max_term_length
end

#nlp ⇒ Object

Returns the value of attribute nlp.



# File 'lib/term-extractor.rb', line 18

def nlp
  @nlp
end

#remove_paths ⇒ Object

Returns the value of attribute remove_paths.



# File 'lib/term-extractor.rb', line 18

def remove_paths
  @remove_paths
end

#remove_urls ⇒ Object

Returns the value of attribute remove_urls.



# File 'lib/term-extractor.rb', line 18

def remove_urls
  @remove_urls
end

Class Method Details

.allowed_term?(p) ⇒ Boolean

Final post-filter on terms. A term is rejected if it contains no ASCII letters or is longer than 255 characters.

Returns:

  • (Boolean)


# File 'lib/term-extractor.rb', line 229

def self.allowed_term?(p)
  return false if p.to_s =~ /^[^a-zA-Z]*$/ # Reject terms with no ASCII letters (bare numbers, punctuation, empty strings)
  return false if p.to_s.length > 255
  true
end
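
To see what the filter accepts and rejects, the method body can be copied into a standalone function and called with a few inputs. This is only a sketch for experimentation: the top-level `allowed_term?` below mirrors the listing above, and the sample strings are made up.

```ruby
# Standalone copy of the filter logic from the listing above.
def allowed_term?(p)
  return false if p.to_s =~ /^[^a-zA-Z]*$/ # no ASCII letters at all
  return false if p.to_s.length > 255      # too long to be a useful term
  true
end

allowed_term?("term extraction") # => true
allowed_term?("2024")            # => false (digits only)
allowed_term?("--")              # => false (punctuation only)
allowed_term?("")                # => false (empty string matches the regex)
allowed_term?("a" * 255)         # => true  (exactly at the length limit)
allowed_term?("a" * 256)         # => false (one character over the limit)
```

Because the argument goes through `to_s`, any object with a meaningful string form can be passed, not only strings.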

.recombobulate_term(term) ⇒ Object

Take a sequence of tokens and turn them back into a term.



# File 'lib/term-extractor.rb', line 236

def self.recombobulate_term(term)
  term = term.join(" ")
  term.gsub!(/ '/, "'")
  term.gsub!(/ \./, ".")
  term
end
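
The effect of the two substitutions is easiest to see on concrete token lists. The sketch below copies the method body into a standalone function; the token shapes (a separate `'s` token, a separate `.` token) are an assumption about what the tokenizer produces, not something this page confirms.

```ruby
# Standalone copy of the rejoining logic from the listing above.
def recombobulate_term(term)
  term = term.join(" ")   # "the user 's guide"
  term.gsub!(/ '/, "'")   # reattach apostrophe tokens: "user 's" -> "user's"
  term.gsub!(/ \./, ".")  # reattach periods: "inc ." -> "inc."
  term
end

recombobulate_term(["the", "user", "'s", "guide"]) # => "the user's guide"
recombobulate_term(["acme", "inc", "."])           # => "acme inc."
recombobulate_term(["node", ".", "js"])            # => "node. js"
```

The last call shows a limitation: only the space before a period is removed, so a period that originally sat inside a word is not fully restored.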

Instance Method Details

#extract_terms_from_sentence(sentence) ⇒ Object

Extract all terms in a given sentence.



# File 'lib/term-extractor.rb', line 211

def extract_terms_from_sentence(sentence)
  TermContext.new(self, sentence).terms
end

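#extract_terms_from_text(text) ⇒ Object

Extract all terms from every sentence of a text. With a block, each term is yielded after its sentence index has been set; without a block, the terms are collected and returned as an array.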


# File 'lib/term-extractor.rb', line 215

def extract_terms_from_text(text)
  if block_given?
    nlp.sentences(text).each_with_index do |s, i|
      terms = extract_terms_from_sentence(s)
      terms.each{|p| p.sentence = i; yield(p) }
    end
  else
    results = []
    extract_terms_from_text(text){ |p| results << p }
    results
  end
end
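
The method supports two calling styles: streaming terms to a block, or collecting them into an array by calling itself with a block. The sketch below reproduces that control flow with hypothetical stand-ins so it can run without the NLP models: `Term` is a made-up struct, `sentences` is a naive splitter, and `extract_terms_from_sentence` simply treats words longer than four letters as terms. None of these reflect the real extraction logic.

```ruby
# Hypothetical term object: the real one only needs a writable #sentence here.
Term = Struct.new(:text, :sentence)

class SketchExtractor
  # Naive stand-in for nlp.sentences: split after ., ! or ? plus whitespace.
  def sentences(text)
    text.split(/(?<=[.!?])\s+/)
  end

  # Stand-in for TermContext: every word longer than four letters is a "term".
  def extract_terms_from_sentence(sentence)
    sentence.scan(/[A-Za-z]+/).select { |w| w.length > 4 }.map { |w| Term.new(w) }
  end

  # Same control flow as the listing above.
  def extract_terms_from_text(text)
    if block_given?
      sentences(text).each_with_index do |s, i|
        extract_terms_from_sentence(s).each { |t| t.sentence = i; yield(t) }
      end
    else
      results = []
      extract_terms_from_text(text) { |t| results << t }
      results
    end
  end
end

extractor = SketchExtractor.new
text = "Ruby parses documents. Tokens become terms."

# Array form: all terms, each tagged with its zero-based sentence index.
collected = extractor.extract_terms_from_text(text)

# Block form: terms are handed over one at a time as they are found.
streamed = []
extractor.extract_terms_from_text(text) { |t| streamed << [t.text, t.sentence] }
```

The block form avoids building an intermediate array, which may matter for long documents; the array form is a thin wrapper over it.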