Class: TermExtractor

Inherits: Object

Defined in: lib/term-extractor.rb, lib/term-extractor/nlp.rb

Overview

A class for extracting useful snippets of text (terms) from a document.

Defined Under Namespace

Classes: NLP, TermContext

Instance Attribute Summary

#max_term_length ⇒ Object

Returns the value of attribute max_term_length.
#nlp ⇒ Object

Returns the value of attribute nlp.
#remove_paths ⇒ Object

Returns the value of attribute remove_paths.
#remove_urls ⇒ Object

Returns the value of attribute remove_urls.

Class Method Summary

.allowed_term?(p) ⇒ Boolean

Final post filter on terms to determine if they’re allowed.
.recombobulate_term(term) ⇒ Object

Take a sequence of tokens and turn them back into a term.

Instance Method Summary

#extract_terms_from_sentence(sentence) ⇒ Object

Extract all terms in a given sentence.
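#extract_terms_from_text(text) ⇒ Object

Extract all terms in a given text, yielding each term to a block if one is given and returning them as an array otherwise.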
#initialize(models = File.dirname(__FILE__) + "/../models") {|_self| ... } ⇒ TermExtractor constructor

A new instance of TermExtractor.

Constructor Details

#initialize(models = File.dirname(__FILE__) + "/../models") {|_self| ... } ⇒ `TermExtractor`

Returns a new instance of TermExtractor.

Yields:

(_self)

Yield Parameters:

_self (TermExtractor) —

the object that the method was called on

# File 'lib/term-extractor.rb', line 20

def initialize(models = File.dirname(__FILE__) + "/../models")
  @nlp = NLP.new(models)

  # Empirically, terms longer than about 5 words seem to be either
  # too specific to be useful or very noisy.
  @max_term_length = 4

  self.remove_urls = true
  self.remove_paths = true

  yield self if block_given?
end
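
The optional block lets a caller adjust settings at construction time. The sketch below reproduces only that yield-self pattern in a stand-in class, because the real constructor also loads NLP models. `ConfigurableSketch` is a hypothetical name and is not part of the library.

```ruby
# Hypothetical stand-in for TermExtractor: same defaults and block handling,
# but without the NLP model loading done by the real constructor.
class ConfigurableSketch
  attr_accessor :max_term_length, :remove_urls, :remove_paths

  def initialize
    @max_term_length = 4
    self.remove_urls = true
    self.remove_paths = true

    # Hand the new object to the caller's block, if any, for configuration.
    yield self if block_given?
  end
end

default = ConfigurableSketch.new
custom = ConfigurableSketch.new do |extractor|
  extractor.max_term_length = 3
  extractor.remove_urls = false
end

puts default.max_term_length
puts custom.max_term_length
puts custom.remove_urls
```

With the real class, the equivalent call would presumably look like `TermExtractor.new { |te| te.remove_urls = false }`, assuming the model files are available at the default path.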

Instance Attribute Details

#max_term_length ⇒ `Object`

Returns the value of attribute max_term_length.

# File 'lib/term-extractor.rb', line 18

def max_term_length
  @max_term_length
end

#nlp ⇒ `Object`

Returns the value of attribute nlp.

# File 'lib/term-extractor.rb', line 18

def nlp
  @nlp
end

#remove_paths ⇒ `Object`

Returns the value of attribute remove_paths.

# File 'lib/term-extractor.rb', line 18

def remove_paths
  @remove_paths
end

#remove_urls ⇒ `Object`

Returns the value of attribute remove_urls.

# File 'lib/term-extractor.rb', line 18

def remove_urls
  @remove_urls
end

Class Method Details

.allowed_term?(p) ⇒ `Boolean`

Final post filter on terms to determine if they’re allowed.

Returns:

(Boolean)

# File 'lib/term-extractor.rb', line 229

def self.allowed_term?(p)
  return false if p.to_s =~ /^[^a-zA-Z]*$/ # Reject terms with no ASCII letters, e.g. pure numbers or punctuation
  return false if p.to_s.length > 255
  true
end
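
To make the filter's behaviour concrete, here is a small sketch that copies the two checks into a standalone module and runs them on a few sample strings. `FilterSketch` and the sample inputs are illustrative assumptions and are not part of the library.

```ruby
# Standalone copy of the two checks shown above, so the example runs
# without loading the library or its models.
module FilterSketch
  def self.allowed_term?(p)
    return false if p.to_s =~ /^[^a-zA-Z]*$/ # no ASCII letters on some line
    return false if p.to_s.length > 255      # too long to be a useful term
    true
  end
end

samples = ["term extraction", "ruby 1.9", "2024", "3.14 / 2", "", "a" * 256]

samples.each do |s|
  # Truncate long samples so the printed line stays readable.
  puts "#{s[0, 20].inspect} => #{FilterSketch.allowed_term?(s)}"
end
```

Only the first two samples pass. The numeric strings and the empty string contain no ASCII letters, and the last sample exceeds 255 characters. Because `^` and `$` match at line boundaries in Ruby, a multi-line string such as `"abc\n123"` is also rejected when any single line contains no letters.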

.recombobulate_term(term) ⇒ `Object`

Take a sequence of tokens and turn them back into a term.

# File 'lib/term-extractor.rb', line 236

def self.recombobulate_term(term)
  term = term.join(" ")
  term.gsub!(/ '/, "'")
  term.gsub!(/ \./, ".")
  term
end
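
The sketch below copies the method into a standalone module to show the effect of the two substitutions. `RecombobulateSketch` is a hypothetical name, and the token arrays are assumed inputs: the shape of the tokens the library's tokenizer actually produces is not documented here.

```ruby
# Standalone copy of the method shown above, so the example runs
# without loading the library.
module RecombobulateSketch
  def self.recombobulate_term(term)
    term = term.join(" ")   # tokens separated by single spaces
    term.gsub!(/ '/, "'")   # re-attach apostrophe tokens to the previous word
    term.gsub!(/ \./, ".")  # re-attach full stops to the previous word
    term
  end
end

puts RecombobulateSketch.recombobulate_term(["the", "dog", "'s", "bone"]) # the dog's bone
puts RecombobulateSketch.recombobulate_term(["Mr", ".", "Smith"])         # Mr. Smith
puts RecombobulateSketch.recombobulate_term(["term", "extractor"])        # term extractor
```

Only the space before an apostrophe or a full stop is removed. Other punctuation tokens, such as commas, keep the space that `join` inserted.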

Instance Method Details

#extract_terms_from_sentence(sentence) ⇒ `Object`

Extract all terms in a given sentence.

# File 'lib/term-extractor.rb', line 211

def extract_terms_from_sentence(sentence)
  TermContext.new(self, sentence).terms
end

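#extract_terms_from_text(text) ⇒ `Object`

Extract all terms in a given text, sentence by sentence. With a block, each term is yielded after its sentence index has been set. Without a block, all terms are collected and returned as an array.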

# File 'lib/term-extractor.rb', line 215

def extract_terms_from_text(text)
  if block_given?
    nlp.sentences(text).each_with_index do |s, i|
      terms = extract_terms_from_sentence(s)
      terms.each{|p| p.sentence = i; yield(p) }
    end
  else
    results = []
    extract_terms_from_text(text){ |p| results << p }
    results
  end
end
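
The method supports two calling styles, which the following sketch illustrates with a stand-in class. `SentenceSketch`, its naive sentence splitter, the word-per-term extraction, and the `Term` struct are all simplifying assumptions made so that the example runs without the NLP models. Only the control flow of `extract_terms_from_text` mirrors the source above.

```ruby
# Hypothetical term object with a writable sentence index.
Term = Struct.new(:text, :sentence)

# Hypothetical stand-in for TermExtractor with trivial NLP substitutes.
class SentenceSketch
  # Naive splitter: break after ., ! or ? followed by whitespace.
  def sentences(text)
    text.split(/(?<=[.!?])\s+/)
  end

  # Treat every alphabetic word as a term (the real extraction is richer).
  def extract_terms_from_sentence(sentence)
    sentence.scan(/[A-Za-z]+/).map { |word| Term.new(word, nil) }
  end

  # Same control flow as the documented method: stream terms to a block,
  # or call itself with a collecting block and return the array.
  def extract_terms_from_text(text)
    if block_given?
      sentences(text).each_with_index do |s, i|
        extract_terms_from_sentence(s).each { |t| t.sentence = i; yield(t) }
      end
    else
      results = []
      extract_terms_from_text(text) { |t| results << t }
      results
    end
  end
end

sketch = SentenceSketch.new
text = "Ruby parses text. Terms appear."

# Block form: terms are streamed one at a time.
sketch.extract_terms_from_text(text) { |t| puts "#{t.sentence}: #{t.text}" }

# Array form: terms are collected and returned.
terms = sketch.extract_terms_from_text(text)
puts terms.length
```

The block form avoids building an intermediate array, which may matter for long documents. The array form is convenient when all terms are needed at once.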
