Class: TermExtractor
- Inherits: Object
  - Object
    - TermExtractor
- Defined in: lib/term-extractor.rb, lib/term-extractor/nlp.rb
Overview
A class for extracting useful snippets of text from a document.
Defined Under Namespace
Classes: NLP, TermContext
Instance Attribute Summary
- #max_term_length ⇒ Object
  Returns the value of attribute max_term_length.
- #nlp ⇒ Object
  Returns the value of attribute nlp.
- #remove_paths ⇒ Object
  Returns the value of attribute remove_paths.
- #remove_urls ⇒ Object
  Returns the value of attribute remove_urls.
Class Method Summary
- .allowed_term?(p) ⇒ Boolean
  Final post filter on terms to determine if they’re allowed.
- .recombobulate_term(term) ⇒ Object
  Take a sequence of tokens and turn them back into a term.
Instance Method Summary
- #extract_terms_from_sentence(sentence) ⇒ Object
  Extract all terms in a given sentence.
- #extract_terms_from_text(text) ⇒ Object
- #initialize(models = File.dirname(__FILE__) + "/../models") {|_self| ... } ⇒ TermExtractor (constructor)
  A new instance of TermExtractor.
Constructor Details
#initialize(models = File.dirname(__FILE__) + "/../models") {|_self| ... } ⇒ TermExtractor
Returns a new instance of TermExtractor.
    # File 'lib/term-extractor.rb', line 20
    def initialize(models = File.dirname(__FILE__) + "/../models")
      @nlp = NLP.new(models)

      # Empirically, terms longer than about 5 words seem to be either
      # too specific to be useful or very noisy.
      @max_term_length = 4

      self.remove_urls = true
      self.remove_paths = true

      yield self if block_given?
    end
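Because the constructor yields the new instance, its defaults can be overridden in a block. The following is a minimal sketch of that pattern, not the library itself: `SketchExtractor` is a hypothetical stand-in that keeps the same defaults but skips `NLP.new(models)`, so it runs without the gem or its model files. The use of `attr_accessor` is an assumption inferred from the setter calls in the listing.

```ruby
# Hypothetical stand-in for TermExtractor: same defaults and the same
# yield-self constructor, but without loading any NLP models.
class SketchExtractor
  # Assumed accessors; the real class exposes readers with these names.
  attr_accessor :max_term_length, :remove_urls, :remove_paths

  def initialize
    @max_term_length = 4
    self.remove_urls = true
    self.remove_paths = true

    # Hand the half-configured instance to the caller's block, if any.
    yield self if block_given?
  end
end

# No block: the defaults stay in place.
default_extractor = SketchExtractor.new

# With a block: selected defaults are overridden before use.
custom_extractor = SketchExtractor.new do |e|
  e.max_term_length = 3
  e.remove_urls = false
end

puts default_extractor.max_term_length
puts custom_extractor.max_term_length
puts custom_extractor.remove_paths
```

With the real class, the equivalent call would presumably take the form `TermExtractor.new { |te| te.max_term_length = 3 }`, assuming the default models directory is present.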
Instance Attribute Details
#max_term_length ⇒ Object
Returns the value of attribute max_term_length.
    # File 'lib/term-extractor.rb', line 18
    def max_term_length
      @max_term_length
    end
#nlp ⇒ Object
Returns the value of attribute nlp.
    # File 'lib/term-extractor.rb', line 18
    def nlp
      @nlp
    end
#remove_paths ⇒ Object
Returns the value of attribute remove_paths.
    # File 'lib/term-extractor.rb', line 18
    def remove_paths
      @remove_paths
    end
#remove_urls ⇒ Object
Returns the value of attribute remove_urls.
    # File 'lib/term-extractor.rb', line 18
    def remove_urls
      @remove_urls
    end
Class Method Details
.allowed_term?(p) ⇒ Boolean
Final post filter on terms to determine if they’re allowed.
    # File 'lib/term-extractor.rb', line 229
    def self.allowed_term?(p)
      return false if p.to_s =~ /^[^a-zA-Z]*$/ # We don't allow things which are just sequences of numbers
      return false if p.to_s.length > 255
      true
    end
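To make the filter's behaviour concrete, here is a small sketch that copies the two checks from the listing into a hypothetical module (`FilterSketch`, not part of the library) so it can run on its own. Note that the regular expression rejects any string containing no ASCII letters, which is slightly broader than "sequences of numbers": the empty string and punctuation-only strings are rejected too.

```ruby
# Standalone copy of the filter from the listing, wrapped in a
# hypothetical module so it runs without the library.
module FilterSketch
  def self.allowed_term?(p)
    return false if p.to_s =~ /^[^a-zA-Z]*$/ # no ASCII letters at all
    return false if p.to_s.length > 255      # longer than 255 characters
    true
  end
end

p FilterSketch.allowed_term?("web server") # contains letters  => true
p FilterSketch.allowed_term?("2024")       # digits only       => false
p FilterSketch.allowed_term?("")           # empty string      => false
p FilterSketch.allowed_term?("a" * 256)    # 256 characters    => false
```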
.recombobulate_term(term) ⇒ Object
Take a sequence of tokens and turn them back into a term.
    # File 'lib/term-extractor.rb', line 236
    def self.recombobulate_term(term)
      term = term.join(" ")
      term.gsub!(/ '/, "'")
      term.gsub!(/ \./, ".")
      term
    end
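As an illustration of what the two substitutions do, the sketch below copies the method body into a hypothetical module (`RecombobulateSketch`) and feeds it a token list. The token format, with the possessive clitic and the period split off as separate tokens, is an assumption about what the tokenizer produces; it is not confirmed by this page.

```ruby
# Standalone copy of the method from the listing, wrapped in a
# hypothetical module so it runs without the library.
module RecombobulateSketch
  def self.recombobulate_term(term)
    term = term.join(" ")   # tokens back into one space-separated string
    term.gsub!(/ '/, "'")   # re-attach apostrophes to the preceding token
    term.gsub!(/ \./, ".")  # re-attach periods to the preceding token
    term
  end
end

# Assumed tokenizer output: clitic and punctuation as separate tokens.
tokens = ["the", "server", "'s", "config", "."]
puts RecombobulateSketch.recombobulate_term(tokens)
# prints "the server's config."
```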
Instance Method Details
#extract_terms_from_sentence(sentence) ⇒ Object
Extract all terms in a given sentence.
    # File 'lib/term-extractor.rb', line 211
    def extract_terms_from_sentence(sentence)
      TermContext.new(self, sentence).terms
    end
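#extract_terms_from_text(text) ⇒ Object
Extract all terms in a text, sentence by sentence. With a block, each term is yielded after its sentence index has been set; without a block, the terms are returned as an array.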
    # File 'lib/term-extractor.rb', line 215
    def extract_terms_from_text(text)
      if block_given?
        nlp.sentences(text).each_with_index do |s, i|
          terms = extract_terms_from_sentence(s);
          terms.each{|p| p.sentence = i; yield(p) }
        end
      else
        results = []
        extract_terms_from_text(text){ |p| results << p }
        results
      end
    end
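The method supports two calling styles: streaming terms to a block, or collecting them into an array by calling itself with a block. The sketch below reproduces only that control flow. Everything else is a hypothetical stand-in so the example can run without the library: `SketchTextExtractor` and `SketchTerm` are invented names, sentences are split naively on periods instead of using `nlp.sentences`, and every word counts as a "term".

```ruby
# Hypothetical term type with the one attribute the listing sets: sentence.
SketchTerm = Struct.new(:text, :sentence)

# Hypothetical stand-in; only the block/no-block logic mirrors the listing.
class SketchTextExtractor
  # Stand-in for TermContext: every whitespace-separated word is a term.
  def extract_terms_from_sentence(sentence)
    sentence.split.map { |word| SketchTerm.new(word, nil) }
  end

  def extract_terms_from_text(text)
    if block_given?
      # Stand-in for nlp.sentences(text): naive split on periods.
      text.split(".").each_with_index do |s, i|
        extract_terms_from_sentence(s).each do |t|
          t.sentence = i # tag the term with its sentence index
          yield t
        end
      end
    else
      # No block: re-enter with a collecting block and return the array.
      results = []
      extract_terms_from_text(text) { |t| results << t }
      results
    end
  end
end

extractor = SketchTextExtractor.new

# Streaming style: handle each term as it is produced.
extractor.extract_terms_from_text("Fast parser. Small cache.") do |t|
  puts "#{t.sentence}: #{t.text}"
end

# Collecting style: get all terms back as an array.
terms = extractor.extract_terms_from_text("Fast parser. Small cache.")
puts terms.length
```

The block form avoids building an intermediate array, which may matter for long documents; the array form is simply a convenience wrapper over it.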