Module: BagOfWords

Defined in:
lib/rbbt/bow/bow.rb

Overview

This module provides methods to extract a bag-of-words (or bag-of-bigrams) representation from strings of text, and to produce a vector representation of that bag of words for a given list of terms. The BOW representation of the texts is usually first used to build a Dictionary; then, with the best selection of terms as determined by the Dictionary::TF_IDF.best or Dictionary::KL.best methods, it is used to determine the vector representation of each text.

Class Method Details

.bigrams(text) ⇒ Object

Take the array of words for the text and form all the bigrams (pairs of consecutive words). The returned array contains the words followed by the bigrams.



# File 'lib/rbbt/bow/bow.rb', line 29

def self.bigrams(text)
  words = words(text)
  bigrams = []
  lastword = nil

  words.each{|word|
    if lastword
      bigrams << "#{lastword} #{word}"
    end
    lastword = word
  }

  words + bigrams
end
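
The pairing step can be illustrated without the rest of the library. The following is a minimal sketch that assumes the text has already been split into words; `word_bigrams` is a hypothetical helper introduced here for illustration and is not part of rbbt. It uses Ruby's `each_cons` in place of the explicit `lastword` loop.

```ruby
# Standalone sketch: join each word with its successor, then append the
# resulting pairs to the original word list, mirroring the return value
# of BagOfWords.bigrams. `word_bigrams` is a hypothetical helper.
def word_bigrams(words)
  pairs = words.each_cons(2).map { |first, second| "#{first} #{second}" }
  words + pairs
end

sample = %w[gene express level]
result = word_bigrams(sample)
# result is ["gene", "express", "level", "gene express", "express level"]
```

An input with fewer than two words yields no pairs, so the result is just the word list itself.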

.count(terms) ⇒ Object

Given an array of terms, return a hash with the number of appearances of each term. The hash has a default value of 0, so terms that are not in the array report a count of 0.



# File 'lib/rbbt/bow/bow.rb', line 46

def self.count(terms)
  count = Hash.new(0)
  terms.each{|word| count[word] += 1}
  count
end
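
As a quick illustration of the counting behavior, here is a standalone sketch with the same logic; `count_terms` is a hypothetical name chosen so the example runs without rbbt.

```ruby
# Standalone sketch of the counting step. Hash.new(0) makes every
# missing key read as 0, so lookups never return nil.
def count_terms(terms)
  counts = Hash.new(0)
  terms.each { |term| counts[term] += 1 }
  counts
end

counts = count_terms(%w[cell gene cell])
counts["cell"]     # 2
counts["gene"]     # 1
counts["protein"]  # 0 (reading a missing key does not store it)
```

On Ruby 2.7 and later, `terms.tally` produces the same counts, but the resulting hash has no default value, so missing terms would read as `nil` rather than 0.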

.features(text, terms, bigrams = nil) ⇒ Object

Given a string of text and a list of terms, which may or may not contain bigrams, return an array with one entry per term holding the number of occurrences of that term in the text. If the bigrams argument is not given, bigrams are counted whenever any term contains a space.



# File 'lib/rbbt/bow/bow.rb', line 67

def self.features(text, terms, bigrams = nil)
  bigrams ||= terms.any?{|term| term =~ / /}
  count = bigrams ? count(bigrams(text)) : count(words(text))
  count.values_at(*terms)
end
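
The vector is built by looking up each term in the count hash, in the order the terms are given. A minimal sketch of that lookup follows, assuming the text has already been turned into a token list that includes bigrams; `feature_vector` is a hypothetical helper, not an rbbt method.

```ruby
# Standalone sketch: count the tokens, then read the counts back in the
# order of `terms`. Hash#values_at honors the hash default, so terms
# that never occur come back as 0.
def feature_vector(tokens, terms)
  counts = Hash.new(0)
  tokens.each { |token| counts[token] += 1 }
  counts.values_at(*terms)
end

tokens = ["gene", "express", "gene", "gene express", "express gene"]
terms  = ["gene", "gene express", "protein"]
vector = feature_vector(tokens, terms)
# vector is [2, 1, 0]
```

Because the output follows the order of `terms`, vectors computed for different texts with the same term list are directly comparable position by position.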

.terms(text, bigrams = true) ⇒ Object

Given a string of text, find all the words (and, when bigrams is true, all the bigrams as well) and return a hash with their counts.



# File 'lib/rbbt/bow/bow.rb', line 55

def self.terms(text, bigrams = true)

  if bigrams
    count(bigrams(text))
  else
    count(words(text))
  end
end

.words(text) ⇒ Object

Divide the input string into an array of words (sequences of \w characters). Words are downcased, stemmed, and filtered to remove stopwords, words with fewer than 3 characters, and words that contain no letters. The list of stopwords is a global variable defined in 'rbbt/util/misc'.



# File 'lib/rbbt/bow/bow.rb', line 16

def self.words(text)
  return [] if text.nil?
  raise "Stopword list not loaded. Have you installed the wordlists? (rbbt_config prepare wordlists)" if $stopwords.nil?
  text.scan(/\w+/).
    collect{|word| word.downcase.stem}.
    select{|word|
      ! $stopwords.include?(word) &&
        word.length > 2 &&
        word =~ /[a-z]/
    }
end
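
The filtering rules can be tried out in isolation. The sketch below is a simplified stand-in, not the library method: it omits stemming (which depends on a `String#stem` extension) and replaces the global `$stopwords` with a small sample set. Both `simple_words` and `SAMPLE_STOPWORDS` are hypothetical names used only for this example.

```ruby
require "set"

# Hypothetical stand-in for the global stopword list.
SAMPLE_STOPWORDS = Set.new(%w[the and was were])

# Standalone sketch of the tokenize-and-filter step, without stemming.
# Keeps words that are not stopwords, are longer than 2 characters,
# and contain at least one lowercase letter.
def simple_words(text, stopwords = SAMPLE_STOPWORDS)
  return [] if text.nil?
  text.scan(/\w+/)
      .map(&:downcase)
      .select { |word| !stopwords.include?(word) && word.length > 2 && word =~ /[a-z]/ }
end

tokens = simple_words("The p53 gene was expressed in 2010")
# tokens is ["p53", "gene", "expressed"]
```

Here "the" and "was" are dropped as stopwords, "in" is too short, and "2010" contains no letters, while "p53" survives because it has three characters and includes a letter.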