Module: BagOfWords
- Defined in:
- lib/rbbt/bow/bow.rb
Overview
This module provides methods to extract a bag-of-words (or bag-of-bigrams) representation from strings of text, and to produce a vector representation of that bag of words for a given list of terms. The BOW representation of the texts is usually first used to build a Dictionary; then, with the best selection of terms as determined by the Dictionary::TF_IDF.best or Dictionary::KL.best methods, it is used to determine the vector representation of each text.
Class Method Summary
-
.bigrams(text) ⇒ Object
Take the array of words for the text and form all the bigrams.
-
.count(terms) ⇒ Object
Given an array of terms return a hash with the number of appearances of each term.
-
.features(text, terms, bigrams = nil) ⇒ Object
Given a string of text and a list of terms, which may or may not contain bigrams, return an array with one entry per term which holds the number of occurrences of each term in the text.
-
.terms(text, bigrams = true) ⇒ Object
Given a string of text find all the words (or bigrams) and return a hash with their counts.
-
.words(text) ⇒ Object
Divide the input string into an array of words (sequences of \w characters).
Class Method Details
.bigrams(text) ⇒ Object
Take the array of words for the text and form all the bigrams. The returned array contains the single words followed by the bigrams.
    # File 'lib/rbbt/bow/bow.rb', line 29
    def self.bigrams(text)
      words = words(text)
      bigrams = []
      lastword = nil

      words.each{|word|
        if lastword
          bigrams << "#{lastword} #{word}"
        end
        lastword = word
      }

      words + bigrams
    end
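As a rough illustration of the return value, the sketch below reproduces the same pairing logic on an array that is already tokenized. `bigrams_from_words` is a hypothetical helper, not part of BagOfWords, and it skips the stemming and stopword filtering that `words` performs.

```ruby
# Hypothetical stand-in for BagOfWords.bigrams that takes an array of
# words directly instead of calling BagOfWords.words on a string.
def bigrams_from_words(words)
  # each_cons(2) yields every pair of adjacent words
  pairs = words.each_cons(2).map { |first, second| "#{first} #{second}" }
  # Like the original, return the single words followed by the bigrams
  words + pairs
end

p bigrams_from_words(%w[gene express level])
# ["gene", "express", "level", "gene express", "express level"]
```

An input with fewer than two words yields no bigrams, so the result is just the words themselves.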
.count(terms) ⇒ Object
Given an array of terms return a hash with the number of appearances of each term
    # File 'lib/rbbt/bow/bow.rb', line 46
    def self.count(terms)
      count = Hash.new(0)
      terms.each{|word| count[word] += 1}
      count
    end
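The counting idiom relies on `Hash.new(0)`, which returns 0 for any key that has not been seen. A minimal sketch, using a hypothetical `count_terms` method that mirrors the listing above outside the module:

```ruby
# Hypothetical stand-in mirroring BagOfWords.count
def count_terms(terms)
  counts = Hash.new(0) # unseen keys read as 0
  terms.each { |term| counts[term] += 1 }
  counts
end

counts = count_terms(%w[gene protein gene])
p counts["gene"]    # 2
p counts["protein"] # 1
p counts["kinase"]  # 0 (default value; the key is not added to the hash)
```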
.features(text, terms, bigrams = nil) ⇒ Object
Given a string of text and a list of terms, which may or may not contain bigrams, return an array with one entry per term which holds the number of occurrences of each term in the text.
    # File 'lib/rbbt/bow/bow.rb', line 67
    def self.features(text, terms, bigrams = nil)
      bigrams ||= terms.select{|term| term =~ / /}.any?
      count = bigrams ? count(bigrams(text)) : count(words(text))
      count.values_at(*terms)
    end
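The sketch below illustrates how the vector lines up with the term list, including the automatic switch to bigram counting when any term contains a space. `feature_vector` is a hypothetical helper that takes pre-tokenized words, so stemming and stopword removal are left out. Note that because the original uses `||=`, passing `false` for `bigrams` behaves the same as passing `nil`.

```ruby
# Hypothetical helper: builds the vector from words that are already
# tokenized (the real method calls BagOfWords.words or BagOfWords.bigrams).
def feature_vector(words, terms)
  # Bigram terms contain a space; count bigrams only if one is present
  use_bigrams = terms.any? { |term| term.include?(" ") }
  tokens = words.dup
  tokens += words.each_cons(2).map { |a, b| "#{a} #{b}" } if use_bigrams

  counts = Hash.new(0)
  tokens.each { |token| counts[token] += 1 }
  # values_at falls back to the hash default (0) for terms not in the text
  counts.values_at(*terms)
end

words = %w[gene express gene level]
p feature_vector(words, ["gene", "kinase", "level"]) # [2, 0, 1]
p feature_vector(words, ["gene", "express gene"])    # [2, 1]
```

The result always has the same length and order as `terms`, which is what makes it usable as a fixed-width feature vector.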
.terms(text, bigrams = true) ⇒ Object
Given a string of text find all the words (or bigrams) and return a hash with their counts
    # File 'lib/rbbt/bow/bow.rb', line 55
    def self.terms(text, bigrams = true)
      if bigrams
        count(bigrams(text))
      else
        count(words(text))
      end
    end
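To make the shape of the returned hash concrete, here is a minimal sketch that works on an array that is already tokenized. `term_counts` is a hypothetical helper, not part of the module, and it omits the stemming and stopword filtering done by `words`.

```ruby
# Hypothetical stand-in for BagOfWords.terms operating on pre-tokenized words.
def term_counts(words, bigrams = true)
  tokens = words.dup
  # With bigrams enabled, adjacent word pairs are counted alongside single words
  tokens += words.each_cons(2).map { |a, b| "#{a} #{b}" } if bigrams
  tokens.each_with_object(Hash.new(0)) { |token, counts| counts[token] += 1 }
end

words = %w[cell cycle cell]
p term_counts(words, false).keys # ["cell", "cycle"]
p term_counts(words).keys        # ["cell", "cycle", "cell cycle", "cycle cell"]
```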
.words(text) ⇒ Object
Divide the input string into an array of words (sequences of \w characters). Words are downcased and stemmed, then filtered to remove stopwords, words of 2 characters or fewer, and words that contain no letters. The list of stopwords is the global variable $stopwords, defined in ‘rbbt/util/misc’.
    # File 'lib/rbbt/bow/bow.rb', line 16
    def self.words(text)
      return [] if text.nil?
      raise "Stopword list not loaded. Have you installed the wordlists? (rbbt_config prepare wordlists)" if $stopwords.nil?
      text.scan(/\w+/).
        collect{|word| word.downcase.stem}.
        select{|word|
          ! $stopwords.include?(word) &&
          word.length > 2 &&
          word =~ /[a-z]/
        }
    end
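The filtering steps can be tried in isolation with the simplified sketch below. It is not the library's tokenizer: `simple_words` is a hypothetical method, stemming is left out, and the three-word stopword list is an assumption standing in for the global `$stopwords` that rbbt loads.

```ruby
# Hypothetical simplified tokenizer: same scan and filter steps as
# BagOfWords.words, but without stemming. The default stopword list is an
# illustrative assumption, not the list shipped with rbbt.
def simple_words(text, stopwords = %w[the and for])
  return [] if text.nil?

  text.scan(/\w+/)                               # runs of word characters
      .map(&:downcase)
      .reject { |word| stopwords.include?(word) } # drop stopwords
      .select { |word| word.length > 2 && word =~ /[a-z]/ }
end

p simple_words("The p53 gene and 2024 of it") # ["p53", "gene"]
```

Here "2024" is dropped because it contains no letter, and "of" and "it" are dropped because they are not longer than 2 characters.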