Class: TeRex::Classifier::BayesData

Inherits:
Object
Defined in:
lib/te_rex/bayes_data.rb

Class Method Details

.clean(text) ⇒ Object

Return text with date/time and money terms replaced by placeholder tokens, punctuation replaced by spaces, whitespace normalized, and ordinal terms (such as 23rd and 42nd) removed. At one point we were stripping every non-word character except spaces (/[^\w\s]/) with `gsub(/[^\w\s]/, "")`, but I took it out because it removed some punctuation needed to distinguish some classes.



# File 'lib/te_rex/bayes_data.rb', line 48

def clean(text)
  dt = date_time(text)
  mt = money_term(dt)
  rp = remove_punct(mt)
  sp = remove_big_space(rp)
  ss = remove_space_seq(sp)
  remove_cardinal(ss)
end
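
To make the order of the steps concrete, here is a minimal standalone sketch of the same pipeline. The module name `CleanSketch` and the sample sentence are assumptions made for illustration; the patterns are copied from the helper listings on this page, and the sketch does not load the TeRex gem itself.

```ruby
# Hypothetical standalone copy of the cleaning steps (not the TeRex class).
module CleanSketch
  DATE_TIME = /(^\d+)|(\s\d+(AM|PM))|(\d{2}\w{3}\d{2})|(\d{2}\:\d{2})|(\d{2,4}\-\d{2,4}-\d{2,4})|(\d{1,4}\/\d{2,4}\/\d{2,4})|(\d+\:\d+)/
  MONEY     = /(\$\d+\.\d+)|(\$\d+)|(\d+\.\d+)/
  PUNCT     = /(\,)|(\?)|(\.)|(\!)|(\;)|(\:)|(\")|(\@)|(\#)|(\$)|(\^)|(\&)|(\*)|(\()|(\))|(\_)|(\=)|(\+)|(\[)|(\])|(\\)|(\|)|(\<)|(\>)|(\/)|(\`)|(\{)|(\})/

  module_function

  def clean(text)
    dt = text.gsub(DATE_TIME, 'datetime')    # date_time
    mt = dt.gsub(MONEY, 'moneyterm')         # money_term
    rp = mt.gsub(PUNCT, ' ')                 # remove_punct
    sp = rp.gsub(/\n|\t|\r/, ' ')            # remove_big_space
    ss = sp.gsub(/\s{2,}/, ' ')              # remove_space_seq
    ss.gsub(/[0-9]{2}[a-z,A-Z]{2}/, '')      # remove_cardinal
  end
end

cleaned = CleanSketch.clean('Paid $60 on 02-23-14, thanks!')
p cleaned.split
# => ["Paid", "moneyterm", "on", "datetime", "thanks"]
```

The money step has to run before punctuation removal, because `remove_punct` would otherwise strip the `$` and `.` that `money_term` matches on. Since the last step deletes text after whitespace has already been collapsed, it can leave a double space behind; the later `split` absorbs that.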

.clean_filtered_index(text) ⇒ Object

Return a filtered word-frequency index built from the cleaned text, without extra punctuation or short words.



# File 'lib/te_rex/bayes_data.rb', line 63

def clean_filtered_index(text)
  filtered_index clean(text).split
end

.clean_naive_index(text) ⇒ Object

Return a word-frequency index built from the cleaned text, without downcasing, stemming, or stop-list filtering.



# File 'lib/te_rex/bayes_data.rb', line 68

def clean_naive_index(text)
  naive_index clean(text).split
end
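
The `naive_index` helper is not listed on this page. As a rough sketch of what a naive word-frequency index could look like, assuming it counts each token as-is and keys the result by interned symbols, consider the following. The method name `naive_index_sketch` is hypothetical.

```ruby
# Hypothetical stand-in for naive_index: count tokens exactly as given,
# with no downcasing, stemming, or stop-list filtering.
def naive_index_sketch(words)
  words.each_with_object(Hash.new(0)) do |word, index|
    index[word.intern] += 1
  end
end

index = naive_index_sketch(%w[The cat saw the cat])
p index
# => {:The=>1, :cat=>2, :saw=>1, :the=>1}
```

Because nothing is downcased, `The` and `the` are counted as separate entries.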

.clean_stemmed_filtered_index(text) ⇒ Object

Return a filtered word-frequency index of stemmed words built from the cleaned text, without extra punctuation or short words.



# File 'lib/te_rex/bayes_data.rb', line 58

def clean_stemmed_filtered_index(text)
  stemmed_filtered_index clean(text).split
end

.date_time(s) ⇒ Object

Replace date and time expressions (09MAR04, 02-23-14, 2014/03/05) with the token 'datetime'.



# File 'lib/te_rex/bayes_data.rb', line 29

def date_time(s)
  s.gsub(/(^\d+)|(\s\d+(AM|PM))|(\d{2}\w{3}\d{2})|(\d{2}\:\d{2})|(\d{2,4}\-\d{2,4}-\d{2,4})|(\d{1,4}\/\d{2,4}\/\d{2,4})|(\d+\:\d+)/, 'datetime')
end
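
The pattern has several alternatives, and a few of them match more than the description suggests. The standalone sketch below copies the pattern into a hypothetical constant `DATE_TIME` and runs it on sample strings chosen for illustration.

```ruby
# Pattern copied from the listing above; the constant name is hypothetical.
DATE_TIME = /(^\d+)|(\s\d+(AM|PM))|(\d{2}\w{3}\d{2})|(\d{2}\:\d{2})|(\d{2,4}\-\d{2,4}-\d{2,4})|(\d{1,4}\/\d{2,4}\/\d{2,4})|(\d+\:\d+)/

def date_time(s)
  s.gsub(DATE_TIME, 'datetime')
end

p date_time('Flight on 09MAR04')  # => "Flight on datetime"
p date_time('Due 02-23-14')       # => "Due datetime"
p date_time('Filed 2014/03/05')   # => "Filed datetime"
p date_time('Meet at 10:30')      # => "Meet at datetime"

# Two side effects worth knowing about:
p date_time('2014 budget')        # => "datetime budget" (any leading number matches ^\d+)
p date_time('Call 9AM')           # => "Calldatetime" (the AM/PM branch consumes the leading space)
```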

.index_frequency(text) ⇒ Object

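Return a Hash index of word => instance count. Each word in the string is interned, and its value is the number of times it occurs in the document. The stemmed and unstemmed filtered indexes are merged, so where both contain the same key the unstemmed count is kept.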



# File 'lib/te_rex/bayes_data.rb', line 40

def index_frequency(text)
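  # Stemmed and filtered counts
  cfi = clean_stemmed_filtered_index(text)
  # Unstemmed filtered counts (despite the name, this is not the naive index)
  cni = clean_filtered_index(text)
  # Hash#merge keeps cni's count wherever both indexes share a key
  cfi.merge(cni)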
end
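
The merge step relies on the standard behavior of `Hash#merge`: when both hashes hold the same key, the value from the argument wins. The sketch below uses two small hypothetical indexes (the real ones come from helpers not shown on this page) to illustrate the result.

```ruby
# Hypothetical index contents, for illustration only.
stemmed  = { run: 3, quick: 1 }              # e.g. "running" and "run" stemmed together
filtered = { running: 2, run: 1, quick: 1 }  # unstemmed counts

merged = stemmed.merge(filtered)
p merged
# => {:run=>1, :quick=>1, :running=>2}

# An alternative that is NOT what the listing above does: pass a block to
# Hash#merge to combine the two counts instead of overwriting one of them.
summed = stemmed.merge(filtered) { |_key, a, b| a + b }
p summed
# => {:run=>4, :quick=>2, :running=>2}
```

With the plain merge, the stemmed count for `run` (3) is replaced by the unstemmed count (1), so stemmed counts only survive for stems that never appear verbatim in the text.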

.money_term(s) ⇒ Object

Replace money amounts ($60, 120.00, $423.89) with the token 'moneyterm'.



# File 'lib/te_rex/bayes_data.rb', line 34

def money_term(s)
  s.gsub(/(\$\d+\.\d+)|(\$\d+)|(\d+\.\d+)/, 'moneyterm')
end
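
A short standalone check of what the pattern does and does not match, using sample strings chosen for illustration:

```ruby
# Pattern copied from the listing above.
def money_term(s)
  s.gsub(/(\$\d+\.\d+)|(\$\d+)|(\d+\.\d+)/, 'moneyterm')
end

p money_term('Paid $60 and 120.00')  # => "Paid moneyterm and moneyterm"
p money_term('Total: $423.89')       # => "Total: moneyterm"
p money_term('version 3.14')         # => "version moneyterm" (any bare decimal matches)
p money_term('60 items')             # => "60 items" (bare integers are left alone)
```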

.remove_big_space(s) ⇒ Object

Replace each newline, tab, or carriage return with a single space.



# File 'lib/te_rex/bayes_data.rb', line 14

def remove_big_space(s)
  s.gsub(/\n|\t|\r/,' ')
end
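
Each control character is replaced one for one, so a Windows line ending turns into two spaces. A minimal standalone illustration:

```ruby
# Pattern copied from the listing above.
def remove_big_space(s)
  s.gsub(/\n|\t|\r/, ' ')
end

p remove_big_space("line one\r\nline two\tend")
# => "line one  line two end"
```

The doubled space is cleaned up afterwards by `remove_space_seq` in the `clean` pipeline.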

.remove_cardinal(s) ⇒ Object

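Remove ordinal-style terms (23rd, 42nd). The pattern matches exactly two digits followed by two characters that are each a letter or a comma, so single-digit ordinals such as 1st are left in place.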



# File 'lib/te_rex/bayes_data.rb', line 24

def remove_cardinal(s)
  s.gsub(/[0-9]{2}[a-z,A-Z]{2}/, '')
end
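
The sketch below shows what the pattern matches in practice, followed by a stricter alternative. The constant `ORDINAL_SKETCH` is a hypothetical pattern offered for comparison only; it is not part of the library.

```ruby
# Pattern copied from the listing above.
def remove_cardinal(s)
  s.gsub(/[0-9]{2}[a-z,A-Z]{2}/, '')
end

p remove_cardinal('the 23rd floor')   # => "the  floor"
p remove_cardinal('the 1st floor')    # => "the 1st floor" (needs two digits)
p remove_cardinal('the 123rd floor')  # => "the 1 floor" (only the last two digits match)
p remove_cardinal('open at 10am')     # => "open at " (any two letters match)

# Hypothetical alternative: whole numbers followed by an ordinal suffix.
ORDINAL_SKETCH = /\b\d+(?:st|nd|rd|th)\b/i

p 'the 1st floor'.gsub(ORDINAL_SKETCH, '')    # => "the  floor"
p 'the 123rd floor'.gsub(ORDINAL_SKETCH, '')  # => "the  floor"
p 'open at 10am'.gsub(ORDINAL_SKETCH, '')     # => "open at 10am"
```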

.remove_punct(s) ⇒ Object

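Replace each explicitly listed punctuation character with a space. Apostrophes, hyphens, percent signs, and tildes are not in the list and are left untouched.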



# File 'lib/te_rex/bayes_data.rb', line 9

def remove_punct(s)
  s.gsub(/(\,)|(\?)|(\.)|(\!)|(\;)|(\:)|(\")|(\@)|(\#)|(\$)|(\^)|(\&)|(\*)|(\()|(\))|(\_)|(\=)|(\+)|(\[)|(\])|(\\)|(\|)|(\<)|(\>)|(\/)|(\`)|(\{)|(\})/, ' ')
end
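
A standalone check of which characters are affected, with the pattern copied from the listing and sample strings chosen for illustration:

```ruby
# Pattern copied from the listing above; the constant name is hypothetical.
PUNCT = /(\,)|(\?)|(\.)|(\!)|(\;)|(\:)|(\")|(\@)|(\#)|(\$)|(\^)|(\&)|(\*)|(\()|(\))|(\_)|(\=)|(\+)|(\[)|(\])|(\\)|(\|)|(\<)|(\>)|(\/)|(\`)|(\{)|(\})/

def remove_punct(s)
  s.gsub(PUNCT, ' ')
end

p remove_punct('a,b')               # => "a b"
p remove_punct('x_y')               # => "x y"
p remove_punct('user@example.com')  # => "user example com"
p remove_punct("don't-stop 50%")    # => "don't-stop 50%" (none of these are listed)
```

Replacing with a space rather than an empty string keeps the words on either side of the punctuation from being joined into one token.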

.remove_space_seq(s) ⇒ Object

Collapse each run of two or more whitespace characters into a single space.



# File 'lib/te_rex/bayes_data.rb', line 19

def remove_space_seq(s)
  s.gsub(/\s{2,}/,' ')
end
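
A small standalone illustration of the collapse, including the case it does not touch:

```ruby
# Pattern copied from the listing above.
def remove_space_seq(s)
  s.gsub(/\s{2,}/, ' ')
end

p remove_space_seq("a   b\n\nc")  # => "a b c"
p remove_space_seq("a \n b")      # => "a b"
p remove_space_seq("a\tb")        # => "a\tb" (a single whitespace character is not a run)
```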