Class: TeRex::Classifier::BayesData

Inherits: Object

Defined in: lib/te_rex/bayes_data.rb

Class Method Summary

.clean(text) ⇒ Object

Return text with date/time and money terms replaced, and with ordinal terms (e.g. 23rd, 42nd) and punctuation removed.

.clean_filtered_index(text) ⇒ Object

Return a filtered word-frequency index without extra punctuation or short words.

.clean_naive_index(text) ⇒ Object

Return a word-frequency index without downcasing, stemming, or stop-list filtering.

.clean_stemmed_filtered_index(text) ⇒ Object

Return a filtered word-frequency index with stemmed morphemes and without extra punctuation or short words.

.date_time(s) ⇒ Object

Replace date and time expressions (09MAR04, 02-23-14, 2014/03/05) with the token datetime.

.index_frequency(text) ⇒ Object

Return a hashed index of words => instance_count.

.money_term(s) ⇒ Object

Replace money amounts ($60, 120.00, $423.89) with the token moneyterm.

.remove_big_space(s) ⇒ Object

Replace each tab, newline, or carriage return with a space.

.remove_cardinal(s) ⇒ Object

Remove ordinal terms (e.g. 23rd, 42nd).

.remove_punct(s) ⇒ Object

Replace each explicitly listed punctuation character with a space.

.remove_space_seq(s) ⇒ Object

Collapse sequences of whitespace into a single space.

Class Method Details

.clean(text) ⇒ `Object`

Return text with date/time and money terms replaced, and with ordinal terms (e.g. 23rd, 42nd) and punctuation removed. An earlier version also stripped every non-word character except whitespace (`/[^\w\s]/`, via `gsub(/[^\w\s]/, "")`), but that step was taken out because it removed punctuation needed to distinguish some classes.

# File 'lib/te_rex/bayes_data.rb', line 48

def clean(text)
  dt = date_time(text)
  mt = money_term(dt)
  rp = remove_punct(mt)
  sp = remove_big_space(rp)
  ss = remove_space_seq(sp)
  remove_cardinal(ss)
end
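
To make the order of the steps concrete, here is a minimal standalone sketch that copies the regexes from the source listings on this page into one module. The module name `CleanSketch` and the sample string are assumptions for illustration only; they are not part of te_rex.

```ruby
# Standalone sketch: CleanSketch is a hypothetical name, not part of te_rex.
# Each regex is copied from the source listings on this page.
module CleanSketch
  module_function

  def date_time(s)
    s.gsub(/(^\d+)|(\s\d+(AM|PM))|(\d{2}\w{3}\d{2})|(\d{2}\:\d{2})|(\d{2,4}\-\d{2,4}-\d{2,4})|(\d{1,4}\/\d{2,4}\/\d{2,4})|(\d+\:\d+)/, 'datetime')
  end

  def money_term(s)
    s.gsub(/(\$\d+\.\d+)|(\$\d+)|(\d+\.\d+)/, 'moneyterm')
  end

  def remove_punct(s)
    s.gsub(/(\,)|(\?)|(\.)|(\!)|(\;)|(\:)|(\")|(\@)|(\#)|(\$)|(\^)|(\&)|(\*)|(\()|(\))|(\_)|(\=)|(\+)|(\[)|(\])|(\\)|(\|)|(\<)|(\>)|(\/)|(\`)|(\{)|(\})/, ' ')
  end

  def remove_big_space(s)
    s.gsub(/\n|\t|\r/, ' ')
  end

  def remove_space_seq(s)
    s.gsub(/\s{2,}/, ' ')
  end

  def remove_cardinal(s)
    s.gsub(/[0-9]{2}[a-z,A-Z]{2}/, '')
  end

  # Same order of steps as the clean listing above.
  def clean(text)
    dt = date_time(text)
    mt = money_term(dt)
    rp = remove_punct(mt)
    sp = remove_big_space(rp)
    ss = remove_space_seq(sp)
    remove_cardinal(ss)
  end
end

sample = "Paid $423.89 on 2014/03/05,\tthanks!"
p CleanSketch.clean(sample) # "Paid moneyterm on datetime thanks "
```

The date and money substitutions run before punctuation removal, so the `/`, `$` and `.` characters are still present when those patterns are applied. The trailing space in the result comes from the `!` being replaced by a space rather than deleted.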

.clean_filtered_index(text) ⇒ `Object`

Return a filtered word-frequency index without extra punctuation or short words.

# File 'lib/te_rex/bayes_data.rb', line 63

def clean_filtered_index(text)
  filtered_index clean(text).split
end

.clean_naive_index(text) ⇒ `Object`

Return a word-frequency index without downcasing, stemming, or stop-list filtering.

# File 'lib/te_rex/bayes_data.rb', line 68

def clean_naive_index(text)
  naive_index clean(text).split
end

.clean_stemmed_filtered_index(text) ⇒ `Object`

Return a filtered word-frequency index with stemmed morphemes and without extra punctuation or short words.

# File 'lib/te_rex/bayes_data.rb', line 58

def clean_stemmed_filtered_index(text)
  stemmed_filtered_index clean(text).split
end

.date_time(s) ⇒ `Object`

Replace date and time expressions (09MAR04, 02-23-14, 2014/03/05) with the token `datetime`.

# File 'lib/te_rex/bayes_data.rb', line 29

def date_time(s)
  s.gsub(/(^\d+)|(\s\d+(AM|PM))|(\d{2}\w{3}\d{2})|(\d{2}\:\d{2})|(\d{2,4}\-\d{2,4}-\d{2,4})|(\d{1,4}\/\d{2,4}\/\d{2,4})|(\d+\:\d+)/, 'datetime')
end
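
The following sketch runs the pattern from the listing above on a few sample strings. The top-level `date_time` method is a standalone stand-in so the example runs without te_rex loaded, and the sample inputs are made up for illustration.

```ruby
# Pattern copied from the date_time listing above; this top-level method is a
# stand-in for illustration, not the library's own entry point.
def date_time(s)
  s.gsub(/(^\d+)|(\s\d+(AM|PM))|(\d{2}\w{3}\d{2})|(\d{2}\:\d{2})|(\d{2,4}\-\d{2,4}-\d{2,4})|(\d{1,4}\/\d{2,4}\/\d{2,4})|(\d+\:\d+)/, 'datetime')
end

puts date_time('Shipped 09MAR04')  # Shipped datetime
puts date_time('Due 02-23-14')     # Due datetime
puts date_time('Filed 2014/03/05') # Filed datetime
puts date_time('Meet at 9AM')      # Meet atdatetime
puts date_time('42 widgets')       # datetime widgets
```

Two behaviours are worth noticing: the AM/PM alternative consumes the whitespace in front of the number, and the `^\d+` alternative replaces any run of digits at the start of a line, whether or not it is a date.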

.index_frequency(text) ⇒ `Object`

Return a hashed index of words => instance_count. Each word in the text is interned, and its value is the number of times it occurs in the document.

# File 'lib/te_rex/bayes_data.rb', line 40

def index_frequency(text)
  cfi = clean_stemmed_filtered_index(text)
  cni = clean_filtered_index(text)
  cfi.merge(cni)
end
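
The listing combines the two indexes with `Hash#merge`, which keeps the argument's value whenever a key appears in both hashes. The sketch below illustrates that with made-up index contents; the symbol keys, the counts, and the helper name `merge_indexes` are assumptions, since the output of `stemmed_filtered_index` and `filtered_index` is not shown on this page.

```ruby
# Hypothetical index contents, assuming symbol keys; the real helper output
# is not shown on this page.
stemmed  = { run: 3, jump: 2 }
filtered = { running: 2, run: 1, jumps: 1, jump: 1 }

# merge_indexes is a hypothetical helper mirroring cfi.merge(cni).
def merge_indexes(stemmed, filtered)
  stemmed.merge(filtered)
end

merged = merge_indexes(stemmed, filtered)
p merged[:run]     # 1 (the value from filtered, not 3)
p merged[:running] # 2 (key only present in filtered)
p merged.size      # 4
```

Under these assumptions, counts for shared keys are overwritten rather than summed, so a stem that also occurs as a surface word ends up with the surface word's count.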

.money_term(s) ⇒ `Object`

Replace money amounts ($60, 120.00, $423.89) with the token `moneyterm`.

# File 'lib/te_rex/bayes_data.rb', line 34

def money_term(s)
  s.gsub(/(\$\d+\.\d+)|(\$\d+)|(\d+\.\d+)/, 'moneyterm')
end
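
A short sketch of what the pattern above does and does not match. The top-level `money_term` method is a standalone copy for illustration, and the sample strings are invented.

```ruby
# Pattern copied from the money_term listing above; standalone stand-in.
def money_term(s)
  s.gsub(/(\$\d+\.\d+)|(\$\d+)|(\d+\.\d+)/, 'moneyterm')
end

puts money_term('$60')           # moneyterm
puts money_term('120.00')        # moneyterm
puts money_term('$423.89 total') # moneyterm total
puts money_term('version 2.5')   # version moneyterm
puts money_term('60 dollars')    # 60 dollars
```

Any bare decimal number is treated as money, while a bare integer without a `$` prefix is left as it is.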

.remove_big_space(s) ⇒ `Object`

Replace each tab, newline, or carriage return with a single space.

# File 'lib/te_rex/bayes_data.rb', line 14

def remove_big_space(s)
  s.gsub(/\n|\t|\r/,' ')
end

.remove_cardinal(s) ⇒ `Object`

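Remove ordinal terms such as 23rd or 42nd. As written, the pattern matches exactly two digits followed by two characters that are letters or commas, so a single-digit ordinal such as 1st is left in place.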

# File 'lib/te_rex/bayes_data.rb', line 24

def remove_cardinal(s)
  s.gsub(/[0-9]{2}[a-z,A-Z]{2}/, '')
end
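
The sketch below applies the pattern from the listing above to a few invented inputs, to show which ordinals it actually touches. The top-level method is a standalone copy for illustration.

```ruby
# Pattern copied from the remove_cardinal listing above; standalone stand-in.
def remove_cardinal(s)
  s.gsub(/[0-9]{2}[a-z,A-Z]{2}/, '')
end

p remove_cardinal('the 23rd floor') # "the  floor"
p remove_cardinal('1st place')      # "1st place"
p remove_cardinal('the 101st time') # "the 1 time"
```

Because the match is deleted rather than replaced by a space, and `clean` calls this step after `remove_space_seq`, a removed ordinal can leave two adjacent spaces in the cleaned text.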

.remove_punct(s) ⇒ `Object`

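Replace each explicitly listed punctuation character with a space. Characters that are not in the list, such as apostrophes, hyphens, and percent signs, are left in place.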

# File 'lib/te_rex/bayes_data.rb', line 9

def remove_punct(s)
  s.gsub(/(\,)|(\?)|(\.)|(\!)|(\;)|(\:)|(\")|(\@)|(\#)|(\$)|(\^)|(\&)|(\*)|(\()|(\))|(\_)|(\=)|(\+)|(\[)|(\])|(\\)|(\|)|(\<)|(\>)|(\/)|(\`)|(\{)|(\})/, ' ')
end
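
A brief sketch of the effect of the pattern above, using invented inputs. The top-level `remove_punct` is a standalone copy for illustration.

```ruby
# Pattern copied from the remove_punct listing above; standalone stand-in.
def remove_punct(s)
  s.gsub(/(\,)|(\?)|(\.)|(\!)|(\;)|(\:)|(\")|(\@)|(\#)|(\$)|(\^)|(\&)|(\*)|(\()|(\))|(\_)|(\=)|(\+)|(\[)|(\])|(\\)|(\|)|(\<)|(\>)|(\/)|(\`)|(\{)|(\})/, ' ')
end

p remove_punct('yes,no')       # "yes no"
p remove_punct('a@b.com')      # "a b com"
p remove_punct("it's 50%-off") # "it's 50%-off"
```

Each listed character becomes a space, so tokens that were joined by punctuation are split apart, while the apostrophe, percent sign, and hyphen pass through unchanged.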

.remove_space_seq(s) ⇒ `Object`

Collapse each run of two or more whitespace characters into a single space.

# File 'lib/te_rex/bayes_data.rb', line 19

def remove_space_seq(s)
  s.gsub(/\s{2,}/,' ')
end
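
The two whitespace helpers are complementary, which the sketch below tries to show on an invented input. Both top-level methods are standalone copies of the listings on this page.

```ruby
# Patterns copied from the remove_big_space and remove_space_seq listings.
def remove_big_space(s)
  s.gsub(/\n|\t|\r/, ' ')
end

def remove_space_seq(s)
  s.gsub(/\s{2,}/, ' ')
end

raw = "line one\r\nline\t\ttwo\nend"
p remove_space_seq(raw)                   # "line one line two\nend"
p remove_space_seq(remove_big_space(raw)) # "line one line two end"
```

On its own, `remove_space_seq` leaves a lone newline or tab untouched because it only matches runs of two or more whitespace characters. Running `remove_big_space` first turns every tab, newline, and carriage return into a space, so the collapse step then produces single-spaced text.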
