Class: TeRex::Classifier::BayesData
- Inherits: Object
- Defined in: lib/te_rex/bayes_data.rb
Class Method Summary
- .clean(text) ⇒ Object
  Return text with date-time and money terms replaced by placeholder tokens, punctuation removed, whitespace normalized, and ordinal terms (1st, 23rd, 42nd) removed.
- .clean_filtered_index(text) ⇒ Object
  Return a filtered word frequency index without extra punctuation or short words.
- .clean_naive_index(text) ⇒ Object
  Return a word frequency index without downcasing, stemming, or stop-list filtering.
- .clean_stemmed_filtered_index(text) ⇒ Object
  Return a filtered word frequency index with stemmed morphemes and without extra punctuation or short words.
- .date_time(s) ⇒ Object
  Replace date and time expressions (09MAR04, 02-23-14, 2014/03/05) with the token 'datetime'.
- .index_frequency(text) ⇒ Object
  Return a hash index of word => instance count.
- .money_term(s) ⇒ Object
  Replace monetary amounts ($60, 120.00, $423.89) with the token 'moneyterm'.
- .remove_big_space(s) ⇒ Object
  Replace tabs, newlines, and carriage returns with a space.
- .remove_cardinal(s) ⇒ Object
  Remove ordinal terms (1st, 23rd, 42nd).
- .remove_punct(s) ⇒ Object
  Replace explicit punctuation characters with a space.
- .remove_space_seq(s) ⇒ Object
  Collapse runs of two or more whitespace characters into a single space.
Class Method Details
.clean(text) ⇒ Object
Return text with date-time and money terms replaced by placeholder tokens, punctuation removed, whitespace normalized, and ordinal terms (1st, 23rd, 42nd) removed. At one point we were replacing any non-word characters excluding spaces (/[^\w\s]/) with `gsub(/[^\w\s]/, "")`, but I took it out because it removed some punctuation needed to distinguish some classes.

    # File 'lib/te_rex/bayes_data.rb', line 48
    def clean(text)
      dt = date_time(text)        # date and time expressions -> 'datetime'
      mt = money_term(dt)         # monetary amounts -> 'moneyterm'
      rp = remove_punct(mt)       # punctuation -> space
      sp = remove_big_space(rp)   # tab, newline, carriage return -> space
      ss = remove_space_seq(sp)   # collapse runs of whitespace
      remove_cardinal(ss)         # drop ordinal terms such as 23rd
    end
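The order of the steps matters: the date-time and money patterns depend on characters such as "/", ":", "$" and ".", which remove_punct would otherwise turn into spaces. The standalone sketch below copies the regular expressions from the listings on this page into a hypothetical module named CleanSketch (it is not part of TeRex) so the pipeline can be traced on one invented input.

```ruby
# Hypothetical stand-in for BayesData; only the regexes come from this page.
module CleanSketch
  module_function

  def date_time(s)
    s.gsub(/(^\d+)|(\s\d+(AM|PM))|(\d{2}\w{3}\d{2})|(\d{2}\:\d{2})|(\d{2,4}\-\d{2,4}-\d{2,4})|(\d{1,4}\/\d{2,4}\/\d{2,4})|(\d+\:\d+)/, 'datetime')
  end

  def money_term(s)
    s.gsub(/(\$\d+\.\d+)|(\$\d+)|(\d+\.\d+)/, 'moneyterm')
  end

  def remove_punct(s)
    s.gsub(/(\,)|(\?)|(\.)|(\!)|(\;)|(\:)|(\")|(\@)|(\#)|(\$)|(\^)|(\&)|(\*)|(\()|(\))|(\_)|(\=)|(\+)|(\[)|(\])|(\\)|(\|)|(\<)|(\>)|(\/)|(\`)|(\{)|(\})/, ' ')
  end

  def remove_big_space(s)
    s.gsub(/\n|\t|\r/, ' ')
  end

  def remove_space_seq(s)
    s.gsub(/\s{2,}/, ' ')
  end

  def remove_cardinal(s)
    s.gsub(/[0-9]{2}[a-z,A-Z]{2}/, '')
  end

  # Same step order as the clean listing above.
  def clean(text)
    remove_cardinal(remove_space_seq(remove_big_space(remove_punct(money_term(date_time(text))))))
  end
end

cleaned = CleanSketch.clean("Paid $60 on 2014/03/05!\nThanks,  Bob")
puts cleaned # prints "Paid moneyterm on datetime Thanks Bob"
```

If remove_punct ran first, "2014/03/05" would already be "2014 03 05" by the time date_time saw it, and the slash-based alternative would no longer match.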
.clean_filtered_index(text) ⇒ Object
Return a filtered word frequency index without extra punctuation or short words.

    # File 'lib/te_rex/bayes_data.rb', line 63
    def clean_filtered_index(text)
      filtered_index clean(text).split
    end
.clean_naive_index(text) ⇒ Object
Return a word frequency index without downcasing, stemming, or stop-list filtering.

    # File 'lib/te_rex/bayes_data.rb', line 68
    def clean_naive_index(text)
      naive_index clean(text).split
    end
.clean_stemmed_filtered_index(text) ⇒ Object
Return a filtered word frequency index with stemmed morphemes and without extra punctuation or short words.

    # File 'lib/te_rex/bayes_data.rb', line 58
    def clean_stemmed_filtered_index(text)
      stemmed_filtered_index clean(text).split
    end
.date_time(s) ⇒ Object
Replace date and time expressions (09MAR04, 02-23-14, 2014/03/05) with the token 'datetime'.

    # File 'lib/te_rex/bayes_data.rb', line 29
    def date_time(s)
      s.gsub(/(^\d+)|(\s\d+(AM|PM))|(\d{2}\w{3}\d{2})|(\d{2}\:\d{2})|(\d{2,4}\-\d{2,4}-\d{2,4})|(\d{1,4}\/\d{2,4}\/\d{2,4})|(\d+\:\d+)/, 'datetime')
    end
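To see what each alternative in the pattern catches, the sketch below applies the same regular expression to a few invented strings. The constant name DATE_TIME and the top-level method are choices made for this illustration only. Two behaviours that follow from the pattern as written are worth noticing: the AM/PM alternative includes the leading whitespace in the match, and any digits at the start of a line are replaced even when they are not a date.

```ruby
# Pattern copied from the date_time listing; DATE_TIME is a name chosen for this sketch.
DATE_TIME = /(^\d+)|(\s\d+(AM|PM))|(\d{2}\w{3}\d{2})|(\d{2}\:\d{2})|(\d{2,4}\-\d{2,4}-\d{2,4})|(\d{1,4}\/\d{2,4}\/\d{2,4})|(\d+\:\d+)/

def date_time(s)
  s.gsub(DATE_TIME, 'datetime')
end

puts date_time('due 09MAR04')    # prints "due datetime"
puts date_time('due 02-23-14')   # prints "due datetime"
puts date_time('due 2014/03/05') # prints "due datetime"
puts date_time('at 10:30')       # prints "at datetime"
puts date_time('at 9AM')         # prints "atdatetime" (the space is part of the match)
puts date_time('42 apples')      # prints "datetime apples" (leading digits match ^\d+)
```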
.index_frequency(text) ⇒ Object
Return a hash index of word => instance count. Each word in the string is interned and mapped to its count in the document.

    # File 'lib/te_rex/bayes_data.rb', line 40
    def index_frequency(text)
      cfi = clean_stemmed_filtered_index(text)  # stemmed, filtered index
      cni = clean_filtered_index(text)          # filtered index, no stemming
      # Hash#merge: for a word present in both, the count from cni is kept.
      cfi.merge(cni)
    end
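Because the two indexes are combined with Hash#merge and no block, a word that appears in both keeps the count from the second hash rather than the sum of the two. The sketch below shows this with invented hash contents, assuming both helpers return plain Hash objects keyed by symbol (the helpers themselves are not shown on this page). The block form at the end is only a possible alternative, not something the library does.

```ruby
# Hypothetical index contents standing in for the two helper results.
stemmed  = { run: 2, jump: 1 }
filtered = { running: 1, run: 1, jump: 1 }

# Same call shape as index_frequency: values from `filtered` win on shared keys.
merged = stemmed.merge(filtered)
# merged == { run: 1, jump: 1, running: 1 }

# Alternative for comparison: a block lets Hash#merge add the counts instead.
summed = stemmed.merge(filtered) { |_word, a, b| a + b }
# summed == { run: 3, jump: 2, running: 1 }

puts merged[:run]
puts summed[:run]
```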
.money_term(s) ⇒ Object
Replace monetary amounts ($60, 120.00, $423.89) with the token 'moneyterm'.

    # File 'lib/te_rex/bayes_data.rb', line 34
    def money_term(s)
      s.gsub(/(\$\d+\.\d+)|(\$\d+)|(\d+\.\d+)/, 'moneyterm')
    end
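A short sketch of the three alternatives in this pattern, using invented inputs. MONEY is a constant name chosen for the illustration. Note that the third alternative matches any decimal number, so values that are not money are replaced as well, while a bare integer without "$" is left alone.

```ruby
# Pattern copied from the money_term listing; MONEY is a name chosen for this sketch.
MONEY = /(\$\d+\.\d+)|(\$\d+)|(\d+\.\d+)/

def money_term(s)
  s.gsub(MONEY, 'moneyterm')
end

puts money_term('cost $60')     # prints "cost moneyterm"
puts money_term('cost 120.00')  # prints "cost moneyterm"
puts money_term('cost $423.89') # prints "cost moneyterm"
puts money_term('pi is 3.14')   # prints "pi is moneyterm"
puts money_term('cost 60')      # prints "cost 60"
```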
.remove_big_space(s) ⇒ Object
Replace tabs, newlines, and carriage returns with a space.

    # File 'lib/te_rex/bayes_data.rb', line 14
    def remove_big_space(s)
      s.gsub(/\n|\t|\r/, ' ')
    end
.remove_cardinal(s) ⇒ Object
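Remove ordinal terms (1st, 23rd, 42nd).

    # File 'lib/te_rex/bayes_data.rb', line 24
    def remove_cardinal(s)
      # The pattern needs two digits followed by two characters from
      # [a-zA-Z,], so "23rd" and "42nd" match but "1st" does not.
      s.gsub(/[0-9]{2}[a-z,A-Z]{2}/, '')
    end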
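The sketch below compares the pattern from the listing with a stricter one on a few invented strings. ORIGINAL is the regex as shown above; ORDINAL is a hypothetical alternative written for this illustration, not something taken from TeRex, and it assumes English ordinal suffixes only.

```ruby
# Pattern copied from the remove_cardinal listing.
ORIGINAL = /[0-9]{2}[a-z,A-Z]{2}/
# Hypothetical stricter pattern: digits followed by an English ordinal suffix.
ORDINAL = /\b\d+(?:st|nd|rd|th)\b/i

p 'the 23rd and 42nd'.gsub(ORIGINAL, '') # => "the  and "
p '1st place'.gsub(ORIGINAL, '')         # => "1st place" (only one digit, no match)
p 'room 12ab'.gsub(ORIGINAL, '')         # => "room " (not an ordinal, but it matches)

p '1st place'.gsub(ORDINAL, '')          # => " place"
p 'room 12ab'.gsub(ORDINAL, '')          # => "room 12ab"
```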
.remove_punct(s) ⇒ Object
Replace explicit punctuation characters with a space.

    # File 'lib/te_rex/bayes_data.rb', line 9
    def remove_punct(s)
      s.gsub(/(\,)|(\?)|(\.)|(\!)|(\;)|(\:)|(\")|(\@)|(\#)|(\$)|(\^)|(\&)|(\*)|(\()|(\))|(\_)|(\=)|(\+)|(\[)|(\])|(\\)|(\|)|(\<)|(\>)|(\/)|(\`)|(\{)|(\})/, ' ')
    end
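The explicit list leaves a few characters untouched, notably the apostrophe, the hyphen, and the percent sign. The sketch below, with invented inputs and a constant name (PUNCT) chosen for the illustration, shows which characters are replaced and which survive.

```ruby
# Pattern copied from the remove_punct listing; PUNCT is a name chosen for this sketch.
PUNCT = /(\,)|(\?)|(\.)|(\!)|(\;)|(\:)|(\")|(\@)|(\#)|(\$)|(\^)|(\&)|(\*)|(\()|(\))|(\_)|(\=)|(\+)|(\[)|(\])|(\\)|(\|)|(\<)|(\>)|(\/)|(\`)|(\{)|(\})/

def remove_punct(s)
  s.gsub(PUNCT, ' ')
end

p remove_punct("don't stop-now!")  # => "don't stop-now "
p remove_punct('user@example.com') # => "user example com"
p remove_punct('50% off')          # => "50% off"
```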
.remove_space_seq(s) ⇒ Object
Collapse runs of two or more whitespace characters into a single space.

    # File 'lib/te_rex/bayes_data.rb', line 19
    def remove_space_seq(s)
      s.gsub(/\s{2,}/, ' ')
    end
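The two whitespace helpers complement each other: remove_space_seq only acts on runs of two or more whitespace characters, so a single newline or tab passes through it unchanged unless remove_big_space has converted it first. A minimal sketch with the same patterns, defined here as top-level methods purely for illustration:

```ruby
# Patterns copied from the remove_big_space and remove_space_seq listings.
def remove_big_space(s)
  s.gsub(/\n|\t|\r/, ' ')
end

def remove_space_seq(s)
  s.gsub(/\s{2,}/, ' ')
end

p remove_space_seq("a\nb")                       # => "a\nb" (single newline is kept)
p remove_space_seq("a\n\nb")                     # => "a b"
p remove_big_space("a\tb\r\nc")                  # => "a b  c"
p remove_space_seq(remove_big_space("a\tb\r\nc")) # => "a b c"
```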