Class: Splitta::Frag
- Inherits: Object
  - Object
    - Splitta::Frag
- Includes: WordTokenizer
- Defined in: lib/splitta/frag.rb
Constant Summary
Constants included from WordTokenizer
WordTokenizer::TOKENIZE_REGEXPS
Instance Attribute Summary
- #last_word ⇒ Object (readonly)
  Returns the value of attribute last_word.
- #next_word ⇒ Object (readonly)
  Returns the value of attribute next_word.
- #orig ⇒ Object (readonly)
  Returns the value of attribute orig.
- #pred ⇒ Object
  Returns the value of attribute pred.
Instance Method Summary
- #features(model) ⇒ Object
  …
- #initialize(orig, previous_frag: nil) ⇒ Frag (constructor)
  A new instance of Frag.
- #over?(threshold) ⇒ Boolean
Methods included from WordTokenizer
Constructor Details
#initialize(orig, previous_frag: nil) ⇒ Frag
Returns a new instance of Frag.
    # File 'lib/splitta/frag.rb', line 12
    def initialize(orig, previous_frag: nil)
      words = clean(orig).split
      previous_frag.next_word = words.first if previous_frag
      @orig = orig
      @last_word = words.last
    end
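The constructor links adjacent fragments: building a Frag with `previous_frag:` sets that earlier fragment's `next_word` to the first word of the new one. The chaining can be sketched roughly as follows, using a stand-in class rather than the gem itself. `SketchFrag` and its whitespace-only `clean` are assumptions made so the example runs on its own; the real `clean` comes from WordTokenizer and is not shown on this page.

```ruby
# Stand-in for Splitta::Frag that mirrors only the constructor shown above.
# SketchFrag and this simplified `clean` are hypothetical, not the gem's API.
class SketchFrag
  attr_reader :orig, :last_word
  attr_accessor :next_word

  def initialize(orig, previous_frag: nil)
    words = clean(orig).split
    # Tell the preceding fragment which word follows its final word.
    previous_frag.next_word = words.first if previous_frag
    @orig = orig
    @last_word = words.last
  end

  private

  # Assumption: the real WordTokenizer#clean likely does more than strip whitespace.
  def clean(text)
    text.strip
  end
end

first  = SketchFrag.new("He met Dr.")
second = SketchFrag.new("Smith at noon.", previous_frag: first)

puts first.last_word   # Dr.
puts first.next_word   # Smith
puts second.last_word  # noon.
p second.next_word     # nil, since no later fragment has been built yet
```

In this sketch each fragment ends up holding the two words on either side of a candidate boundary: its own `last_word` and the `next_word` supplied by its successor.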
Instance Attribute Details
#last_word ⇒ Object (readonly)
Returns the value of attribute last_word.
    # File 'lib/splitta/frag.rb', line 9
    def last_word
      @last_word
    end
#next_word ⇒ Object
Returns the value of attribute next_word.
    # File 'lib/splitta/frag.rb', line 9
    def next_word
      @next_word
    end
#orig ⇒ Object (readonly)
Returns the value of attribute orig.
    # File 'lib/splitta/frag.rb', line 9
    def orig
      @orig
    end
#pred ⇒ Object
Returns the value of attribute pred.
    # File 'lib/splitta/frag.rb', line 10
    def pred
      @pred
    end
Instance Method Details
#features(model) ⇒ Object
… w1. (sb?) w2 …
Here (sb?) marks the candidate sentence boundary between w1 and w2.
Features, listed roughly in order of importance:
(1) w1: word that includes a period
(2) w2: the next word, if it exists
(3) w1length: number of alphabetic characters in w1
(4) w2cap: true if w2 is capitalized
(5) both: w1 and w2
(6) w1abbr: log count of w1 in training without a final period
(7) w2lower: log count of w2 appearing lowercased in training
(8) w1w2upper: w1 and whether w2 is capitalized
    # File 'lib/splitta/frag.rb', line 30
    def features(model)
      Enumerator.new do |y|
        y << [:w1, w1]
        y << [:w2, w2]
        y << [:both, w1, w2]

        if alphabetic?(w1)
          y << [:w1length, w1length]
          y << [:w1abbr, w1abbr(model)]
        end

        if alphabetic?(w2)
          y << [:w2cap, w2cap]
          y << [:w2lower, w2lower(model)]
          y << [:w1w2upper, w1, w2cap]
        end
      end
    end
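Because `features` returns an Enumerator, the feature tuples are produced on demand and some are emitted only when a word is alphabetic. The sketch below imitates that shape with a free-standing method. `sketch_features`, its `alphabetic` check, and the way `w1length` and `w2cap` are computed are all assumptions for illustration; the model-dependent features (`w1abbr`, `w2lower`) are left out because the model interface is not shown on this page.

```ruby
# Hypothetical stand-in: mirrors the shape of Frag#features, not its real helpers.
def sketch_features(w1, w2)
  # Assumed definition of "alphabetic": letters and periods only.
  alphabetic = ->(word) { word.match?(/\A[[:alpha:].]+\z/) }

  Enumerator.new do |y|
    # Unconditional features: the two words and their pairing.
    y << [:w1, w1]
    y << [:w2, w2]
    y << [:both, w1, w2]

    if alphabetic.call(w1)
      # Count only the ASCII letters in w1, ignoring the period.
      y << [:w1length, w1.count("a-zA-Z")]
    end

    if alphabetic.call(w2)
      # True when the first character of w2 is already uppercase.
      y << [:w2cap, w2[0] == w2[0].upcase]
    end
  end
end

features = sketch_features("Dr.", "Smith")
features.each { |name, *values| puts "#{name}: #{values.inspect}" }
```

A caller that needs only the leading features can use `features.first(2)`, which stops the enumerator block after two tuples have been yielded.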
#over?(threshold) ⇒ Boolean
    # File 'lib/splitta/frag.rb', line 49
    def over?(threshold)
      !!pred && pred > threshold
    end
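The double negation in `over?` matters when no prediction has been assigned yet: with `pred` set to nil the method returns false rather than nil, so the result is always a Boolean. A small sketch of that behaviour follows, using a hypothetical `SketchPrediction` struct in place of Frag and made-up prediction values.

```ruby
# Hypothetical stand-in exposing only `pred` and the `over?` logic shown above.
SketchPrediction = Struct.new(:pred) do
  def over?(threshold)
    # `!!pred` turns a missing prediction (nil) into false instead of nil.
    !!pred && pred > threshold
  end
end

unclassified = SketchPrediction.new(nil)
likely_break = SketchPrediction.new(0.92)
unlikely     = SketchPrediction.new(0.10)

p unclassified.over?(0.5)  # false
p likely_break.over?(0.5)  # true
p unlikely.over?(0.5)      # false
```

The comparison is strict, so a prediction exactly equal to the threshold is not considered over it.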