Class: Splitta::Frag

Inherits:
Object
Includes:
WordTokenizer
Defined in:
lib/splitta/frag.rb

Constant Summary

Constants included from WordTokenizer

WordTokenizer::TOKENIZE_REGEXPS

Instance Attribute Summary

Instance Method Summary

Methods included from WordTokenizer

#tokenize

Constructor Details

#initialize(orig, previous_frag: nil) ⇒ Frag

Returns a new instance of Frag. If previous_frag is given, its next_word is set to the first word of the cleaned orig.



# File 'lib/splitta/frag.rb', line 12

def initialize(orig, previous_frag: nil)
  words = clean(orig).split
  previous_frag.next_word = words.first if previous_frag
  @orig = orig
  @last_word = words.last
end
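
The linking step in the constructor can be illustrated with a minimal stand-in. `DemoFrag` below is a hypothetical class that mirrors the constructor shown above; its `clean` is a placeholder that only strips surrounding whitespace, since the real `clean` is not shown on this page.

```ruby
# Hypothetical stand-in for Splitta::Frag, for illustration only.
class DemoFrag
  attr_reader :orig, :last_word
  attr_accessor :next_word

  def initialize(orig, previous_frag: nil)
    words = clean(orig).split
    # Tell the preceding fragment which word follows its boundary.
    previous_frag.next_word = words.first if previous_frag
    @orig = orig
    @last_word = words.last
  end

  private

  # Assumption: placeholder for the real cleaning step.
  def clean(text)
    text.strip
  end
end

first  = DemoFrag.new("I met Dr.")
second = DemoFrag.new(" Smith yesterday.", previous_frag: first)

puts first.last_word           # Dr.
puts first.next_word           # Smith
puts second.last_word          # yesterday.
puts second.next_word.inspect  # nil
```

Each fragment learns its own last_word at construction time, but its next_word is only filled in once the following fragment is built, so the final fragment in a text keeps a nil next_word.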

Instance Attribute Details

#last_word ⇒ Object (readonly)

Returns the value of attribute last_word.



# File 'lib/splitta/frag.rb', line 9

def last_word
  @last_word
end

#next_word ⇒ Object

Returns the value of attribute next_word.



# File 'lib/splitta/frag.rb', line 9

def next_word
  @next_word
end

#orig ⇒ Object (readonly)

Returns the value of attribute orig.



# File 'lib/splitta/frag.rb', line 9

def orig
  @orig
end

#pred ⇒ Object

Returns the value of attribute pred.



# File 'lib/splitta/frag.rb', line 10

def pred
  @pred
end

Instance Method Details

#features(model) ⇒ Object

Features for a candidate sentence boundary of the form … w1. (sb?) w2 …, listed roughly in order of importance:

(1) w1: word that includes a period
(2) w2: the next word, if it exists
(3) w1length: number of alphabetic characters in w1
(4) w2cap: true if w2 is capitalized
(5) both: w1 and w2
(6) w1abbr: log count of w1 in training without a final period
(7) w2lower: log count of w2 in training as lowercased
(8) w1w2upper: w1 combined with whether w2 is capitalized



# File 'lib/splitta/frag.rb', line 30

def features(model)
  Enumerator.new do |y|
    y << [:w1, w1]
    y << [:w2, w2]
    y << [:both, w1, w2]

    if alphabetic?(w1)
      y << [:w1length, w1length]
      y << [:w1abbr, w1abbr(model)]
    end

    if alphabetic?(w2)
      y << [:w2cap, w2cap]
      y << [:w2lower, w2lower(model)]
      y << [:w1w2upper, w1, w2cap]
    end
  end
end
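
To make the shape of the yielded tuples concrete, here is a self-contained sketch of the same Enumerator pattern. `DemoFeatures`, `DemoModel`, `log_count`, and the definitions of `alphabetic?`, capitalization, and the log counts are all assumptions made for illustration; the real helpers in Splitta::Frag are not shown on this page and may differ.

```ruby
# Hypothetical model: two count tables keyed by word (an assumption).
DemoModel = Struct.new(:non_abbrs, :lower_words)

# Hypothetical stand-in that yields the same feature names as #features.
class DemoFeatures
  def initialize(w1, w2, model)
    @w1 = w1
    @w2 = w2
    @model = model
  end

  def features
    Enumerator.new do |y|
      y << [:w1, @w1]
      y << [:w2, @w2]
      y << [:both, @w1, @w2]

      if alphabetic?(@w1)
        y << [:w1length, @w1.count("a-zA-Z")]
        y << [:w1abbr, log_count(@model.non_abbrs, @w1.chomp("."))]
      end

      if alphabetic?(@w2)
        cap = @w2.match?(/\A[A-Z]/)
        y << [:w2cap, cap]
        y << [:w2lower, log_count(@model.lower_words, @w2.downcase)]
        y << [:w1w2upper, @w1, cap]
      end
    end
  end

  private

  # Assumption: a word is "alphabetic" if it contains an ASCII letter.
  def alphabetic?(word)
    !word.nil? && word.match?(/[a-zA-Z]/)
  end

  # Assumption: counts are smoothed with +1 and truncated to an integer.
  def log_count(counts, word)
    Math.log(1 + counts.fetch(word, 0)).to_i
  end
end

model = DemoModel.new({}, { "smith" => 7 })
feats = DemoFeatures.new("Dr.", "Smith", model).features

puts feats.first(3).inspect  # [[:w1, "Dr."], [:w2, "Smith"], [:both, "Dr.", "Smith"]]
puts feats.count             # 8
```

Because the method returns an Enumerator, callers can take only the first few features or stream all of them without building an intermediate array. When w2 is nil (the last fragment of a text), the w2-dependent features are skipped in this sketch.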

#over?(threshold) ⇒ Boolean

Returns:

  • (Boolean) true if pred is set and is greater than threshold


# File 'lib/splitta/frag.rb', line 49

def over?(threshold)
  !!pred && pred > threshold
end
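
The `!!pred` guard means a fragment whose pred has not been set is never over the threshold, instead of raising NoMethodError on nil. A short sketch of that behavior, using a hypothetical `DemoPred` struct that exposes only pred and the same over? body:

```ruby
# Hypothetical stand-in exposing only pred and over?.
DemoPred = Struct.new(:pred) do
  def over?(threshold)
    !!pred && pred > threshold
  end
end

puts DemoPred.new(nil).over?(0.5)  # false
puts DemoPred.new(0.9).over?(0.5)  # true
puts DemoPred.new(0.5).over?(0.5)  # false
```

The comparison is strict, so a pred exactly equal to the threshold does not count as over it.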