Class: Splitta::Frag

Inherits: Object

Includes: WordTokenizer

Defined in: lib/splitta/frag.rb

Constant Summary

Constants included from WordTokenizer

WordTokenizer::TOKENIZE_REGEXPS

Instance Attribute Summary

#last_word ⇒ Object readonly

Returns the value of attribute last_word.

#next_word ⇒ Object readonly

Returns the value of attribute next_word.

#orig ⇒ Object readonly

Returns the value of attribute orig.

#pred ⇒ Object

Returns the value of attribute pred.

Instance Method Summary

#features(model) ⇒ Object

…

#initialize(orig, previous_frag: nil) ⇒ Frag constructor

A new instance of Frag.

#over?(threshold) ⇒ Boolean

Methods included from WordTokenizer

#tokenize

Constructor Details

#initialize(orig, previous_frag: nil) ⇒ `Frag`

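Returns a new instance of Frag built from orig. When previous_frag is given, its next_word is set to the first word of the cleaned orig, which links the two fragments.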

# File 'lib/splitta/frag.rb', line 12

def initialize(orig, previous_frag: nil)
  words = clean(orig).split
  previous_frag.next_word = words.first if previous_frag
  @orig = orig
  @last_word = words.last
end
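
To illustrate how previous_frag links consecutive fragments, here is a minimal sketch. `DemoFrag` is a hypothetical stand-in, not the gem's class: it skips the `clean` step (whose behavior is not shown on this page) and exposes a public `next_word` writer, which is an assumption.

```ruby
# Hypothetical stand-in for Splitta::Frag; not the gem's implementation.
class DemoFrag
  attr_reader :orig, :last_word
  attr_accessor :next_word # assumed writable here so the sketch is self-contained

  def initialize(orig, previous_frag: nil)
    words = orig.split # the real constructor splits clean(orig) instead
    # Back-fill the previous fragment's lookahead with this fragment's first word.
    previous_frag.next_word = words.first if previous_frag
    @orig = orig
    @last_word = words.last
  end
end

# Build a chain by passing each fragment into the next constructor call.
pieces = ["Mr.", "Smith arrived.", "He sat down."]
frags = []
pieces.each do |piece|
  frags << DemoFrag.new(piece, previous_frag: frags.last)
end

frags.each { |f| puts "#{f.last_word.inspect} -> #{f.next_word.inspect}" }
```

In this sketch the final fragment keeps a nil next_word, because no later fragment exists to fill it in.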

Instance Attribute Details

#last_word ⇒ `Object` (readonly)

Returns the value of attribute last_word.

# File 'lib/splitta/frag.rb', line 9

def last_word
  @last_word
end

#next_word ⇒ `Object`

Returns the value of attribute next_word.

# File 'lib/splitta/frag.rb', line 9

def next_word
  @next_word
end

#orig ⇒ `Object` (readonly)

Returns the value of attribute orig.

# File 'lib/splitta/frag.rb', line 9

def orig
  @orig
end

#pred ⇒ `Object`

Returns the value of attribute pred.

# File 'lib/splitta/frag.rb', line 10

def pred
  @pred
end

Instance Method Details

#features(model) ⇒ `Object`

… w1. (sb?) w2 …

Features, listed roughly in order of importance:

(1) w1: word that includes a period
(2) w2: the next word, if it exists
(3) w1length: number of alphabetic characters in w1
(4) w2cap: true if w2 is capitalized
(5) both: w1 and w2
(6) w1abbr: log count of w1 in training without a final period
(7) w2lower: log count of w2 in training as lowercased
(8) w1w2upper: w1 together with whether w2 is capitalized

# File 'lib/splitta/frag.rb', line 30

def features(model)
  Enumerator.new do |y|
    y << [:w1, w1]
    y << [:w2, w2]
    y << [:both, w1, w2]

    if alphabetic?(w1)
      y << [:w1length, w1length]
      y << [:w1abbr, w1abbr(model)]
    end

    if alphabetic?(w2)
      y << [:w2cap, w2cap]
      y << [:w2lower, w2lower(model)]
      y << [:w1w2upper, w1, w2cap]
    end
  end
end
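
As a rough illustration of the return value's shape, the sketch below builds a similar Enumerator of `[name, *values]` arrays and consumes it. `demo_features` is a hypothetical helper, not part of Splitta. The regular expressions standing in for `alphabetic?` and `w2cap`, and the way names and values are joined into string keys, are assumptions made for the example.

```ruby
# Hypothetical feature source shaped like Frag#features: an Enumerator that
# yields [name, *values] arrays. Not the gem's implementation.
def demo_features(w1, w2)
  Enumerator.new do |y|
    y << [:w1, w1]
    y << [:w2, w2]
    y << [:both, w1, w2]
    # Conditional features appear only when the guard holds, in the spirit of
    # the alphabetic? checks above (the regex itself is an assumption).
    if w2.match?(/\A[[:alpha:]]+\z/)
      y << [:w2cap, w2.match?(/\A[[:upper:]]/)]
    end
  end
end

features = demo_features("Mr.", "Smith")

# Flatten each feature into a single string key (an illustrative format only).
keys = features.map { |name, *values| "#{name}_#{values.join('_')}" }
puts keys

# The Enumerator is evaluated on demand, so a caller can stop early.
first_two = demo_features("Mr.", "3").first(2)
p first_two
```

Returning an Enumerator rather than an Array lets the caller decide whether to materialize every feature or stop partway through.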

#over?(threshold) ⇒ `Boolean`

Returns:

(Boolean) true if pred is set and strictly greater than threshold, false otherwise

# File 'lib/splitta/frag.rb', line 49

def over?(threshold)
  !!pred && pred > threshold
end
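
The following sketch exercises the same expression in isolation to show its edge cases. `DemoPred` is a hypothetical struct used only for illustration, and the scores are made-up values.

```ruby
# Hypothetical holder for a prediction score; mirrors only the over? logic.
DemoPred = Struct.new(:pred) do
  def over?(threshold)
    # !!pred guards against comparing nil with a number.
    !!pred && pred > threshold
  end
end

unscored = DemoPred.new(nil)  # no prediction assigned yet
scored   = DemoPred.new(0.73) # made-up score

puts unscored.over?(0.5) # nil pred never counts as over the threshold
puts scored.over?(0.5)   # 0.73 > 0.5
puts scored.over?(0.73)  # the comparison is strict, so equality is not "over"
```

The double negation ensures the method returns false, rather than nil, when no prediction has been assigned.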
