Module: RMMSeg::Algorithm

Included in:
ComplexAlgorithm, SimpleAlgorithm
Defined in:
lib/rmmseg/algorithm.rb

Overview

An algorithm segments a piece of text into an array of words. This module provides the common operations shared by SimpleAlgorithm and ComplexAlgorithm.

Constant Summary

NONWORD_CHAR_RE =

Matches a single character that cannot be part of a basic latin word (a non-word character such as whitespace or punctuation).

/^\W$/
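
As a quick check of what the pattern matches, the following minimal sketch runs it against a few single characters. The constant is redefined locally so the snippet runs without the library loaded; nonword? is a hypothetical helper and the sample characters are illustrative only.

```ruby
# Local copy of the pattern so the snippet runs without rmmseg loaded.
NONWORD_CHAR_RE = /^\W$/

# Hypothetical helper mirroring nonword_char?, normalized to true/false.
def nonword?(char)
  !(NONWORD_CHAR_RE =~ char).nil?
end

samples = ["a", "Z", "7", "_", " ", ",", "-"]
samples.each { |c| puts "#{c.inspect}: #{nonword?(c)}" }
```

Letters, digits and the underscore count as word characters; spaces and punctuation do not.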

Instance Method Details

#basic_latin?(char) ⇒ Boolean

Determine whether a character is a basic latin character.

Returns:

  • (Boolean)


# File 'lib/rmmseg/algorithm.rb', line 127

def basic_latin?(char)
  char.length == 1
end
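
Note that String#length counts characters rather than bytes on Ruby 1.9 and later, so a length check of this kind only separates single-byte from multi-byte characters where length is byte-based (as in Ruby 1.8). A minimal sketch of a byte-based check for UTF-8 input follows; single_byte? is a hypothetical helper name, not part of the library.

```ruby
# Hypothetical helper: true when the character occupies exactly one byte,
# which for UTF-8 input means it lies in the ASCII (basic latin) range.
def single_byte?(char)
  char.bytesize == 1
end

puts single_byte?("a")       # ASCII letter
puts single_byte?("\u4e2d")  # a multi-byte CJK character
```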

#find_match_words(index) ⇒ Object

Find all words occurring in the dictionary starting from index. The maximum word length is determined by Config.max_word_length.



# File 'lib/rmmseg/algorithm.rb', line 89

def find_match_words(index)
  # Return the cached result if this index was looked up recently.
  for i, w in @match_cache
    if i == index
      return w
    end
  end
  
  dic = Dictionary.instance
  str = String.new
  strlen = 0
  words = Array.new
  i = index

  while i < @chars.length               &&
      !basic_latin?(@chars[i])          &&
      strlen < Config.max_word_length
    
    str << @chars[i]
    strlen += 1
    
    if dic.has_word?(str)
      words << dic.get_word(str)
    end
    i += 1
  end

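  # No dictionary match: fall back to a single unrecognized character.
  if words.empty?
    words << Word.new(@chars[index], Word::TYPES[:unrecognized])
  end

  # Store the result in the fixed-size cache, wrapping around when full.
  @match_cache[@match_cache_idx] = [index, words]
  @match_cache_idx += 1
  @match_cache_idx = 0 if @match_cache_idx == MATCH_CACHE_MAX_LENGTH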

  words
end
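
The prefix lookup can be illustrated in isolation. The sketch below is a simplified stand-in, not the library's code: it uses a plain Set as a hypothetical dictionary, a hypothetical max_len parameter in place of Config.max_word_length, and returns strings instead of Word objects. The match cache and the basic latin check are omitted.

```ruby
require 'set'

# Collect every dictionary word that starts at +index+, growing the
# candidate one character at a time up to +max_len+ characters.
def prefix_matches(chars, index, dict, max_len)
  str = ""
  words = []
  i = index
  while i < chars.length && str.length < max_len
    str += chars[i]                       # builds a new string each step
    words << str if dict.include?(str)
    i += 1
  end
  # Fall back to the single character when nothing matched.
  words << chars[index] if words.empty?
  words
end

dict = Set.new(["ab", "abc", "bcd"])      # illustrative entries
chars = "abcd".chars
p prefix_matches(chars, 0, dict, 4)
p prefix_matches(chars, 3, dict, 4)
```

Every matching prefix is kept rather than only the longest, so a caller can weigh the alternative segmentations against each other.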

#get_basic_latin_word ⇒ Object

Skip surrounding whitespace and punctuation to extract a basic latin word.



# File 'lib/rmmseg/algorithm.rb', line 56

def get_basic_latin_word
  start_pos = nil
  end_pos = nil
  
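  i = @index
  # Skip leading non-word characters (whitespace, punctuation).
  while i < @chars.length     &&
      basic_latin?(@chars[i]) &&
      nonword_char?(@chars[i])
    i += 1
  end

  # Consume the word itself.
  start_pos = @byte_index + i - @index
  while i < @chars.length && basic_latin?(@chars[i])
    break if nonword_char?(@chars[i])
    i += 1
  end

  # Skip trailing non-word characters.
  end_pos = @byte_index + i - @index
  while i < @chars.length      &&
      basic_latin?(@chars[i])  &&
      nonword_char?(@chars[i])
    i += 1
  end

  # Advance both the byte offset and the character index past the scanned span.
  @byte_index += i - @index
  @index = i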
  
  return @token.new(@text[start_pos...end_pos], start_pos, end_pos)
end
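
The three scanning phases (skip leading non-word characters, consume the word, skip trailing non-word characters) can be sketched on a plain ASCII string. This simplified version is an illustration only: scan_latin_word is a hypothetical name, it works on character indexes rather than the separate byte index the method above maintains, and it returns a plain array instead of a token object.

```ruby
NONWORD = /^\W$/  # same pattern as NONWORD_CHAR_RE

# Returns [word, start_pos, end_pos, next_index] for the word at or after +pos+.
def scan_latin_word(text, pos)
  chars = text.chars
  i = pos
  # Phase 1: skip leading whitespace and punctuation.
  i += 1 while i < chars.length && NONWORD =~ chars[i]
  start_pos = i
  # Phase 2: consume word characters.
  i += 1 while i < chars.length && NONWORD !~ chars[i]
  end_pos = i
  # Phase 3: skip trailing whitespace and punctuation.
  i += 1 while i < chars.length && NONWORD =~ chars[i]
  [text[start_pos...end_pos], start_pos, end_pos, i]
end

p scan_latin_word("  hello, world", 0)
p scan_latin_word("  hello, world", 9)
```

When the input holds only non-word characters, start_pos equals end_pos and the word is empty, which corresponds to the empty-token case that next_token skips.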

#initialize(text, token = Token) ⇒ Object

Initialize a new instance of Algorithm; the text will then be segmented by this instance. token is the class used to construct the result tokens.



# File 'lib/rmmseg/algorithm.rb', line 15

def initialize(text, token=Token)
  @text = text
  @chars = text.each_char
  @index = 0
  @byte_index = 0
  @token = token
end

#next_token ⇒ Object

Get the next recognized Token, or nil once the text is exhausted. Empty tokens are skipped.



# File 'lib/rmmseg/algorithm.rb', line 24

def next_token
  return nil if @index >= @chars.length

  if basic_latin?(@chars[@index])
    token = get_basic_latin_word
  else
    token = get_cjk_word
  end

  if token.start == token.end # empty
    return next_token
  else
    return token
  end
end

#nonword_char?(char) ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/rmmseg/algorithm.rb', line 134

def nonword_char?(char)
  NONWORD_CHAR_RE =~ char
end

#segment ⇒ Object

Segment the string in text into an array of words.



# File 'lib/rmmseg/algorithm.rb', line 42

def segment
  words = Array.new

  token = next_token
  until token.nil?
    words << token.text
    token = next_token
  end

  words
end
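
The loop relies only on next_token returning nil once the input is exhausted. A minimal sketch of that contract follows, using a whitespace-splitting stand-in; StubTokenizer and StubToken are invented for illustration and are not part of the library.

```ruby
# Hypothetical token type carrying the text plus its start and end offsets.
StubToken = Struct.new(:text, :start_pos, :end_pos)

# Stand-in tokenizer: hands out whitespace-separated words one at a time.
class StubTokenizer
  def initialize(text)
    @text = text
    @pos = 0
  end

  # Returns the next StubToken, or nil when no words remain.
  def next_token
    m = /\S+/.match(@text, @pos)
    return nil if m.nil?
    @pos = m.end(0)
    StubToken.new(m[0], m.begin(0), m.end(0))
  end

  # Same shape as the segment method above: drain next_token into an array.
  def segment
    words = []
    while (token = next_token)
      words << token.text
    end
    words
  end
end

p StubTokenizer.new("one two  three").segment
```

Because segment only depends on next_token, any class that implements next_token with this nil-terminated contract can reuse the same loop, which is how the module is shared by SimpleAlgorithm and ComplexAlgorithm.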