Module: RMMSeg::Algorithm

Included in: ComplexAlgorithm, SimpleAlgorithm

Defined in: lib/rmmseg/algorithm.rb

Overview

An algorithm segments a piece of text into an array of words. This module provides the common operations shared by SimpleAlgorithm and ComplexAlgorithm.

Constant Summary

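NONWORD_CHAR_RE =

Used to determine whether a character can be part of a basic latin word. The pattern matches a single non-word character, that is, one that cannot be part of such a word.

/^\W$/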
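
As a quick illustration of what the pattern accepts, the sketch below applies the same regular expression to a few single-character strings. The constant is redefined locally so the snippet runs without the library; the sample characters are arbitrary.

```ruby
# Same pattern as the constant above, redefined locally for illustration.
NONWORD_CHAR_RE = /^\W$/

samples = ["a", "Z", "7", "_", " ", ",", "!"]
samples.each do |ch|
  # =~ returns a match position (truthy) or nil (falsy).
  kind = NONWORD_CHAR_RE =~ ch ? "non-word" : "word"
  puts "#{ch.inspect}: #{kind}"
end
```

Letters, digits, and the underscore count as word characters; whitespace and punctuation do not.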

Instance Method Summary

#basic_latin?(char) ⇒ Boolean

Determine whether a character is a basic latin character.
#find_match_words(index) ⇒ Object

Find all words occurring in the dictionary starting from index.
#get_basic_latin_word ⇒ Object

Skip whitespace and punctuation to extract a basic latin word.
#initialize(text, token = Token) ⇒ Object

Initialize a new instance of Algorithm; the text will then be segmented by this instance.
#next_token ⇒ Object

Get the next Token recognized.
#nonword_char?(char) ⇒ Boolean
#segment ⇒ Object

Segment the string in text into an array of words.

Instance Method Details

#basic_latin?(char) ⇒ `Boolean`

Determine whether a character is a basic latin character.

Returns:

(Boolean)

# File 'lib/rmmseg/algorithm.rb', line 127

def basic_latin?(char)
  char.length == 1
end
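
The length test above only separates single-byte characters from multibyte ones if String#length counts bytes, which was the behaviour in Ruby 1.8 (that this code targets Ruby 1.8 is an assumption). On Ruby 1.9 and later, String#length counts characters, so a rough equivalent would compare the byte size instead. The helper name below is hypothetical and is not part of RMMSeg.

```ruby
# Hypothetical helper, not part of RMMSeg: treat a character as
# basic latin when it is encoded as a single byte in UTF-8.
def single_byte_char?(char)
  char.bytesize == 1
end

puts single_byte_char?("a")       # ASCII letter: one byte
puts single_byte_char?("\u4e2d")  # CJK character: three bytes in UTF-8
```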

#find_match_words(index) ⇒ `Object`

Find all words occurring in the dictionary starting from index. The maximum word length is determined by Config.max_word_length.

# File 'lib/rmmseg/algorithm.rb', line 89

def find_match_words(index)
  for i, w in @match_cache
    if i == index
      return w
    end
  end
  
  dic = Dictionary.instance
  str = String.new
  strlen = 0
  words = Array.new
  i = index

  while i < @chars.length               &&
      !basic_latin?(@chars[i])          &&
      strlen < Config.max_word_length
    
    str << @chars[i]
    strlen += 1
    
    if dic.has_word?(str)
      words << dic.get_word(str)
    end
    i += 1
  end

  if words.empty?
    words << Word.new(@chars[index], Word::TYPES[:unrecognized])
  end

  @match_cache[@match_cache_idx] = [index, words]
  @match_cache_idx += 1
  @match_cache_idx = 0 if @match_cache_idx == MATCH_CACHE_MAX_LENGTH

  words
end
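
The core of the lookup is a prefix scan: grow a candidate string one character at a time and keep every candidate that the dictionary knows. The sketch below shows only that step, under stated assumptions: DICTIONARY and MAX_WORD_LENGTH are hypothetical stand-ins for Dictionary.instance and Config.max_word_length, plain strings replace Word objects, and the match cache and the basic latin stop condition are left out.

```ruby
require "set"

# Hypothetical stand-ins for the real dictionary and configuration.
DICTIONARY = Set.new([
  "\u7814\u7a76",        # "research"
  "\u7814\u7a76\u751f",  # "graduate student"
  "\u751f\u547d"         # "life"
])
MAX_WORD_LENGTH = 4

# Collect every dictionary word that starts at chars[index].
def prefix_matches(chars, index)
  words = []
  (1..MAX_WORD_LENGTH).each do |len|
    break if index + len > chars.length
    candidate = chars[index, len].join
    words << candidate if DICTIONARY.include?(candidate)
  end
  # Fall back to the single character when nothing matched,
  # mirroring the "unrecognized" word in the method above.
  words << chars[index] if words.empty?
  words
end

chars = "\u7814\u7a76\u751f\u547d".chars
p prefix_matches(chars, 0)  # two candidates share this starting position
p prefix_matches(chars, 1)  # no dictionary word starts here
```

Returning all candidates, rather than only the longest one, leaves the choice between them to the including algorithm.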

#get_basic_latin_word ⇒ `Object`

Skip whitespace and punctuation to extract a basic latin word.

# File 'lib/rmmseg/algorithm.rb', line 56

def get_basic_latin_word
  start_pos = nil
  end_pos = nil
  
  i = @index
  while i < @chars.length     &&
      basic_latin?(@chars[i]) &&
      nonword_char?(@chars[i])
    i += 1
  end

  start_pos = @byte_index + i - @index
  while i < @chars.length && basic_latin?(@chars[i])
    break if nonword_char?(@chars[i])
    i += 1
  end

  end_pos = @byte_index + i - @index
  while i < @chars.length      &&
      basic_latin?(@chars[i])  &&
      nonword_char?(@chars[i])
    i += 1
  end

  @byte_index += i - @index
  @index = i
  
  return @token.new(@text[start_pos...end_pos], start_pos, end_pos)
end
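
The method above works in three passes: skip leading non-word characters, consume the word, then skip trailing non-word characters so the next call starts cleanly. A minimal standalone sketch of the same scan follows. It assumes ASCII-only input, so character offsets and byte offsets coincide, and the name scan_latin_word and its array return value are hypothetical.

```ruby
NONWORD = /^\W$/

# Returns [word, start_pos, end_pos, next_index] for the first word
# found at or after `index` in an ASCII-only string.
def scan_latin_word(text, index)
  chars = text.chars
  i = index
  # 1. Skip leading whitespace and punctuation.
  i += 1 while i < chars.length && NONWORD =~ chars[i]
  start_pos = i
  # 2. Consume word characters.
  i += 1 while i < chars.length && NONWORD !~ chars[i]
  end_pos = i
  # 3. Skip trailing whitespace and punctuation.
  i += 1 while i < chars.length && NONWORD =~ chars[i]
  [text[start_pos...end_pos], start_pos, end_pos, i]
end

word, start_pos, end_pos, next_index = scan_latin_word("  hello, world", 0)
puts "#{word} [#{start_pos}, #{end_pos}) next=#{next_index}"
```

When the input holds only whitespace and punctuation, start_pos equals end_pos and the word is empty, which is the case next_token checks for.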

#initialize(text, token = Token) ⇒ `Object`

Initialize a new instance of Algorithm; the text will then be segmented by this instance. token is the class that will be used to construct the result tokens.

# File 'lib/rmmseg/algorithm.rb', line 15

def initialize(text, token=Token)
  @text = text
  @chars = text.each_char
  @index = 0
  @byte_index = 0
  @token = token
end

#next_token ⇒ `Object`

Get the next Token recognized.

# File 'lib/rmmseg/algorithm.rb', line 24

def next_token
  return nil if @index >= @chars.length

  if basic_latin?(@chars[@index])
    token = get_basic_latin_word
  else
    token = get_cjk_word
  end

  if token.start == token.end # empty
    return next_token
  else
    return token
  end
end

#nonword_char?(char) ⇒ `Boolean`

Returns:

(Boolean)

# File 'lib/rmmseg/algorithm.rb', line 134

def nonword_char?(char)
  NONWORD_CHAR_RE =~ char
end

#segment ⇒ `Object`

Segment the string in text into an array of words.

# File 'lib/rmmseg/algorithm.rb', line 42

def segment
  words = Array.new

  token = next_token
  until token.nil?
    words << token.text
    token = next_token
  end

  words
end
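
The loop above is a plain pull-style iteration: ask for tokens until nil comes back and collect their text. The sketch below reproduces that shape with a hypothetical ListTokenizer class that hands out pre-split words, so it runs without the library; it is not how RMMSeg produces tokens.

```ruby
# Hypothetical token source: hands out pre-split words one at a time
# and returns nil when exhausted, like next_token above.
class ListTokenizer
  Token = Struct.new(:text)

  def initialize(words)
    @words = words
    @index = 0
  end

  def next_token
    return nil if @index >= @words.length
    token = Token.new(@words[@index])
    @index += 1
    token
  end

  # Same loop shape as #segment: pull tokens until nil, keep their text.
  def segment
    words = []
    while (token = next_token)
      words << token.text
    end
    words
  end
end

p ListTokenizer.new(%w[one two three]).segment
```

Because the tokenizer keeps its position in instance state, a second call to segment on the same instance returns an empty array.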
