Class: SRX::English::WordSplitter

Inherits:
Object
Includes:
Enumerable
Defined in:
lib/srx/english/word_splitter.rb

Constant Summary

SPLIT_RULES =
{
  :word => "\\p{Alpha}\\p{Word}*",
  :number => "\\p{Digit}+(?:[:., _/-]\\p{Digit}+)*",
  :punct => "\\p{Punct}",
  :graph => "\\p{Graph}",
  :other => "[^\\p{Word}\\p{Graph}]+"
}
SPLIT_RE =
/#{SPLIT_RULES.values.map{|v| "(#{v})"}.join("|")}/m
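The following is a minimal sketch of how the combined expression can be used on its own. It copies the two constants so that it runs without the gem. The `classify` helper is hypothetical and not part of the library. It relies on the fact that each rule becomes one capture group, so the position of the first non-nil group identifies the rule that matched, and on the rules being tried in the order they are listed.

```ruby
# Rules copied from the constant summary; insertion order is significant
# because the alternation tries :word first and :other last.
SPLIT_RULES = {
  :word   => "\\p{Alpha}\\p{Word}*",
  :number => "\\p{Digit}+(?:[:., _/-]\\p{Digit}+)*",
  :punct  => "\\p{Punct}",
  :graph  => "\\p{Graph}",
  :other  => "[^\\p{Word}\\p{Graph}]+"
}
# Wrap every rule in its own capture group and join them with "|".
SPLIT_RE = /#{SPLIT_RULES.values.map { |v| "(#{v})" }.join("|")}/m

# Hypothetical helper: pairs each token with the name of the rule it matched.
def classify(text)
  text.scan(SPLIT_RE).map do |groups|
    index = groups.index { |group| !group.nil? }
    [groups[index], SPLIT_RULES.keys[index]]
  end
end

classify("A4 costs 1,250.").each do |token, type|
  puts "#{type}: #{token.inspect}"
end
```

Here "A4" is classified as :word because it starts with a letter, "1,250" is a single :number because the comma is an accepted separator, and the trailing full stop is a separate :punct token because no digit follows it.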

Constructor Details

#initialize(sentence = nil) ⇒ WordSplitter

The initializer accepts a sentence, which may be either a Sentence instance or a String.

The splitter may be created without a sentence, but one must then be assigned through the accessor before each is first called; otherwise each raises an error.



# File 'lib/srx/english/word_splitter.rb', line 25

def initialize(sentence=nil)
  @sentence = sentence
end
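
The two ways of supplying the sentence can be sketched as follows. `SplitterSketch` is a hypothetical stand-in, not the library class: it mirrors only the constructor, the accessor and the nil guard, and it tokenizes on whitespace instead of using SPLIT_RE. The existence of a `sentence=` writer is an assumption based on the description above, since this page lists only the reader.

```ruby
# Hypothetical stand-in mirroring the constructor, accessor and nil guard.
class SplitterSketch
  # Assumption: the real class exposes a writer as well as the reader.
  attr_accessor :sentence

  def initialize(sentence = nil)
    @sentence = sentence
  end

  def each
    raise "Invalid argument - sentence is nil" if @sentence.nil?
    # Simplified tokenization (runs of non-whitespace characters only).
    @sentence.scan(/\S+/) { |token| yield token }
  end
end

# Style 1: pass the sentence to the constructor.
eager = SplitterSketch.new("Hello world")
eager_tokens = []
eager.each { |token| eager_tokens << token }

# Style 2: create an empty splitter and assign the sentence later.
lazy = SplitterSketch.new
begin
  lazy.each { |token| token }
rescue RuntimeError => error
  puts "each failed before assignment: #{error.message}"
end
lazy.sentence = "Set afterwards"
lazy_tokens = []
lazy.each { |token| lazy_tokens << token }
```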

Instance Attribute Details

#sentence ⇒ Object

Returns the value of attribute sentence.



# File 'lib/srx/english/word_splitter.rb', line 8

def sentence
  @sentence
end

Instance Method Details

#each ⇒ Object

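This method iterates over the tokens of the sentence. For each token it yields four values: the matched text, its type, and the start and end character offsets of the match within the sentence (the end offset is inclusive). The rules are tried in the order listed, and the type is one of:

  • :word - a regular word, starting with a letter (including words containing digits, like A4)

  • :number - a number (including numbers whose digit groups are joined by a single colon, period, comma, space, underscore, slash or dash)

  • :punct - a single punctuation character (comma, semicolon, full stop, etc.)

  • :graph - any other single graphical (visible) character

  • :other - anything not covered by the above types (non-visible characters such as whitespace in particular)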



# File 'lib/srx/english/word_splitter.rb', line 38

def each
  raise "Invalid argument - sentence is nil" if @sentence.nil?
  @sentence.scan(SPLIT_RE) do |word,number,punct,graph,other|
    start_offset = $~.begin(0)
    end_offset = $~.end(0)-1
    if !word.nil?
      yield word, :word, start_offset, end_offset
    elsif !number.nil?
      yield number, :number, start_offset, end_offset
    elsif !punct.nil?
      yield punct, :punct, start_offset, end_offset
    elsif !graph.nil?
      yield graph, :graph, start_offset, end_offset
    else
      yield other, :other, start_offset, end_offset
    end
  end
end
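
As a rough illustration of the yielded values, the sketch below reproduces the listing in a hypothetical `OffsetSplitterSketch` class so that it runs without the gem. The class name and the condensed body of `each` are assumptions made for the example; with the gem installed, SRX::English::WordSplitter would take its place. Because the class includes Enumerable, `to_a` collects each set of four yielded values into one array.

```ruby
# Hypothetical stand-in reproducing the documented rules and #each.
class OffsetSplitterSketch
  include Enumerable

  RULES = {
    :word   => "\\p{Alpha}\\p{Word}*",
    :number => "\\p{Digit}+(?:[:., _/-]\\p{Digit}+)*",
    :punct  => "\\p{Punct}",
    :graph  => "\\p{Graph}",
    :other  => "[^\\p{Word}\\p{Graph}]+"
  }
  RE = /#{RULES.values.map { |v| "(#{v})" }.join("|")}/m

  def initialize(sentence = nil)
    @sentence = sentence
  end

  def each
    raise "Invalid argument - sentence is nil" if @sentence.nil?
    @sentence.scan(RE) do |groups|
      match = Regexp.last_match # same object as $~ in the listing above
      index = groups.index { |group| !group.nil? }
      # The end offset is inclusive, hence the - 1.
      yield groups[index], RULES.keys[index], match.begin(0), match.end(0) - 1
    end
  end
end

text = "Room A4, 10/12"
tokens = OffsetSplitterSketch.new(text).to_a

tokens.each do |token, type, first, last|
  puts "#{type} #{token.inspect} at #{first}..#{last}"
end

# Keep only the :word tokens.
words = tokens.select { |_token, type, _first, _last| type == :word }.map(&:first)
```

Since both offsets are inclusive, `text[first..last]` returns the token itself, which makes it straightforward to map tokens back to positions in the original sentence.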