Class: Greeb::Segmentator

Inherits:
Object
Defined in:
lib/greeb/segmentator.rb

Overview

Performs simple sentence detection based on Greeb's tokenization.

Constant Summary

SENTENCE_AINT_START =

A sentence does not start with a separator character, a line break character, a punctuation character, or a space.

[:separ, :break, :punct, :spunct, :space]
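As a rough illustration of how such a list can be used, the sketch below checks whether a token type may open a sentence. The helper `may_start_sentence?` is hypothetical and not part of Greeb, and `:letter` is assumed to be the type of a word token.

```ruby
# Token types that cannot open a sentence, copied from the constant above.
SENTENCE_AINT_START = [:separ, :break, :punct, :spunct, :space]

# Hypothetical helper (not part of Greeb): true when a token of the
# given type may begin a sentence.
def may_start_sentence?(type)
  !SENTENCE_AINT_START.include?(type)
end

puts may_start_sentence?(:letter) # prints true
puts may_start_sentence?(:space)  # prints false
```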

Constructor Details

#initialize(tokens) ⇒ Segmentator

Create a new instance of Greeb::Segmentator.

Parameters:

  • tokens (Array<Greeb::Span>)

    tokens produced by Greeb::Tokenizer.


# File 'lib/greeb/segmentator.rb', line 18

def initialize(tokens)
  @tokens = tokens
end

Instance Attribute Details

#tokens ⇒ Object (readonly)

Returns the value of attribute tokens.


# File 'lib/greeb/segmentator.rb', line 12

def tokens
  @tokens
end

Instance Method Details

#extract(sentences, collection = tokens) ⇒ Array<Greeb::Span, Array<Greeb::Span>>

Extract tokens from the set of sentences.

Parameters:

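  • sentences (Array<Greeb::Span>)

    a list of sentences.

  • collection (Array<Greeb::Span>) (defaults to: tokens)

    the tokens to select from.

Returns:

  • (Array<Greeb::Span, Array<Greeb::Span>>)

    pairs of a sentence and the tokens that lie within its bounds.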


# File 'lib/greeb/segmentator.rb', line 45

def extract(sentences, collection = tokens)
  sentences.map do |s|
    [s, collection.select { |t| t.from >= s.from and t.to <= s.to }]
  end
end
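The selection rule can be tried in isolation. The sketch below is a minimal stand-in, not the library itself: `Span` is a plain Struct assumed to mirror the `from`, `to` and `type` fields of Greeb::Span, the `:sentence` and `:letter` types are assumptions, and the offsets for the text "Hi. Bye." are written by hand.

```ruby
# Stand-in for Greeb::Span; assumed to expose from, to and type.
Span = Struct.new(:from, :to, :type)

# Same selection logic as #extract, written as a free-standing method:
# a token belongs to a sentence when it lies fully inside its bounds.
def extract(sentences, collection)
  sentences.map do |s|
    [s, collection.select { |t| t.from >= s.from && t.to <= s.to }]
  end
end

# Hand-written offsets for the text "Hi. Bye."
tokens = [
  Span.new(0, 2, :letter), # "Hi"
  Span.new(2, 3, :punct),  # "."
  Span.new(3, 4, :space),  # " "
  Span.new(4, 7, :letter), # "Bye"
  Span.new(7, 8, :punct)   # "."
]
sentences = [Span.new(0, 3, :sentence), Span.new(4, 8, :sentence)]

result = extract(sentences, tokens)
result.each do |sentence, inner|
  puts "#{sentence.from}...#{sentence.to}: #{inner.map(&:type).inspect}"
end
# prints:
# 0...3: [:letter, :punct]
# 4...8: [:letter, :punct]
```

The space between the two sentences falls outside both spans, so it appears in neither token list.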

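#sentences ⇒ Array<Greeb::Span>

Returns the detected sentences. The result is computed on the first call and memoized.

Returns:

  • (Array<Greeb::Span>)

    a list of sentences.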


# File 'lib/greeb/segmentator.rb', line 26

def sentences
  @sentences ||= detect_spans(new_sentence, [:punct])
end
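The `||=` idiom used here can be sketched on its own. In the example below, `MemoDemo` and `expensive_detection` are hypothetical names standing in for the class and for `detect_spans`, whose body is not shown on this page; the placeholder only counts how often it runs.

```ruby
class MemoDemo
  attr_reader :calls

  def initialize(tokens)
    @tokens = tokens
    @calls = 0
  end

  # Mirrors the ||= pattern of #sentences: compute once, then reuse.
  def sentences
    @sentences ||= expensive_detection
  end

  private

  # Hypothetical placeholder for detect_spans; it only counts invocations
  # and returns a copy of the tokens.
  def expensive_detection
    @calls += 1
    @tokens.dup
  end
end

demo = MemoDemo.new([:a, :b])
first = demo.sentences
second = demo.sentences
puts demo.calls           # prints 1
puts first.equal?(second) # prints true
```

Because an empty array is truthy in Ruby, the cached value is reused even when no spans were found; only a `nil` or `false` result would trigger recomputation.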

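#subsentences ⇒ Array<Greeb::Span>

Returns the detected subsentences. The result is computed on the first call and memoized.

Returns:

  • (Array<Greeb::Span>)

    a list of subsentences.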


# File 'lib/greeb/segmentator.rb', line 34

def subsentences
  @subsentences ||= detect_spans(new_subsentence, [:punct, :spunct])
end
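The only visible difference between #sentences and #subsentences is the list of stop types passed to `detect_spans`. The sketch below illustrates the effect with a deliberately simplified splitter. `split_on` is hypothetical and is not the library's `detect_spans`; it assumes that `:punct` marks sentence-final punctuation and `:spunct` marks in-sentence punctuation such as a comma, and `Span` is a plain Struct stand-in for Greeb::Span.

```ruby
# Stand-in for Greeb::Span; assumed to expose from, to and type.
Span = Struct.new(:from, :to, :type)

# Types that may not open a span, taken from SENTENCE_AINT_START.
NON_STARTERS = [:separ, :break, :punct, :spunct, :space]

# Hypothetical simplification: open a span at the first token that may
# start one and close it after the next token whose type is a stop type.
def split_on(tokens, stop_types)
  spans = []
  start = nil
  tokens.each do |t|
    next if start.nil? && NON_STARTERS.include?(t.type)
    start = t.from if start.nil?
    if stop_types.include?(t.type)
      spans << [start, t.to]
      start = nil
    end
  end
  # Keep trailing text that has no closing punctuation.
  spans << [start, tokens.last.to] if start
  spans
end

# Hand-written offsets for the text "Hi, you. Bye."
tokens = [
  Span.new(0, 2, :letter),   # "Hi"
  Span.new(2, 3, :spunct),   # ","
  Span.new(3, 4, :space),    # " "
  Span.new(4, 7, :letter),   # "you"
  Span.new(7, 8, :punct),    # "."
  Span.new(8, 9, :space),    # " "
  Span.new(9, 12, :letter),  # "Bye"
  Span.new(12, 13, :punct)   # "."
]

p split_on(tokens, [:punct])           # => [[0, 8], [9, 13]]
p split_on(tokens, [:punct, :spunct])  # => [[0, 3], [4, 8], [9, 13]]
```

Adding `:spunct` to the stop list splits the first sentence at the comma, which is the behaviour the name "subsentence" suggests.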