Class: Treat::Workers::Processors::Segmenters::Stanford

Inherits:
Object
  • Object
Defined in:
lib/treat/workers/processors/segmenters/stanford.rb

Overview

Detects sentence boundaries by first tokenizing the text and then deciding whether each period ends a sentence or serves another purpose (abbreviations, etc.). The resulting tokens are then grouped into sentences.
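
The tokenize-then-group idea described above can be illustrated with a deliberately simplified sketch. This is not the Stanford algorithm: the `ABBREVIATIONS` list and the `naive_segment` helper are hypothetical names introduced here for illustration only, and a real segmenter relies on far richer rules.

```ruby
require 'set'

# Hypothetical abbreviation list, for illustration only.
ABBREVIATIONS = Set.new(%w[mr mrs dr prof etc vs]).freeze

# Split on whitespace, then close a sentence at each token that ends in
# '.', '!' or '?', unless the period belongs to a known abbreviation.
def naive_segment(text)
  sentences = []
  current = []
  text.split.each do |token|
    current << token
    next unless token.match?(/[.!?]\z/)
    stem = token.sub(/[.!?]+\z/, '').downcase
    next if token.end_with?('.') && ABBREVIATIONS.include?(stem)
    sentences << current.join(' ')
    current = []
  end
  # Keep any trailing tokens that had no closing punctuation.
  sentences << current.join(' ') unless current.empty?
  sentences
end

naive_segment("Dr. Smith arrived. He was late!")
# => ["Dr. Smith arrived.", "He was late!"]
```

The period after "Dr" is kept inside the first sentence because its stem appears in the abbreviation list, while the period after "arrived" closes the sentence.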

Constant Summary

DefaultOptions =
{
  :also_tokenize => false
}
@@segmenter =
nil

Keeps a single copy of the Stanford CoreNLP pipeline, shared across calls.
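
The class variable above caches one pipeline so that the costly load happens only once. A minimal sketch of that memoization pattern follows; `PipelineHolder`, `pipeline`, and `loads` are hypothetical names, and `Object.new` merely stands in for the expensive load.

```ruby
# Hypothetical holder class illustrating the `||=` caching idiom.
class PipelineHolder
  @@pipeline = nil
  @@loads = 0

  # Return the cached object, building it on the first call only.
  def self.pipeline
    @@pipeline ||= begin
      @@loads += 1
      Object.new # stand-in for an expensive pipeline load
    end
  end

  # Number of times the stand-in load has actually run.
  def self.loads
    @@loads
  end
end

first  = PipelineHolder.pipeline
second = PipelineHolder.pipeline
first.equal?(second) # => true
PipelineHolder.loads # => 1
```

Note that `||=` on a class variable is not thread-safe on its own; concurrent first calls could each trigger a load.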

Class Method Summary

Class Method Details

.segment(entity, options = {}) ⇒ Object

Segments sentences using the sentence splitter supplied by Stanford CoreNLP. The pipeline already tokenizes the text in order to split it, so for better performance set the :also_tokenize option to true: the segmenter then also adds the tokens as children of each sentence, which avoids a separate tokenization pass.

Options:

  • (Boolean) :also_tokenize - Whether to also add the tokens as children of each sentence. Defaults to false.
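
Caller-supplied options are merged over the defaults, so any key the caller omits keeps its default value. A small sketch of that behavior, where `DEFAULT_OPTIONS` mirrors the `DefaultOptions` constant and `resolve_options` is a hypothetical helper, not part of the library:

```ruby
# Mirrors the documented defaults; the constant name is illustrative.
DEFAULT_OPTIONS = { :also_tokenize => false }.freeze

# Hypothetical helper: caller options override the defaults.
def resolve_options(options = {})
  DEFAULT_OPTIONS.merge(options)
end

resolve_options                             # => {:also_tokenize=>false}
resolve_options({ :also_tokenize => true }) # => {:also_tokenize=>true}
```

`Hash#merge` returns a new hash, so the frozen defaults are never modified.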



# File 'lib/treat/workers/processors/segmenters/stanford.rb', line 24

def self.segment(entity, options = {})

  Treat::Loaders::Stanford.load

  options = DefaultOptions.merge(options)
  entity.check_hasnt_children

  @@segmenter ||=
  ::StanfordCoreNLP.load(:tokenize, :ssplit)
  
  text = ::StanfordCoreNLP::Annotation.new(entity.to_s)

  @@segmenter.annotate(text)
  text.get(:sentences).each do |annotation|
    # Keep the annotation itself: the tokens are read from it below,
    # and a String would not respond to #get.
    sentence = Treat::Entities::Sentence.
    from_string(annotation.to_s, true)
    entity << sentence
    if options[:also_tokenize]
      Treat::Workers::Processors::Tokenizers::Stanford.
      add_tokens(sentence, annotation.get(:tokens))
    end
  end

end