Class: Treat::Workers::Processors::Segmenters::Stanford
- Inherits:
-
Object
- Object
- Treat::Workers::Processors::Segmenters::Stanford
- Defined in:
- lib/treat/workers/processors/segmenters/stanford.rb
Overview
Detects sentence boundaries by first tokenizing the text, then deciding whether each period ends a sentence or serves another purpose (abbreviations, etc.). The resulting tokens are then grouped into sentences.
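The tokenize-then-group idea described above can be illustrated with a toy sketch in plain Ruby. This is not the Stanford implementation: the abbreviation list, the whitespace tokenizer, and the names `toy_tokenize`, `sentence_end?`, and `toy_segment` are all assumptions made up for this illustration.

```ruby
# Toy illustration only: NOT the Stanford CoreNLP algorithm.
# Hypothetical abbreviation list; a real system uses far richer rules.
ABBREVIATIONS = %w[mr. mrs. dr. etc. e.g. i.e.].freeze

# Split text into whitespace-delimited tokens.
def toy_tokenize(text)
  text.split
end

# A token ends a sentence if it finishes with ., ! or ?
# and is not a known abbreviation.
def sentence_end?(token)
  token.match?(/[.!?]\z/) && !ABBREVIATIONS.include?(token.downcase)
end

# Group tokens into sentences, closing a sentence at each boundary token.
def toy_segment(text)
  sentences = []
  current = []
  toy_tokenize(text).each do |token|
    current << token
    next unless sentence_end?(token)

    sentences << current.join(' ')
    current = []
  end
  # Keep any trailing tokens that never reached a boundary.
  sentences << current.join(' ') unless current.empty?
  sentences
end

toy_result = toy_segment('Dr. Smith arrived late. He apologized.')
# toy_result holds two sentences; the period in "Dr." is not treated as a boundary.
```

The point of the sketch is only the order of operations: tokens are produced first, each candidate boundary is classified, and sentences are assembled from the token stream.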
Constant Summary
- DefaultOptions =
{ :also_tokenize => false }
- @@segmenter =
Keep one copy of the Stanford Core NLP pipeline.
nil
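The two constants above follow common Ruby patterns: caller options are merged over a hash of defaults, and an expensive object is kept in a class variable so it is built only once. The sketch below shows both in isolation. `FakePipeline` and `ToySegmenter` are hypothetical stand-ins, not part of Treat or the Stanford bindings.

```ruby
DEFAULT_OPTIONS = { :also_tokenize => false }.freeze

# Hypothetical stand-in for an expensive pipeline; counts how often it is built.
class FakePipeline
  @builds = 0
  class << self
    attr_accessor :builds
  end

  def initialize
    self.class.builds += 1
  end
end

class ToySegmenter
  # Keep one copy of the pipeline, as the documented class does.
  @@pipeline = nil

  # Build the pipeline on first use and reuse it afterwards.
  def self.pipeline
    @@pipeline ||= FakePipeline.new
  end

  # Caller-supplied keys override the defaults; missing keys fall back.
  def self.resolve(options = {})
    DEFAULT_OPTIONS.merge(options)
  end
end

first_pipeline  = ToySegmenter.pipeline
second_pipeline = ToySegmenter.pipeline
defaults_only   = ToySegmenter.resolve
overridden      = ToySegmenter.resolve({ :also_tokenize => true })
```

Because `Hash#merge` returns a new hash, the defaults themselves are never modified by a call.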
Class Method Summary
-
.segment(entity, options = {}) ⇒ Object
Segment sentences using the sentence splitter supplied by the Stanford parser.
Class Method Details
.segment(entity, options = {}) ⇒ Object
Segment sentences using the sentence splitter supplied by the Stanford parser. For better performance, set the option :also_tokenize to true, and this segmenter will also add the tokens as children of the sentences.
Options:
- (Boolean) :also_tokenize - Whether to also add the tokens as children of the sentence.
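To make the effect of :also_tokenize concrete, here is a minimal sketch of the resulting tree shape using a made-up node type. `ToyNode` and `toy_build` are hypothetical and much simpler than Treat's real entity classes, and the whitespace split stands in for the Stanford tokenizer.

```ruby
# Hypothetical minimal node type; Treat's real entities are richer.
ToyNode = Struct.new(:type, :value, :children) do
  # Append a child and return self so calls can be chained.
  def <<(child)
    children << child
    self
  end
end

# Build a document node with one child per sentence. When also_tokenize
# is true, each sentence additionally receives its tokens as children.
def toy_build(sentences, also_tokenize: false)
  doc = ToyNode.new(:document, nil, [])
  sentences.each do |text|
    sentence = ToyNode.new(:sentence, text, [])
    doc << sentence
    next unless also_tokenize

    # Whitespace split is an assumption standing in for real tokenization.
    text.split.each { |tok| sentence << ToyNode.new(:token, tok, []) }
  end
  doc
end

plain_doc     = toy_build(['He left.'])
tokenized_doc = toy_build(['He left.'], also_tokenize: true)
```

With the option off, sentence nodes are leaves; with it on, each sentence already carries its tokens, so a separate tokenization pass over the same text is not needed.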
# File 'lib/treat/workers/processors/segmenters/stanford.rb', line 24
def self.segment(entity, options = {})
  Treat::Loaders::Stanford.load
  options = DefaultOptions.merge(options)
  entity.check_hasnt_children
  # Build the tokenize + ssplit pipeline once and reuse it.
  @@segmenter ||= ::StanfordCoreNLP.load(:tokenize, :ssplit)
  s = entity.to_s
  text = ::StanfordCoreNLP::Annotation.new(s)
  @@segmenter.annotate(text)
  text.get(:sentences).each do |sentence|
    # Keep the annotation in `sentence`: it is still needed below
    # to read the tokens, so only its string form is passed here.
    s = Treat::Entities::Sentence.from_string(sentence.to_s, true)
    entity << s
    if options[:also_tokenize]
      Treat::Workers::Processors::Tokenizers::Stanford.
        add_tokens(s, sentence.get(:tokens))
    end
  end
end