Class: Treat::Workers::Processors::Tokenizers::Stanford

Inherits:
Object
Defined in:
lib/treat/workers/processors/tokenizers/stanford.rb

Overview

Tokenization provided by the Stanford Penn Treebank (PTB) style tokenizer. Most punctuation is split from adjoining words; verb contractions and the Anglo-Saxon genitive of nouns are split into their component morphemes, and each morpheme is tagged separately. N.B. Contrary to standard PTB tokenization, double quotes (") are NOT changed to doubled single forward and backward quotes (`` and '') by default.
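
To make the splitting rules above concrete, here is a toy illustration in plain Ruby. It is not the Stanford tokenizer and covers only a handful of cases; the method name `toy_ptb_tokenize` and its regular expressions are assumptions made up for this sketch.

```ruby
# Toy illustration of PTB-style splitting. NOT the Stanford tokenizer:
# it only handles a few punctuation marks, "n't", and common clitics.
def toy_ptb_tokenize(text)
  text
    .gsub(/([.,!?;:()])/, ' \1 ')                   # split punctuation from adjoining words
    .gsub(/(\w)(n't)\b/i, '\1 \2')                  # doesn't -> does n't
    .gsub(/(\w)('(?:s|re|ve|ll|d|m))\b/i, '\1 \2')  # John's -> John 's
    .split
end

p toy_ptb_tokenize("John's dog doesn't bite.")
# prints ["John", "'s", "dog", "does", "n't", "bite", "."]
```

The real tokenizer handles many more cases (quotes, brackets, abbreviations), so treat this only as a picture of what "split into component morphemes" means.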

Constant Summary

DefaultOptions =

Default options for the tokenizer.

{
  directional_quotes: false,
  escape_characters: false
}
@@tokenizer =

Hold one instance of the tokenizer.

nil
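
The defaults are combined with caller-supplied options through `Hash#merge`, so any key the caller passes overrides the default and any key left out keeps it. A minimal self-contained sketch of that behavior follows; `DEFAULT_OPTIONS` and `resolve_options` are hypothetical names used only here, mirroring the values listed for `DefaultOptions`.

```ruby
# Stand-in for DefaultOptions, with the same keys and values.
DEFAULT_OPTIONS = {
  directional_quotes: false,
  escape_characters: false
}.freeze

# Hypothetical helper: caller options win over the defaults.
def resolve_options(options = {})
  DEFAULT_OPTIONS.merge(options)
end

p resolve_options(directional_quotes: true)
# prints {:directional_quotes=>true, :escape_characters=>false} (exact format varies by Ruby version)
```

Because `merge` returns a new hash, the frozen defaults are never modified.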

Class Method Summary

Class Method Details

.add_tokens(entity, tokens, options) ⇒ Object

Add the tokens to the entity.



# File 'lib/treat/workers/processors/tokenizers/stanford.rb', line 37

def self.add_tokens(entity, tokens, options)
  tokens.each do |token|
    val = token.value
    unless options[:escape_characters]
      Treat.tags.ptb.escape_characters.each do |key, value|
        val.gsub!(value, key)
      end
    end
    unless options[:directional_quotes]
      val.gsub!(/``/, '"')
      val.gsub!(/''/, '"')
    end
    entity << Treat::Entities::Token.from_string(val)
  end
end
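
The per-token clean-up performed by `add_tokens` can be tried in isolation with the sketch below. It assumes an escape table that maps each original character to its PTB escape sequence; the two entries in `SAMPLE_ESCAPES` are illustrative only, and the actual contents of `Treat.tags.ptb.escape_characters` may differ. The helper name `normalize_token` is hypothetical.

```ruby
# Hypothetical escape table: original character => PTB escape sequence.
# The real table is Treat.tags.ptb.escape_characters and may differ.
SAMPLE_ESCAPES = { '(' => '-LRB-', ')' => '-RRB-' }.freeze

# Applies the same two clean-up steps as add_tokens to a single string.
def normalize_token(value, options, escapes = SAMPLE_ESCAPES)
  val = value.dup # work on a copy so the caller's string is untouched
  unless options[:escape_characters]
    # Replace each escape sequence with the character it stands for.
    escapes.each { |char, escaped| val.gsub!(escaped, char) }
  end
  unless options[:directional_quotes]
    # Collapse PTB directional quotes back to plain double quotes.
    val.gsub!(/``/, '"')
    val.gsub!(/''/, '"')
  end
  val
end

p normalize_token('-LRB-', {})                       # prints "("
p normalize_token('``', {})                          # prints "\""
p normalize_token('``', { directional_quotes: true }) # prints "``"
```

With both options left off, the output stays close to the original text; turning an option on keeps the tokenizer's own representation for that feature.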

.tokenize(entity, options = {}) ⇒ Object

Perform tokenization of the entity and add the resulting tokens as its children.

Options:

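  • (Boolean) :directional_quotes => Whether to attempt to get correct directional quotes, replacing "…" by ``…''. Off by default.
  • (Boolean) :escape_characters => Whether to keep the tokenizer's escaped values as-is. When off, each value in Treat.tags.ptb.escape_characters is replaced by its key. Off by default.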



# File 'lib/treat/workers/processors/tokenizers/stanford.rb', line 26

def self.tokenize(entity, options = {})
  Treat::Loaders::Stanford.load
  options = DefaultOptions.merge(options)
  @@tokenizer ||= StanfordCoreNLP.load(:tokenize)
  entity.check_hasnt_children
  text = ::StanfordCoreNLP::Annotation.new(entity.to_s)
  @@tokenizer.annotate(text)
  add_tokens(entity, text.get(:tokens), options)
end
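
The `@@tokenizer ||= ...` line means the pipeline is loaded on the first call only and reused afterwards. The sketch below shows that lazy-initialization pattern without any Stanford dependency; `LazyTokenizerHolder`, `load_pipeline`, and the whitespace-splitting lambda are hypothetical stand-ins, not part of Treat or StanfordCoreNLP.

```ruby
class LazyTokenizerHolder
  @@tokenizer = nil
  @@load_count = 0

  # Stand-in for an expensive load such as StanfordCoreNLP.load(:tokenize).
  def self.load_pipeline
    @@load_count += 1
    ->(text) { text.split } # trivial whitespace "tokenizer" for the demo
  end

  def self.tokenize(text)
    @@tokenizer ||= load_pipeline # loads on first call, reuses afterwards
    @@tokenizer.call(text)
  end

  def self.load_count
    @@load_count
  end
end

LazyTokenizerHolder.tokenize('a b')
LazyTokenizerHolder.tokenize('c d')
p LazyTokenizerHolder.load_count # prints 1
```

Note that `||=` is not atomic, so if several threads could make the first call at once, the load might run more than once unless it is guarded by a lock.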