Class: Treat::Workers::Processors::Tokenizers::PTB

Inherits:
Object
Defined in:
lib/treat/workers/processors/tokenizers/ptb.rb

Overview

Tokenization based on the tokenizer developed by Robert MacIntyre in 1995 for the Penn Treebank project. This tokenizer mostly follows the conventions used by the Penn Treebank. N.B. Contrary to the standard PTB tokenization, double quotes (") are NOT changed to doubled single forward and backward quotes (`` and '') by default.

Authors: Utiyama Masao ([email protected]). License: Ruby License.

Constant Summary

Default options for the tokenizer.

DefaultOptions = {
  directional_quotes: false
}
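The way these defaults combine with caller-supplied options can be sketched as follows. This is a minimal illustration, assuming the same `Hash#merge` call that `.tokenize` uses; `SKETCH_DEFAULTS` is a hypothetical stand-in for `DefaultOptions`, not part of the library.

```ruby
# Hypothetical stand-in for DefaultOptions (name is an assumption).
SKETCH_DEFAULTS = { directional_quotes: false }.freeze

# Hash#merge returns a new hash; keys in the argument override the receiver.
with_defaults = SKETCH_DEFAULTS.merge({})
with_override = SKETCH_DEFAULTS.merge(directional_quotes: true)

puts with_defaults[:directional_quotes]
puts with_override[:directional_quotes]
```

Because `merge` does not mutate its receiver, the defaults stay unchanged between calls.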

Class Method Details

.split(string, options) ⇒ Object



# File 'lib/treat/workers/processors/tokenizers/ptb.rb', line 41

def self.split(string, options)

  s = " " + string + " "

  s.gsub!(/‘/,"'")
  s.gsub!(/’/,"'")
  s.gsub!(/“/,"``")
  s.gsub!(/”/,"''")

  s.gsub!(/\s+/," ")
  s.gsub!(/(\s+)''/,'\1"')
  s.gsub!(/(\s+)``/,'\1"')
  s.gsub!(/''(\s+)/,'"\1')
  s.gsub!(/``(\s+)/,'"\1')
  s.gsub!(/ (['`]+)([^0-9].+) /,' \1 \2 ')
  s.gsub!(/([ (\[{<])"/,'\1 `` ')
  s.gsub!(/\.\.\./,' ... ')
  s.gsub!(/[,;:@\#$%&]/,' \& ')
  s.gsub!(/([^.])([.])([\])}>"']*)[ 	]*$/,'\1 \2\3 ')
  s.gsub!(/[?!]/,' \& ')
  s.gsub!(/[\]\[(){}<>]/,' \& ')
  s.gsub!(/--/,' -- ')
  s.sub!(/$/,' ')
  s.sub!(/^/,' ')
  s.gsub!(/"/,' \'\' ')
  s.gsub!(/([^'])' /,'\1 \' ')
  s.gsub!(/'([sSmMdD]) /,' \'\1 ')
  s.gsub!(/'ll /,' \'ll ')
  s.gsub!(/'re /,' \'re ')
  s.gsub!(/'ve /,' \'ve ')
  s.gsub!(/n't /,' n\'t ')
  s.gsub!(/'LL /,' \'LL ')
  s.gsub!(/'RE /,' \'RE ')
  s.gsub!(/'VE /,' \'VE ')
  s.gsub!(/N'T /,' N\'T ')
  s.gsub!(/ ([Cc])annot /,' \1an not ')
  s.gsub!(/ ([Dd])'ye /,' \1\' ye ')
  s.gsub!(/ ([Gg])imme /,' \1im me ')
  s.gsub!(/ ([Gg])onna /,' \1on na ')
  s.gsub!(/ ([Gg])otta /,' \1ot ta ')
  s.gsub!(/ ([Ll])emme /,' \1em me ')
  s.gsub!(/ ([Mm])ore'n /,' \1ore \'n ')
  s.gsub!(/ '([Tt])is /,' \'\1 is ')
  s.gsub!(/ '([Tt])was /,' \'\1 was ')
  s.gsub!(/ ([Ww])anna /,' \1an na ')
  while s.sub!(/(\s)([0-9]+) , ([0-9]+)(\s)/, '\1\2,\3\4'); end
  s.gsub!(/\//, ' / ')
  s.gsub!(/\s+/,' ')
  s.strip!
  
  # Remove directional quotes.
  unless options[:directional_quotes]
    s.gsub!(/``/,'"')
    s.gsub!(/''/,'"')
  end

  s.split(/\s+/)
end
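To make the rule-by-rule approach of `.split` easier to follow, here is a reduced sketch that applies only a handful of the substitutions. `mini_ptb_split` is a hypothetical name, and the rule subset (common punctuation, sentence-final period, `?`/`!`, `n't`, and `'s`-style clitics) is chosen purely for illustration; it is not a replacement for the full method.

```ruby
# Hypothetical reduced version of the substitution pipeline, for illustration.
def mini_ptb_split(string)
  s = " " + string + " "                        # pad so rules can rely on spaces
  s = s.gsub(/[,;:]/, ' \& ')                   # isolate common punctuation
  s = s.gsub(/([^.])([.])\s*$/, '\1 \2 ')       # detach a sentence-final period
  s = s.gsub(/[?!]/, ' \& ')                    # isolate ? and !
  s = s.gsub(/n't /, " n't ")                   # split negative contractions
  s = s.gsub(/'([sSmMdD]) /, " '\\1 ")          # split 's, 'm, 'd clitics
  s.strip.split(/\s+/)                          # collapse whitespace into tokens
end

p mini_ptb_split("I can't go, it's late.")
# → ["I", "ca", "n't", "go", ",", "it", "'s", "late", "."]
```

Each rule only inserts spaces around the material it targets; the final whitespace split is what actually produces the tokens.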

.tokenize(entity, options = {}) ⇒ Object

Perform tokenization of the entity and add the resulting tokens as its children.

Options:

  • (Boolean) => :directional_quotes whether to replace double quotes by `` and '' or not.
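The effect of the option can be sketched in isolation. The helper below mirrors the final quote-handling step of `.split` under the assumption that directional quotes have already been produced by earlier rules; `normalize_quotes` is a hypothetical name used only here.

```ruby
# Hypothetical helper mirroring the last step of .split.
# When directional quotes are not requested, `` and '' are folded back to ".
def normalize_quotes(text, directional_quotes)
  return text if directional_quotes
  text.gsub(/``/, '"').gsub(/''/, '"')
end

puts normalize_quotes("`` Hi '' he said", false)
puts normalize_quotes("`` Hi '' he said", true)
```

With the default (`false`), both directional forms collapse to a plain double quote; with `true`, the PTB-style forms are kept.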



# File 'lib/treat/workers/processors/tokenizers/ptb.rb', line 25

def self.tokenize(entity, options = {})
  options = DefaultOptions.merge(options)
  entity.check_hasnt_children
  if entity.has_children?
    raise Treat::Exception,
    "Cannot tokenize an #{entity.class} " +
    "that already has children."
  end
  chunks = split(entity.to_s, options)
  chunks.each do |chunk|
    next if chunk =~ /([[:space:]]+)/
    entity << Treat::Entities::Token.
    from_string(chunk)
  end
end
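The overall flow of `.tokenize` (refuse entities that already have children, split the text, append one child per chunk) can be sketched without the Treat library. Everything here is an assumption made for illustration: `SketchEntity` is a hypothetical stand-in for a Treat entity, plain strings stand in for `Treat::Entities::Token`, and a whitespace split stands in for the PTB rules.

```ruby
# Hypothetical stand-in for a Treat entity: holds text and child tokens.
class SketchEntity
  attr_reader :children

  def initialize(text)
    @text = text
    @children = []
  end

  def to_s
    @text
  end

  def has_children?
    !@children.empty?
  end

  def <<(child)
    @children << child
    self
  end
end

# Sketch of the tokenize flow; the whitespace split replaces the PTB rules.
def sketch_tokenize(entity)
  if entity.has_children?
    raise ArgumentError, "Cannot tokenize a #{entity.class} that already has children."
  end
  entity.to_s.split(/\s+/).each do |chunk|
    next if chunk.empty?        # skip the empty chunk left by leading whitespace
    entity << chunk
  end
  entity
end

entity = SketchEntity.new(" a b")
sketch_tokenize(entity)
p entity.children
```

Calling `sketch_tokenize` a second time on the same entity raises, which matches the guard in the method above.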