Class: Treat::Workers::Processors::Segmenters::Punkt

Inherits:
Object
Defined in:
lib/treat/workers/processors/segmenters/punkt.rb

Overview

Sentence segmentation based on a set of log-likelihood-based heuristics that infer abbreviations and common sentence starters from a large text corpus. Easily adaptable, but requires a large (unlabeled) in-domain corpus to assemble the statistics.

Original paper: Kiss, Tibor and Strunk, Jan. 2006. Unsupervised Multilingual Sentence Boundary Detection. Computational Linguistics 32:485-525.
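
To give a feel for the kind of statistic such heuristics rest on, here is a minimal sketch of the generic log-likelihood ratio (G²) over a 2x2 contingency table. It is not the modified variant described in the paper, and not code from the Punkt gem; the function name and the meaning assigned to the four counts are assumptions made for illustration.

```ruby
# Generic log-likelihood ratio (G^2) for a 2x2 contingency table.
# Assumed layout (illustrative only):
#   k11: candidate token followed by a period
#   k12: candidate token not followed by a period
#   k21: other tokens followed by a period
#   k22: other tokens not followed by a period
def log_likelihood_ratio(k11, k12, k21, k22)
  observed = [[k11, k12], [k21, k22]]
  total = (k11 + k12 + k21 + k22).to_f
  row_sums = observed.map { |row| row.sum }
  col_sums = observed.transpose.map { |col| col.sum }

  g2 = 0.0
  observed.each_with_index do |row, i|
    row.each_with_index do |count, j|
      next if count.zero? # a zero cell contributes nothing to the sum
      expected = row_sums[i] * col_sums[j] / total
      g2 += count * Math.log(count / expected)
    end
  end
  2.0 * g2
end

# A token that almost always precedes a period scores high.
puts log_likelihood_ratio(50, 1, 100, 1000)
```

A score near zero means the token and the period co-occur about as often as chance predicts; a large score suggests the period belongs to the token, as it does for an abbreviation.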

Constant Summary

@@segmenters = {}

Hold one copy of the segmenter per language.

@@trainers = {}

Hold only one trainer per language.

Class Method Summary

Class Method Details

.segment(entity, options = {}) ⇒ Object

Segment a text using the Punkt segmenter gem. The models included with this segmenter have each been trained on one or two lengthy books in the corresponding language.

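Options:

(String) :training_text => Text to train on.
(String) :model => Path to a YAML model file to load instead of the default model for the language (read by .set_options).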



# File 'lib/treat/workers/processors/segmenters/punkt.rb', line 32

def self.segment(entity, options = {})
  entity.check_hasnt_children

  lang = entity.language
  set_options(lang, options)

  s = entity.to_s

  # Replace the point in all floating-point numbers
  # by ^^; this is a fix since Punkt trips on decimal
  # numbers.
  s.escape_floats!

  # Take out suspension points temporarily.
  s.gsub!('...', '&;&.')
  # Escape the periods in dotted abbreviations (e.g. "U.S.")
  # so they are not mistaken for sentence boundaries.
  s.scan(/(?:[A-Za-z]\.){2,}/).each do |abbr|
    s.gsub!(abbr, abbr.gsub(' ', '').gsub('.', '&-&'))
  end
  # Unstick sentences from each other.
  s.gsub!(/([^\.\?!]\.|\!|\?)([^\s"'])/) { $1 + ' ' + $2 }

  result = @@segmenters[lang].sentences_from_text(
    s, :output => :sentences_text
  )

  result.each do |sentence|
    # Unescape the sentence.
    sentence.unescape_floats!
    # Repair abbreviations in sentences.
    sentence.gsub!('&-&', '.')
    # Repair suspension points.
    sentence.gsub!('&;&.', '...')
    entity << Treat::Entities::Phrase.from_string(sentence)
  end
end
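
The masking and unmasking steps in the method above can be tried in isolation. The following is a minimal sketch in plain Ruby that leaves out Treat, Punkt and the float escaping; the helper names `protect` and `restore` and the two constants are hypothetical, while the placeholder strings and regular expressions are copied from the listing.

```ruby
# Placeholder strings copied from the listing above.
ELLIPSIS_MARK = '&;&.'
ABBR_DOT_MARK = '&-&'

# Hide periods that do not end a sentence, then separate
# sentences that are stuck together.
def protect(text)
  s = text.gsub('...', ELLIPSIS_MARK)
  # Rewrite dotted abbreviations such as "U.S." to "U&-&S&-&".
  s = s.gsub(/(?:[A-Za-z]\.){2,}/) { |abbr| abbr.gsub('.', ABBR_DOT_MARK) }
  # Insert a space where one sentence runs directly into the next.
  s.gsub(/([^\.\?!]\.|\!|\?)([^\s"'])/) { "#{$1} #{$2}" }
end

# Undo the placeholders once the text has been split.
def restore(sentence)
  sentence.gsub(ABBR_DOT_MARK, '.').gsub(ELLIPSIS_MARK, '...')
end

text = 'He left the U.S. in May.She stayed... for now.'
masked = protect(text)
puts masked
puts restore(masked)
```

Note that the ellipsis placeholder keeps a trailing period, so the segmenter can still treat suspension points as a possible sentence end, whereas the abbreviation placeholder contains no period at all.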

.set_options(lang, options) ⇒ Object



# File 'lib/treat/workers/processors/segmenters/punkt.rb', line 73

def self.set_options(lang, options)
  # Reuse the cached segmenter for this language, if any.
  return @@segmenters[lang] if @@segmenters[lang]

  if options[:model]
    model = options[:model]
  else
    # Fall back to the default model file for the language.
    model_path = Treat.libraries.punkt.model_path ||
                 Treat.paths.models + 'punkt/'
    model = model_path + "#{lang}.yaml"
    unless File.readable?(model)
      raise Treat::Exception,
            "Could not get the language model " +
            "for the Punkt segmenter for #{lang.to_s.capitalize}."
    end
  end

  t = ::YAML.load(File.read(model))

  @@segmenters[lang] = ::Punkt::SentenceTokenizer.new(t)
end
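
The caching pattern used by this method (load a model once per language, then reuse it) can be sketched on its own. The `ModelCache` class below is a hypothetical stand-in written for illustration, not part of Treat; it stores the parsed YAML rather than a `Punkt::SentenceTokenizer`, and it uses `YAML.safe_load` on a toy model file, which is an assumption that would not necessarily suit real Punkt model files.

```ruby
require 'yaml'
require 'tmpdir'

# Hypothetical per-language cache mirroring the memoization above.
class ModelCache
  attr_reader :loads

  def initialize(model_dir)
    @model_dir = model_dir
    @models = {}
    @loads = 0
  end

  # Return the cached model for lang, loading it on first use.
  def fetch(lang, model: nil)
    return @models[lang] if @models.key?(lang)

    path = model || File.join(@model_dir, "#{lang}.yaml")
    unless File.readable?(path)
      raise ArgumentError, "No readable model for #{lang} at #{path}"
    end

    @loads += 1
    @models[lang] = YAML.safe_load(File.read(path))
  end
end

Dir.mktmpdir do |dir|
  # Toy model file; real Punkt models have a different structure.
  File.write(File.join(dir, 'english.yaml'), { 'abbreviations' => %w[dr mr] }.to_yaml)
  cache = ModelCache.new(dir)
  first = cache.fetch(:english)
  second = cache.fetch(:english)
  puts first.equal?(second) # same object on the second call
  puts cache.loads          # the file was read only once
end
```

As in the original method, the early return means a model passed for a language that is already cached is ignored, so a custom model has to be supplied on the first call for that language.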