Class: Treat::Workers::Processors::Segmenters::Tactful

Inherits:
Object
Defined in:
lib/treat/workers/processors/segmenters/tactful.rb

Overview

Sentence segmentation based on a Naive Bayes statistical model. The model is trained on Wall Street Journal news text combined with the Brown Corpus, which is intended to be broadly representative of written English.

Original paper: Dan Gillick. 2009. "Sentence Boundary Detection and the Problem with the U.S." University of California, Berkeley.

Constant Summary

@@segmenter = nil

Keep only one copy of the segmenter.
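
The "one copy" behaviour comes from Ruby's `||=` idiom applied to a class variable that starts out as nil. The following is a minimal sketch of that pattern only; `ExpensiveModel` and `LazySegmenter` are hypothetical stand-ins and are not part of Treat or the tactful_tokenizer gem.

```ruby
# Hypothetical stand-in for a model that is costly to construct.
class ExpensiveModel
  @@loads = 0

  # Number of times the model has been constructed.
  def self.loads
    @@loads
  end

  def initialize
    @@loads += 1
  end
end

# Hypothetical holder that mirrors the "@@segmenter = nil" class variable.
class LazySegmenter
  @@model = nil

  # Build the model on first use, then keep returning the same instance.
  def self.model
    @@model ||= ExpensiveModel.new
  end
end

first  = LazySegmenter.model
second = LazySegmenter.model
puts first.equal?(second)   # true: both calls return the same object
puts ExpensiveModel.loads   # 1: the constructor ran only once
```

The design trades a small amount of shared state for not reloading the model on every call.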

Class Method Summary

Class Method Details

.segment(entity, options = {}) ⇒ Object

Segment a text or zone into sentences using the ‘tactful_tokenizer’ gem.

Options: none.



# File 'lib/treat/workers/processors/segmenters/tactful.rb', line 24

def self.segment(entity, options = {})

  entity.check_hasnt_children
  
  s = entity.to_s
  s.escape_floats!
  
  # Mask the periods inside dotted abbreviations (e.g. "U.S.") so that
  # they are not mistaken for sentence boundaries.
  s.scan(/(?:[A-Za-z]\.){2,}/).each do |abbr| 
    s.gsub!(abbr, abbr.gsub(' ', '').gsub('.', '&-&'))
  end
  
  # Take out suspension points temporarily.
  s.gsub!('...', '&;&.')
  # Unstick sentences from each other.
  s.gsub!(/([^\.\?!]\.|\!|\?)([^\s"'])/) { $1 + ' ' + $2 }
  
  @@segmenter ||= TactfulTokenizer::Model.new
 
  sentences = @@segmenter.tokenize_text(s)
  
  sentences.each do |sentence|
    sentence.unescape_floats!
    # Repair abbreviations.
    sentence.gsub!('&-&', '.')
    # Repair suspension points.
    sentence.gsub!('&;&.', '...')
    entity << Treat::Entities::Phrase.from_string(sentence)
  end
  
end
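
The masking steps in the listing above can be tried without Treat or the tokenizer. The sketch below reuses the two placeholder tokens and the regular expressions from the listing; the helper names `mask` and `unmask` are hypothetical, the block form of `gsub!` replaces the original `scan` loop, and float escaping and the call to `tokenize_text` are left out.

```ruby
# Placeholder tokens taken from the listing above.
ABBR_DOT = '&-&'
ELLIPSIS = '&;&.'

# Hypothetical helper: hide periods that are not sentence boundaries.
def mask(text)
  s = text.dup
  # Dotted abbreviations such as "U.S." become "U&-&S&-&".
  s.gsub!(/(?:[A-Za-z]\.){2,}/) { |abbr| abbr.gsub('.', ABBR_DOT) }
  # Suspension points "..." become "&;&.".
  s.gsub!('...', ELLIPSIS)
  # Insert a space after sentence-final punctuation glued to the next word.
  s.gsub!(/([^\.\?!]\.|\!|\?)([^\s"'])/) { $1 + ' ' + $2 }
  s
end

# Hypothetical helper: restore the original punctuation.
def unmask(text)
  text.gsub(ABBR_DOT, '.').gsub(ELLIPSIS, '...')
end

sample = 'He moved to the U.S. in 1999.Then he waited...'
masked = mask(sample)
puts masked
# He moved to the U&-&S&-& in 1999. Then he waited&;&.
puts unmask(masked)
# He moved to the U.S. in 1999. Then he waited...
```

After masking, the only period followed by whitespace is the one after "1999", which is the real sentence boundary in this sample. In the listing, the restore step is applied to each sentence returned by the tokenizer rather than to the whole text.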