Class: Treat::Workers::Processors::Segmenters::Tactful
- Inherits:
- Object
  - Object
    - Treat::Workers::Processors::Segmenters::Tactful
- Defined in:
- lib/treat/workers/processors/segmenters/tactful.rb
Overview
Sentence segmentation based on a Naive Bayes statistical model. The model is trained on Wall Street Journal news text combined with the Brown Corpus, a corpus intended to be broadly representative of written English.
Original paper: Dan Gillick. 2009. Sentence Boundary Detection and the Problem with the U.S. University of California, Berkeley.
Constant Summary
- @@segmenter = nil
  Keep only one copy of the segmenter.
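The "one copy" behavior comes from lazy initialization with `||=` on a class variable. The following is a minimal sketch of that pattern only; `ExpensiveModel` and `SegmenterSketch` are hypothetical names used for illustration and are not part of Treat or tactful_tokenizer.

```ruby
# Hypothetical stand-in for a model that is costly to load.
class ExpensiveModel
  @loads = 0
  class << self
    attr_accessor :loads # counts how many times the model was built
  end

  def initialize
    self.class.loads += 1
  end
end

# Hypothetical class showing the memoization pattern.
class SegmenterSketch
  @@model = nil

  # Build the model on first use, then reuse the same instance.
  def self.model
    @@model ||= ExpensiveModel.new
  end
end

SegmenterSketch.model
SegmenterSketch.model
```

Because the variable is shared at class level, every later call reuses the instance built by the first one.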
Class Method Summary
- .segment(entity, options = {}) ⇒ Object
  Segment a text or zone into sentences using the ‘tactful_tokenizer’ gem.
Class Method Details
.segment(entity, options = {}) ⇒ Object
Segment a text or zone into sentences using the ‘tactful_tokenizer’ gem.
Options: none.
# File 'lib/treat/workers/processors/segmenters/tactful.rb', line 24
def self.segment(entity, options = {})
  entity.check_hasnt_children
  s = entity.to_s
  s.escape_floats!
  # Protect the dots in abbreviations (e.g. "U.S.") with a placeholder.
  s.scan(/(?:[A-Za-z]\.){2,}/).each do |abbr|
    s.gsub!(abbr, abbr.gsub(' ', '').gsub('.', '&-&'))
  end
  # Take out suspension points temporarily.
  s.gsub!('...', '&;&.')
  # Unstick sentences from each other.
  s.gsub!(/([^\.\?!]\.|\!|\?)([^\s"'])/) { $1 + ' ' + $2 }
  # Load the model once and reuse it on later calls.
  @@segmenter ||= TactfulTokenizer::Model.new
  sentences = @@segmenter.tokenize_text(s)
  sentences.each do |sentence|
    sentence.unescape_floats!
    # Repair abbreviations.
    sentence.gsub!('&-&', '.')
    # Repair suspension points.
    sentence.gsub!('&;&.', '...')
    entity << Treat::Entities::Phrase.from_string(sentence)
  end
end
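The method above follows a mask, segment, unmask pattern: it replaces ambiguous periods with placeholders, runs the segmenter, then restores the original characters. The sketch below reproduces only that pre- and post-processing in plain Ruby. It makes several assumptions: the helper names `mask`, `unmask`, and `segment_sketch` are hypothetical; a naive regex split stands in for `TactfulTokenizer::Model` and is not the Naive Bayes model; and the float escaping done by `escape_floats!` in the source is omitted, so inputs containing decimal numbers would be split incorrectly.

```ruby
ABBR_DOT = '&-&'   # placeholder the source uses for dots inside abbreviations
ELLIPSIS = '&;&.'  # placeholder the source uses for '...'

# Hypothetical helper: hide periods that do not end a sentence.
def mask(text)
  s = text.dup
  # Protect dots inside abbreviations such as "U.S." or "e.g.".
  s.gsub!(/(?:[A-Za-z]\.){2,}/) { |abbr| abbr.gsub('.', ABBR_DOT) }
  # Protect suspension points.
  s.gsub!('...', ELLIPSIS)
  # Insert a space after end punctuation that is stuck to the next word.
  s.gsub!(/([^\.\?!]\.|\!|\?)([^\s"'])/) { $1 + ' ' + $2 }
  s
end

# Hypothetical helper: restore the characters hidden by mask.
def unmask(sentence)
  sentence.gsub(ABBR_DOT, '.').gsub(ELLIPSIS, '...')
end

# Hypothetical helper: naive stand-in for the statistical segmenter.
# It splits on whitespace that follows '.', '?' or '!'.
def segment_sketch(text)
  mask(text).split(/(?<=[.?!])\s+/).map { |sentence| unmask(sentence) }
end

p segment_sketch('He lives in the U.S. now.She asked why?Nobody knew.')
```

The placeholders matter because a splitter that sees "U.S." unmasked may treat each period as a sentence boundary. Masking first keeps the segmenter's input unambiguous at those positions.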