Class: LLT::Segmenter

Inherits:
Object
Includes:
Constants::Abbreviations, Core::Serviceable
Defined in:
lib/llt/segmenter.rb,
lib/llt/segmenter/version.rb

Constant Summary

AWB =

Abbreviations with boundary, e.g. \bA

This doesn’t work in JRuby (an issue was opened at jruby/jruby#1269), so we have to use a workaround until it is fixed.

(?<=\s|^) can be just \b in MRI 2.0 and upwards.

ALL_ABBRS_PIPED.split('|').map { |abbr| "(?<=\\s|^)#{abbr}" }.join('|')
SENTENCE_CLOSER =
/(?<!#{AWB})\.(?!\.)|[;\?!:]/
DIRECT_SPEECH_DELIMITER =
/['"”]/
TRAILERS =
/\)|<\/.*?>/
VERSION =
"0.0.2"
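How AWB and SENTENCE_CLOSER work together can be sketched with a small stand-in abbreviation list. The list 'Cn|M|Q' below is a hypothetical placeholder, not the actual contents of ALL_ABBRS_PIPED, and the constant and method names are made up for illustration. Only the two pattern-building expressions are taken from the constants above. The sketch assumes MRI, in line with the JRuby note.

```ruby
# Hypothetical abbreviation list; the real one is ALL_ABBRS_PIPED
# from Constants::Abbreviations.
SKETCH_ABBRS_PIPED = 'Cn|M|Q'

# Same construction as AWB: each abbreviation must follow whitespace
# or the start of a line.
SKETCH_AWB = SKETCH_ABBRS_PIPED.split('|').map { |abbr| "(?<=\\s|^)#{abbr}" }.join('|')

# Same construction as SENTENCE_CLOSER: a period closes a sentence unless
# it follows an abbreviation or is followed by another period.
SKETCH_SENTENCE_CLOSER = /(?<!#{SKETCH_AWB})\.(?!\.)|[;\?!:]/

# Returns the character offsets at which the closer pattern matches.
def closer_positions(text)
  positions = []
  text.scan(SKETCH_SENTENCE_CLOSER) { positions << Regexp.last_match.begin(0) }
  positions
end

closer_positions('M. Tullius venit. Quid agis?') # => [16, 27]
closer_positions('Veni... vidi.')                # => [6, 12]
```

In the first call the period after "M" is skipped because "M" is in the stand-in list and sits at the start of the line. In the second call only the last period of the ellipsis counts as a closer.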

Class Method Summary

Instance Method Summary

Class Method Details

.default_options ⇒ Object



# File 'lib/llt/segmenter.rb', line 13

def self.default_options
  {
    indexing: true,
    newline_boundary: 2
  }
end
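As a rough illustration of how such defaults are commonly combined with per-call options in Ruby, the sketch below merges a caller's overrides on top of the hash returned by default_options. This is an assumption: the page does not show how Segmenter#setup consumes the options, and DEFAULTS and effective_options are hypothetical names.

```ruby
# Values copied from default_options above.
DEFAULTS = { indexing: true, newline_boundary: 2 }.freeze

# Hypothetical helper, not part of LLT::Segmenter: caller-supplied keys
# win, untouched keys keep their default value.
def effective_options(overrides = {})
  DEFAULTS.merge(overrides)
end

effective_options(indexing: false)
# => { indexing: false, newline_boundary: 2 }
```

Hash#merge returns a new hash, so the frozen defaults are never modified.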

Instance Method Details

#segment(string, add_to: nil, **options) ⇒ Object



# File 'lib/llt/segmenter.rb', line 31

def segment(string, add_to: nil, **options)
  setup(options)
  # dump whitespace at the beginning and end!
  string.strip!
  sentences = scan_through_string(StringScanner.new(string))
  add_to << sentences if add_to.respond_to?(:<<)
  sentences
end
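Two details of the listing above are worth illustrating. First, string.strip! modifies the caller's string in place and raises on a frozen string. Second, add_to << sentences appends the whole result as a single element when add_to is an Array. The sketch below is a simplified, hypothetical stand-in, not the library's implementation: scan_through_string is private and not shown on this page, so the loop, the SIMPLE_CLOSER pattern (which ignores abbreviations) and the name segment_sketch are all assumptions. It also deliberately uses the non-mutating strip.

```ruby
require 'strscan'

# Simplified closer for illustration only; it has no abbreviation handling.
SIMPLE_CLOSER = /\.(?!\.)|[;?!:]/

# Hypothetical stand-in for Segmenter#segment.
def segment_sketch(string, add_to: nil)
  # strip instead of strip!: the caller's string is left untouched.
  scanner = StringScanner.new(string.strip)
  sentences = []
  until scanner.eos?
    chunk = scanner.scan_until(SIMPLE_CLOSER)
    if chunk.nil?
      # No further closer: take the remainder as the last sentence.
      chunk = scanner.rest
      scanner.terminate
    end
    sentences << chunk.strip
  end
  # Same collector convention as the original method.
  add_to << sentences if add_to.respond_to?(:<<)
  sentences
end

collector = []
input = '  Arma virumque cano. Quid agis?  '.freeze
result = segment_sketch(input, add_to: collector)
# result    => ["Arma virumque cano.", "Quid agis?"]
# collector => [["Arma virumque cano.", "Quid agis?"]]
```

Note that the collector ends up holding one nested array per call, because << pushes its argument as a single element.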