Class: LLT::Segmenter
- Inherits:
- Object
- Includes:
- Constants::Abbreviations, Core::Serviceable
- Defined in:
- lib/llt/segmenter.rb,
lib/llt/segmenter/version.rb
Constant Summary
- AWB =
Abbreviations with boundary, e.g. \bA.
This doesn't work in JRuby (opened an issue at jruby/jruby#1269), so we have to change things as long as this is not fixed.
(?<=\s|^) can be just \b in MRI 2.0 and upwards.
ALL_ABBRS_PIPED.split('|').map { |abbr| "(?<=\\s|^)#{abbr}" }.join('|')
- SENTENCE_CLOSER =
/(?<!#{AWB})\.(?!\.)|[;\?!:]/
- DIRECT_SPEECH_DELIMITER =
/['"”]/
- TRAILERS =
/\)|<\/.*?>/
- VERSION =
"0.0.2"
Class Method Summary
Instance Method Summary
Class Method Details
.default_options ⇒ Object
# File 'lib/llt/segmenter.rb', line 13
def self.default_options
  {
    indexing: true,
    newline_boundary: 2
  }
end
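This page does not show how the segmenter combines these defaults with the options passed to #segment. As a minimal sketch, assuming per-call options simply override the defaults, a common Ruby idiom is Hash#merge. The module name OptionsSketch and the method resolve are hypothetical and not part of LLT::Segmenter.

```ruby
# Hypothetical illustration only: OptionsSketch and resolve are not part of
# LLT::Segmenter. The hash mirrors the documented default_options.
module OptionsSketch
  DEFAULTS = { indexing: true, newline_boundary: 2 }.freeze

  # Assumption: per-call overrides win over the defaults.
  def self.resolve(**overrides)
    DEFAULTS.merge(overrides)
  end
end

OptionsSketch.resolve(indexing: false)
# => { indexing: false, newline_boundary: 2 }
```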
Instance Method Details
#segment(string, add_to: nil, **options) ⇒ Object
# File 'lib/llt/segmenter.rb', line 31
def segment(string, add_to: nil, **options)
  setup(options)
  # dump whitespace at the beginning and end!
  string.strip!
  sentences = scan_through_string(StringScanner.new(string))
  add_to << sentences if add_to.respond_to?(:<<)
  sentences
end
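The private scan_through_string method is not shown on this page, so the following is only a rough sketch of how a StringScanner and a closer regex of the same shape as SENTENCE_CLOSER could split text. The module SegmenterSketch, the three-entry abbreviation list, and the loop body are all assumptions made for illustration. The real method may also handle indexing, newline boundaries, direct speech delimiters and trailers, none of which are modelled here.

```ruby
require 'strscan'

# Hypothetical stand-in, not the LLT implementation.
module SegmenterSketch
  # Assumed sample list; the real ALL_ABBRS_PIPED comes from
  # Constants::Abbreviations and is not shown on this page.
  ABBRS_PIPED = 'A|M|Cn'

  # Same construction as AWB: an abbreviation only counts when it
  # follows whitespace or the start of a line.
  AWB = ABBRS_PIPED.split('|').map { |abbr| "(?<=\\s|^)#{abbr}" }.join('|')

  # Same shape as SENTENCE_CLOSER: a period that is neither preceded by
  # an abbreviation nor followed by another period, or one of ; ? ! :
  CLOSER = /(?<!#{AWB})\.(?!\.)|[;\?!:]/

  def self.segment(string, add_to: nil)
    # Non-mutating strip, unlike the documented strip!
    scanner = StringScanner.new(string.strip)
    sentences = []
    until scanner.eos?
      # Take everything up to the next closer, or the remainder if none is left.
      sentences << (scanner.scan_until(CLOSER) || scanner.scan(/.+/m))
      scanner.skip(/\s+/)
    end
    add_to << sentences if add_to.respond_to?(:<<)
    sentences
  end
end

collector = []
SegmenterSketch.segment('M. Tullius venit. Quid dixit?', add_to: collector)
# => ["M. Tullius venit.", "Quid dixit?"]
```

Two details carry over from the documented method. The sentences array is appended to add_to as a single element, so collector ends up holding one nested array rather than individual sentences. The documented #segment calls string.strip!, which modifies the caller's string in place and raises FrozenError on a frozen string; the sketch avoids that by working on a stripped copy.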