Class: Treat::Workers::Processors::Segmenters::SRX

Inherits:
Object
Defined in:
lib/treat/workers/processors/segmenters/srx.rb

Overview

Sentence segmentation based on a set of predefined rules written in SRX (Segmentation Rules eXchange) format and developed by Marcin Miłkowski.

Original paper: Marcin Miłkowski, Jarosław Lipski, "Using SRX standard for sentence segmentation in LanguageTool", in: Human Language Technologies as a Challenge for Computer Science and Linguistics.
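To illustrate how rule-based segmentation in the SRX style works, here is a minimal sketch. It assumes the usual SRX model: an ordered list of rules, each with a "before break" and an "after break" pattern, where the first rule matching at a position decides whether to split there. The `Rule` struct, the two sample rules, and `split_sentences` are hypothetical and are not taken from the srx-english rule set or from Treat.

```ruby
# Hypothetical, simplified SRX-style rule (not the srx-english rules).
# split:  true to break at this position, false to forbid a break
# before: pattern that must match the text ending at the position
# after:  pattern that must match the text starting at the position
Rule = Struct.new(:split, :before, :after)

RULES = [
  # Exception rule: never split right after a few common abbreviations.
  Rule.new(false, /\b(?:Mr|Mrs|Dr|etc)\.\z/, /\A\s/),
  # Break rule: split after sentence-final punctuation followed by whitespace.
  Rule.new(true, /[.?!]+\z/, /\A\s/)
]

def split_sentences(text, rules = RULES)
  sentences = []
  start = 0
  (1...text.length).each do |pos|
    before = text[0...pos]
    after  = text[pos..-1]
    # The first rule whose two patterns both match decides the outcome.
    rule = rules.find { |r| before =~ r.before && after =~ r.after }
    next unless rule && rule.split
    sentences << text[start...pos].strip
    start = pos
  end
  tail = text[start..-1].strip
  sentences << tail unless tail.empty?
  sentences
end

p split_sentences("Dr. Smith arrived. He sat down.")
```

Listing the exception rule first is what keeps "Dr." from ending a sentence. A real SRX rule file encodes many such exceptions per language.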

Constant Summary

@@segmenters = {}
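Judging from the source of `.segment`, this hash acts as a per-language cache of splitter classes, filled on first use. The sketch below shows the same lookup-and-memoize pattern in isolation. `DemoSRX`, `SPLITTERS`, and `splitter_for` are hypothetical stand-ins, so that the example runs without the srx-english gem.

```ruby
# Stand-in for the namespace that a language-specific gem would define.
module DemoSRX
  module English
    class SentenceSplitter; end
  end
end

# Cache of splitter classes, keyed by language symbol.
SPLITTERS = {}

def splitter_for(lang)
  # Resolve e.g. :english to DemoSRX::English::SentenceSplitter once,
  # then reuse the cached class on later calls.
  SPLITTERS[lang] ||= DemoSRX
    .const_get(lang.to_s.capitalize)
    .const_get('SentenceSplitter')
end

p splitter_for(:english)
```

`Module#const_get` raises a `NameError` when no constant matches, so an unsupported language fails loudly instead of caching `nil`.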

Class Method Summary

  • .segment(entity, options = {}) ⇒ Object
    Segment a text using the SRX algorithm.

Class Method Details

.segment(entity, options = {}) ⇒ Object

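Segment a text into sentences using the SRX algorithm. The language-specific splitter library (for example srx-english) is required on first use, and its splitter class is cached in @@segmenters. Each sentence is added to the entity as a phrase, and the entity is returned.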



# File 'lib/treat/workers/processors/segmenters/srx.rb', line 15

def self.segment(entity, options = {})
  lang = entity.language
  entity.check_hasnt_children

  # Escape floating-point numbers before splitting
  # (they are restored per sentence below).
  text = entity.to_s
  text.escape_floats!

  # Require the appropriate gem once and cache its splitter class.
  unless @@segmenters[lang]
    require "srx/#{lang}/sentence_splitter"
    @@segmenters[lang] =
      SRX.const_get(lang.capitalize).const_get('SentenceSplitter')
  end

  sentences = @@segmenters[lang].new(text)

  # Add each sentence to the entity as a phrase.
  sentences.each do |sentence|
    sentence.unescape_floats!
    entity << Treat::Entities::Phrase.from_string(sentence.strip)
  end

  entity
end
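
The `escape_floats!` and `unescape_floats!` calls presumably keep the decimal point in a number such as 3.14 from being read as a sentence boundary. Treat's own implementation is not shown on this page, so the following is only a sketch of the idea. The `FLOAT_MARK` placeholder and both helper names are hypothetical, and the placeholder is assumed not to occur in the input text.

```ruby
# Hypothetical placeholder, assumed not to appear in the input text.
FLOAT_MARK = '^^'

# Replace every period that sits between two digits with the placeholder.
def escape_floats(text)
  text.gsub(/(?<=\d)\.(?=\d)/, FLOAT_MARK)
end

# Restore the original periods after segmentation.
def unescape_floats(text)
  text.gsub(FLOAT_MARK, '.')
end

escaped = escape_floats("Pi is about 3.14. It is irrational.")
puts escaped
puts unescape_floats(escaped)
```

Only periods between digits are touched, so the sentence-final period after "3.14" stays visible to the splitter.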