Class: PragmaticSegmenter::Process

Inherits:
Object
Includes:
Rules
Defined in:
lib/pragmatic_segmenter/process.rb

Overview

This class handles segmenting the text.

Constant Summary

QUOTATION_AT_END_OF_SENTENCE_REGEX =
/[!?\.-][\"\'\u{201d}\u{201c}]\s{1}[A-Z]/
PARENS_BETWEEN_DOUBLE_QUOTES_REGEX =
/["”]\s\(.*\)\s["“]/
BETWEEN_DOUBLE_QUOTES_REGEX =
/"(?:[^"])*[^,]"|“(?:[^”])*[^,]”/
SPLIT_SPACE_QUOTATION_AT_END_OF_SENTENCE_REGEX =
/(?<=[!?\.-][\"\'\u{201d}\u{201c}])\s{1}(?=[A-Z])/
CONTINUOUS_PUNCTUATION_REGEX =
/(?<=\S)(!|\?){3,}(?=(\s|\z|$))/
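
To make the patterns above more concrete, here is a minimal sketch that copies three of the constants into a hypothetical `DemoPatterns` module (not part of the library) and runs them against made-up sample strings. It only illustrates what each expression matches on its own; how Process uses them internally is not shown on this page.

```ruby
# Standalone copies of three constants from the summary above, so the
# snippet runs without the gem installed. DemoPatterns is a hypothetical
# wrapper module used only for this illustration.
module DemoPatterns
  QUOTATION_AT_END_OF_SENTENCE_REGEX =
    /[!?\.-][\"\'\u{201d}\u{201c}]\s{1}[A-Z]/
  SPLIT_SPACE_QUOTATION_AT_END_OF_SENTENCE_REGEX =
    /(?<=[!?\.-][\"\'\u{201d}\u{201c}])\s{1}(?=[A-Z])/
  CONTINUOUS_PUNCTUATION_REGEX =
    /(?<=\S)(!|\?){3,}(?=(\s|\z|$))/
end

sample = 'She asked "Are you sure?" Then she left.'

# Matches punctuation + quote + one whitespace character + capital letter.
sample[DemoPatterns::QUOTATION_AT_END_OF_SENTENCE_REGEX]
# => "?\" T"

# The lookaround variant matches only the whitespace, so splitting on it
# keeps the quote with the first part and the capital with the second.
sample.split(DemoPatterns::SPLIT_SPACE_QUOTATION_AT_END_OF_SENTENCE_REGEX)
# => ["She asked \"Are you sure?\"", "Then she left."]

# Matches a run of three or more ! or ? that follows a non-space character.
'Really??? Yes.'[DemoPatterns::CONTINUOUS_PUNCTUATION_REGEX] # => "???"
'No way!! Yes.'[DemoPatterns::CONTINUOUS_PUNCTUATION_REGEX]  # => nil
```

The two quotation patterns describe the same position in the text: the first consumes the surrounding characters, while the second uses lookbehind and lookahead so that only the space itself is matched.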

Constants included from Rules

Rules::AbbreviationsWithMultiplePeriodsAndEmailRule, Rules::ExtraWhiteSpaceRule, Rules::GeoLocationRule, Rules::QuestionMarkInQuotationRule, Rules::SingleNewLineRule, Rules::SubSingleQuoteRule

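Instance Attribute Summary

  • #doc_type ⇒ Object (readonly): Returns the value of attribute doc_type.
  • #text ⇒ Object (readonly): Returns the value of attribute text.

Instance Method Summary

  • #initialize(text:, doc_type:) ⇒ Process (constructor): Returns a new instance of Process.
  • #process ⇒ Object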

Constructor Details

#initialize(text:, doc_type:) ⇒ Process

Returns a new instance of Process.



# File 'lib/pragmatic_segmenter/process.rb', line 32

def initialize(text:, doc_type:)
  @text = text
  @doc_type = doc_type
end

Instance Attribute Details

#doc_type ⇒ Object (readonly)

Returns the value of attribute doc_type.



# File 'lib/pragmatic_segmenter/process.rb', line 31

def doc_type
  @doc_type
end

#text ⇒ Object (readonly)

Returns the value of attribute text.



# File 'lib/pragmatic_segmenter/process.rb', line 31

def text
  @text
end

Instance Method Details

#process ⇒ Object



# File 'lib/pragmatic_segmenter/process.rb', line 37

def process
  reformatted_text = PragmaticSegmenter::List.new(text: text).add_line_break
  reformatted_text = replace_abbreviations(reformatted_text)
  reformatted_text = replace_numbers(reformatted_text)
  reformatted_text = replace_continuous_punctuation(reformatted_text)
  reformatted_text.apply(AbbreviationsWithMultiplePeriodsAndEmailRule)
  reformatted_text.apply(GeoLocationRule)
  split_into_segments(reformatted_text)
end
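
In #process above, the two `apply` calls discard their return values, which suggests that `apply` modifies the text in place. The following is a minimal sketch of that pattern under that assumption. `DemoRule`, `DemoText`, and `CollapseSpacesRule` are hypothetical names invented for illustration; the library's own rule and text types are not shown on this page and may differ.

```ruby
# Hypothetical stand-ins for illustration only; not the library's classes.
DemoRule = Struct.new(:pattern, :replacement)

class DemoText < String
  # Applies each rule in order, mutating the receiver, and returns self.
  def apply(*rules)
    rules.flatten.each { |rule| gsub!(rule.pattern, rule.replacement) }
    self
  end
end

# Illustrative rule: collapse runs of two or more spaces into one.
CollapseSpacesRule = DemoRule.new(/ {2,}/, ' ')

text = DemoText.new('Hello.   World.')
text.apply(CollapseSpacesRule) # return value ignored, as in #process
text # => "Hello. World."
```

Because `gsub!` changes the string itself, the caller does not need to reassign the variable after each `apply`, unlike the `replace_*` helpers, whose results #process assigns back to `reformatted_text`.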