Class: PragmaticSegmenter::Process
- Inherits:
- Object
- PragmaticSegmenter::Process < Object
- Includes:
- Rules
- Defined in:
- lib/pragmatic_segmenter/process.rb
Overview
This class processes and segments the text.
Direct Known Subclasses
Languages::Amharic::Process, Languages::Arabic::Process, Languages::Armenian::Process, Languages::Burmese::Process, Languages::Common::Process, Languages::Deutsch::Process, Languages::Dutch::Process, Languages::English::Process, Languages::French::Process, Languages::Greek::Process, Languages::Hindi::Process, Languages::Italian::Process, Languages::Japanese::Process, Languages::Persian::Process, Languages::Polish::Process, Languages::Russian::Process, Languages::Spanish::Process, Languages::Urdu::Process
Constant Summary
- QUOTATION_AT_END_OF_SENTENCE_REGEX =
Rubular: rubular.com/r/NqCqv372Ix
/[!?\.-][\"\'\u{201d}\u{201c}]\s{1}[A-Z]/
- PARENS_BETWEEN_DOUBLE_QUOTES_REGEX =
Rubular: rubular.com/r/6flGnUMEVl
/["”]\s\(.*\)\s["“]/
- BETWEEN_DOUBLE_QUOTES_REGEX =
Rubular: rubular.com/r/TYzr4qOW1Q
/"(?:[^"])*[^,]"|“(?:[^”])*[^,]”/
- SPLIT_SPACE_QUOTATION_AT_END_OF_SENTENCE_REGEX =
Rubular: rubular.com/r/JMjlZHAT4g
/(?<=[!?\.-][\"\'\u{201d}\u{201c}])\s{1}(?=[A-Z])/
- CONTINUOUS_PUNCTUATION_REGEX =
Rubular: rubular.com/r/mQ8Es9bxtk
/(?<=\S)(!|\?){3,}(?=(\s|\z|$))/
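To make the patterns above easier to read, here is a small standalone sketch that runs them against sample strings. The constants are copied from the summary, except that the literal curly quotes are written as `\u{201c}` and `\u{201d}` escapes (equivalent in Ruby, and safer across source encodings). The sample sentences are made up for illustration, and the sketch does not show how the library itself consumes the matches.

```ruby
# Constants from the summary above; curly quotes written as \u escapes.
QUOTATION_AT_END_OF_SENTENCE_REGEX = /[!?\.-][\"\'\u{201d}\u{201c}]\s{1}[A-Z]/
PARENS_BETWEEN_DOUBLE_QUOTES_REGEX = /["\u{201d}]\s\(.*\)\s["\u{201c}]/
BETWEEN_DOUBLE_QUOTES_REGEX = /"(?:[^"])*[^,]"|\u{201c}(?:[^\u{201d}])*[^,]\u{201d}/
SPLIT_SPACE_QUOTATION_AT_END_OF_SENTENCE_REGEX = /(?<=[!?\.-][\"\'\u{201d}\u{201c}])\s{1}(?=[A-Z])/
CONTINUOUS_PUNCTUATION_REGEX = /(?<=\S)(!|\?){3,}(?=(\s|\z|$))/

# Made-up sample sentences (assumptions, not taken from the library's tests).
QUOTED = 'She said, "Stop!" Then he left.'
PARENS = '"One" (two) "three"'
SPOKEN = 'He said "hello there" and left.'
SHOUT  = 'Really!!! Yes.'

# Terminal punctuation, a closing quote, one space, then a capital letter.
p QUOTED.match?(QUOTATION_AT_END_OF_SENTENCE_REGEX)
# => true

# The lookaround variant matches only the space, so split keeps both sides whole.
p QUOTED.split(SPLIT_SPACE_QUOTATION_AT_END_OF_SENTENCE_REGEX)
# => ["She said, \"Stop!\"", "Then he left."]

# A parenthetical sitting between a closing and an opening double quote.
p PARENS[PARENS_BETWEEN_DOUBLE_QUOTES_REGEX]
# => "\" (two) \""

# A complete double-quoted span.
p SPOKEN.scan(BETWEEN_DOUBLE_QUOTES_REGEX)
# => ["\"hello there\""]

# Three or more consecutive "!" or "?" characters after a non-space character.
p SHOUT[CONTINUOUS_PUNCTUATION_REGEX]
# => "!!!"
p 'Hi!! There'[CONTINUOUS_PUNCTUATION_REGEX]
# => nil
```

The Rubular links listed with each constant can be used to try further inputs interactively.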
Constants included from Rules
Rules::AbbreviationsWithMultiplePeriodsAndEmailRule, Rules::ExtraWhiteSpaceRule, Rules::GeoLocationRule, Rules::QuestionMarkInQuotationRule, Rules::SingleNewLineRule, Rules::SubSingleQuoteRule
Instance Attribute Summary
- #doc_type ⇒ Object (readonly)
  Returns the value of attribute doc_type.
- #text ⇒ Object (readonly)
  Returns the value of attribute text.
Instance Method Summary
- #initialize(text:, doc_type:) ⇒ Process (constructor)
  A new instance of Process.
- #process ⇒ Object
Constructor Details
#initialize(text:, doc_type:) ⇒ Process
Returns a new instance of Process.
# File 'lib/pragmatic_segmenter/process.rb', line 32

def initialize(text:, doc_type:)
  @text = text
  @doc_type = doc_type
end
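The constructor takes two required keyword arguments and exposes them through read-only attributes. The sketch below illustrates that calling convention with a hypothetical stand-in class, `SketchProcess`, so it runs without the gem installed. It is not the library class, and `nil` is passed for `doc_type` only because this page does not list the accepted values.

```ruby
# Hypothetical stand-in mirroring the documented constructor and readers.
# It is NOT PragmaticSegmenter::Process.
class SketchProcess
  attr_reader :text, :doc_type

  def initialize(text:, doc_type:)
    @text = text
    @doc_type = doc_type
  end
end

# Both keywords must be supplied; nil is an illustrative doc_type value.
processor = SketchProcess.new(text: 'Hello world. Goodbye.', doc_type: nil)
p processor.text
# => "Hello world. Goodbye."

# Only readers are defined, so there is no setter.
p processor.respond_to?(:text=)
# => false

# Omitting a required keyword raises ArgumentError.
begin
  SketchProcess.new(text: 'Hello world.')
rescue ArgumentError => e
  puts "ArgumentError: #{e.message}"
end
```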
Instance Attribute Details
#doc_type ⇒ Object (readonly)
Returns the value of attribute doc_type.
# File 'lib/pragmatic_segmenter/process.rb', line 31

def doc_type
  @doc_type
end
#text ⇒ Object (readonly)
Returns the value of attribute text.
# File 'lib/pragmatic_segmenter/process.rb', line 31

def text
  @text
end
Instance Method Details
#process ⇒ Object
# File 'lib/pragmatic_segmenter/process.rb', line 37

def process
  reformatted_text = PragmaticSegmenter::List.new(text: text).add_line_break
  reformatted_text = replace_abbreviations(reformatted_text)
  reformatted_text = replace_numbers(reformatted_text)
  reformatted_text = replace_continuous_punctuation(reformatted_text)
  reformatted_text.apply(AbbreviationsWithMultiplePeriodsAndEmailRule)
  reformatted_text.apply(GeoLocationRule)
  split_into_segments(reformatted_text)
end
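The method above rewrites the text in several passes before splitting it. The helpers it calls are not shown on this page, so the following is only a rough sketch of the general "protect, split, restore" idea behind such a pipeline, not the library's implementation. `CONTINUOUS_PUNCTUATION_REGEX` is taken from the constant summary; every method name and the placeholder tokens `<EXCL>` and `<QST>` are hypothetical.

```ruby
# From the constant summary on this page.
CONTINUOUS_PUNCTUATION_REGEX = /(?<=\S)(!|\?){3,}(?=(\s|\z|$))/

# Hypothetical: hide runs such as "!!!" behind placeholder tokens so the
# splitter below does not treat them as sentence ends.
def protect_continuous_punctuation(text)
  text.gsub(CONTINUOUS_PUNCTUATION_REGEX) do |run|
    run.gsub('!', '<EXCL>').gsub('?', '<QST>')
  end
end

# Hypothetical naive splitter: break on whitespace after ".", "!" or "?".
def naive_split(text)
  text.split(/(?<=[.!?])\s+/)
end

# Hypothetical: turn the placeholder tokens back into punctuation.
def restore_punctuation(segment)
  segment.gsub('<EXCL>', '!').gsub('<QST>', '?')
end

# Chain the passes: protect, split, then restore each segment.
def sketch_process(text)
  protected_text = protect_continuous_punctuation(text)
  naive_split(protected_text).map { |segment| restore_punctuation(segment) }
end

SAMPLE = 'Wait!!! what happened? Nothing.'

p naive_split(SAMPLE)
# => ["Wait!!!", "what happened?", "Nothing."]
p sketch_process(SAMPLE)
# => ["Wait!!! what happened?", "Nothing."]
```

Comparing the two outputs shows why an earlier pass can matter: once the run of exclamation marks is hidden, the naive splitter no longer breaks after it.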