Module: PragmaticSegmenter::Languages::Common
- Includes: Rules
- Included in: Amharic, Arabic, Armenian, Burmese, Chinese, Deutsch, Dutch, English, French, Greek, Hindi, Italian, Japanese, Persian, Polish, Russian, Spanish, Urdu
- Defined in: lib/pragmatic_segmenter/languages/common.rb
Defined Under Namespace
Modules: Abbreviation, AmPmRules, SingleLetterAbbreviationRules
Classes: Cleaner, Process
Constant Summary
- Punctuations =
This constant holds the punctuation marks.
['。', '.', '.', '!', '!', '?', '?']
- SENTENCE_BOUNDARY_REGEX =
/\u{ff08}(?:[^\u{ff09}])*\u{ff09}(?=\s?[A-Z])|\u{300c}(?:[^\u{300d}])*\u{300d}(?=\s[A-Z])|\((?:[^\)]){2,}\)(?=\s[A-Z])|'(?:[^'])*[^,]'(?=\s[A-Z])|"(?:[^"])*[^,]"(?=\s[A-Z])|“(?:[^”])*[^,]”(?=\s[A-Z])|\S.*?[。..!!??ȸȹ☉☈☇☄]/
- QUOTATION_AT_END_OF_SENTENCE_REGEX =
Rubular: rubular.com/r/NqCqv372Ix
/[!?\.-][\"\'\u{201d}\u{201c}]\s{1}[A-Z]/
- PARENS_BETWEEN_DOUBLE_QUOTES_REGEX =
Rubular: rubular.com/r/6flGnUMEVl
/["”]\s\(.*\)\s["“]/
- BETWEEN_DOUBLE_QUOTES_REGEX =
Rubular: rubular.com/r/TYzr4qOW1Q
/"(?:[^"])*[^,]"|“(?:[^”])*[^,]”/
- SPLIT_SPACE_QUOTATION_AT_END_OF_SENTENCE_REGEX =
Rubular: rubular.com/r/JMjlZHAT4g
/(?<=[!?\.-][\"\'\u{201d}\u{201c}])\s{1}(?=[A-Z])/
- CONTINUOUS_PUNCTUATION_REGEX =
Rubular: rubular.com/r/mQ8Es9bxtk
/(?<=\S)(!|\?){3,}(?=(\s|\z|$))/
- PossessiveAbbreviationRule =
Rubular: rubular.com/r/yqa4Rit8EY
Rule.new(/\.(?='s\s)|\.(?='s$)|\.(?='s\z)/, '∯')
- KommanditgesellschaftRule =
Rubular: rubular.com/r/NEv265G2X2
Rule.new(/(?<=Co)\.(?=\sKG)/, '∯')
- MULTI_PERIOD_ABBREVIATION_REGEX =
Rubular: rubular.com/r/xDkpFZ0EgH
/\b[a-z](?:\.[a-z])+[.]/i
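A few of the patterns above can be tried in isolation. The sketch below copies four of them into local constants, so it does not load the gem, and runs them against made-up sample strings. It only illustrates what each pattern matches on its own, not how the segmenter combines them internally.

```ruby
# Local copies of four patterns from the Constant Summary, so this
# snippet runs without the gem installed. The sample strings are made up.
CONTINUOUS_PUNCTUATION_REGEX = /(?<=\S)(!|\?){3,}(?=(\s|\z|$))/
MULTI_PERIOD_ABBREVIATION_REGEX = /\b[a-z](?:\.[a-z])+[.]/i
BETWEEN_DOUBLE_QUOTES_REGEX = /"(?:[^"])*[^,]"|“(?:[^”])*[^,]”/
SPLIT_SPACE_QUOTATION_AT_END_OF_SENTENCE_REGEX =
  /(?<=[!?\.-][\"\'\u{201d}\u{201c}])\s{1}(?=[A-Z])/

# A run of three or more "!" or "?" directly after a non-space character.
p 'Really?!?! Yes.'[CONTINUOUS_PUNCTUATION_REGEX]
# => "?!?!"

# Abbreviations made of single letters separated by periods.
p 'The U.S.A. uses it, e.g. here.'.scan(MULTI_PERIOD_ABBREVIATION_REGEX)
# => ["U.S.A.", "e.g."]

# A double-quoted span whose last character before the closing quote
# is not a comma.
p 'He said "hello there" and left.'[BETWEEN_DOUBLE_QUOTES_REGEX]
# => "\"hello there\""

# The single space between a closing quotation mark (preceded by
# "!", "?", "." or "-") and a following capital letter.
p 'She said "Stop!" Then she left.'
  .split(SPLIT_SPACE_QUOTATION_AT_END_OF_SENTENCE_REGEX)
# => ["She said \"Stop!\"", "Then she left."]
```

Because the last pattern consists only of lookarounds around one space, `String#split` removes just that space and leaves the punctuation and quotation mark attached to the first sentence.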
Constants included from Rules
Rules::AbbreviationsWithMultiplePeriodsAndEmailRule, Rules::ConsecutiveForwardSlashRule, Rules::ConsecutivePeriodsRule, Rules::DoubleNewLineRule, Rules::DoubleNewLineWithSpaceRule, Rules::EscapedCarriageReturnRule, Rules::EscapedNewLineRule, Rules::ExtraWhiteSpaceRule, Rules::GeoLocationRule, Rules::InlineFormattingRule, Rules::NEWLINE_IN_MIDDLE_OF_SENTENCE_REGEX, Rules::NO_SPACE_BETWEEN_SENTENCES_DIGIT_REGEX, Rules::NO_SPACE_BETWEEN_SENTENCES_REGEX, Rules::NewLineFollowedByBulletRule, Rules::NewLineFollowedByPeriodRule, Rules::NewLineInMiddleOfWordRule, Rules::NoSpaceBetweenSentencesDigitRule, Rules::NoSpaceBetweenSentencesRule, Rules::PDF_NewLineInMiddleOfSentenceNoSpacesRule, Rules::PDF_NewLineInMiddleOfSentenceRule, Rules::QuestionMarkInQuotationRule, Rules::QuotationsFirstRule, Rules::QuotationsSecondRule, Rules::ReplaceNewlineWithCarriageReturnRule, Rules::SingleNewLineRule, Rules::SubSingleQuoteRule, Rules::TableOfContentsRule, Rules::TypoEscapedCarriageReturnRule, Rules::TypoEscapedNewLineRule, Rules::URL_EMAIL_KEYWORDS
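Constants ending in Rule, such as PossessiveAbbreviationRule and KommanditgesellschaftRule in the Constant Summary, pair a pattern with a replacement string. The following is a minimal sketch of that idea, assuming a rule is applied as a global substitution. The `Rule` struct and the `apply_rules` helper are hypothetical stand-ins defined here for illustration, not the library's own types or API.

```ruby
# Hypothetical stand-in for the library's Rule type: a pattern paired
# with its replacement. Defined locally so the snippet runs without the gem.
Rule = Struct.new(:pattern, :replacement)

# Patterns and replacements copied from the Constant Summary.
PossessiveAbbreviationRule = Rule.new(/\.(?='s\s)|\.(?='s$)|\.(?='s\z)/, '∯')
KommanditgesellschaftRule  = Rule.new(/(?<=Co)\.(?=\sKG)/, '∯')

# Hypothetical helper (an assumption, not the library's API): applies
# each rule in order as a global substitution.
def apply_rules(text, *rules)
  rules.reduce(text) { |acc, rule| acc.gsub(rule.pattern, rule.replacement) }
end

puts apply_rules("Acme Co. KG bought the U.S.'s largest mill.",
                 PossessiveAbbreviationRule,
                 KommanditgesellschaftRule)
# prints: Acme Co∯ KG bought the U.S∯'s largest mill.
```

In this sketch the periods in "Co." and in the possessive "U.S.'s" are swapped for the placeholder '∯', while the final period of the sentence is left alone. How and where the library restores the placeholder is not shown in this section.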