Class: PragmaticSegmenter::Cleaner
- Inherits:
- Object (hierarchy: Object → PragmaticSegmenter::Cleaner)
- Includes:
- Rules
- Defined in:
- lib/pragmatic_segmenter/cleaner.rb
Overview
This is an opinionated class that removes errant newlines, xhtml, inline formatting, etc.
Direct Known Subclasses
Languages::Amharic::Cleaner, Languages::Arabic::Cleaner, Languages::Armenian::Cleaner, Languages::Burmese::Cleaner, Languages::Common::Cleaner, Languages::Deutsch::Cleaner, Languages::Dutch::Cleaner, Languages::English::Cleaner, Languages::French::Cleaner, Languages::Greek::Cleaner, Languages::Hindi::Cleaner, Languages::Italian::Cleaner, Languages::Japanese::Cleaner, Languages::Persian::Cleaner, Languages::Polish::Cleaner, Languages::Russian::Cleaner, Languages::Spanish::Cleaner, Languages::Urdu::Cleaner
Constant Summary
- URL_EMAIL_KEYWORDS =
['@', 'http', '.com', 'net', 'www', '//']
- NO_SPACE_BETWEEN_SENTENCES_REGEX =
Rubular: rubular.com/r/6dt98uI76u
/(?<=[a-z])\.(?=[A-Z])/
- NO_SPACE_BETWEEN_SENTENCES_DIGIT_REGEX =
Rubular: rubular.com/r/l6KN6rH5XE
/(?<=\d)\.(?=[A-Z])/
- NewLineInMiddleOfWordRule =
Rubular: rubular.com/r/V57WnM9Zut
Rule.new(/\n(?=[a-zA-Z]{1,2}\n)/, '')
- NEWLINE_IN_MIDDLE_OF_SENTENCE_REGEX =
Rubular: rubular.com/r/3GiRiP2IbD
/(?<=\s)\n(?=([a-z]|\())/
- PDF_NewLineInMiddleOfSentenceRule =
Rubular: rubular.com/r/UZAVcwqck8
Rule.new(/(?<=[^\n]\s)\n(?=\S)/, '')
- PDF_NewLineInMiddleOfSentenceNoSpacesRule =
Rubular: rubular.com/r/eaNwGavmdo
Rule.new(/\n(?=[a-z])/, ' ')
- InlineFormattingRule =
Rubular: rubular.com/r/bAJrhyLNeZ
Rule.new(/\{b\^&gt;\d*&lt;b\^\}|\{b\^>\d*<b\^\}/, '')
- DoubleNewLineWithSpaceRule =
Rubular: rubular.com/r/dMxp5MixFS
Rule.new(/\n \n/, "\r")
- DoubleNewLineRule =
Rubular: rubular.com/r/H6HOJeA8bq
Rule.new(/\n\n/, "\r")
- NewLineFollowedByBulletRule =
Rubular: rubular.com/r/Gn18aAnLdZ
Rule.new(/\n(?=•)/, "\r")
- NewLineFollowedByPeriodRule =
Rubular: rubular.com/r/FseyMiiYFT
Rule.new(/\n(?=\.(\s|\n))/, '')
- TableOfContentsRule =
Rubular: rubular.com/r/8mc1ArOIGy
Rule.new(/\.{5,}\s*\d+-*\d*/, "\r")
- ConsecutivePeriodsRule =
Rubular: rubular.com/r/DwNSuZrNtk
Rule.new(/\.{5,}/, ' ')
- ConsecutiveForwardSlashRule =
Rubular: rubular.com/r/IQ4TPfsbd8
Rule.new(/\/{3}/, '')
- NoSpaceBetweenSentencesRule =
Rubular: rubular.com/r/6dt98uI76u
Rule.new(NO_SPACE_BETWEEN_SENTENCES_REGEX, '. ')
- NoSpaceBetweenSentencesDigitRule =
Rubular: rubular.com/r/l6KN6rH5XE
Rule.new(NO_SPACE_BETWEEN_SENTENCES_DIGIT_REGEX, '. ')
- EscapedCarriageReturnRule =
Rule.new(/\\r/, "\r")
- TypoEscapedCarriageReturnRule =
Rule.new(/\\\ r/, "\r")
- EscapedNewLineRule =
Rule.new(/\\n/, "\n")
- TypoEscapedNewLineRule =
Rule.new(/\\\ n/, "\n")
- ReplaceNewlineWithCarriageReturnRule =
Rule.new(/\n/, "\r")
- QuotationsFirstRule =
Rule.new(/''/, '"')
- QuotationsSecondRule =
Rule.new(/``/, '"')
Constants included from Rules
Rules::AbbreviationsWithMultiplePeriodsAndEmailRule, Rules::ExtraWhiteSpaceRule, Rules::GeoLocationRule, Rules::QuestionMarkInQuotationRule, Rules::SingleNewLineRule, Rules::SubSingleQuoteRule
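To make a few of the constants above concrete, here is a minimal sketch of what their patterns do to sample strings. The regular expressions and replacement strings are copied from the constant summary; the helper method names are hypothetical, and plain `String#gsub` stands in for the library's own rule-application machinery, which this page does not show.

```ruby
# Patterns are copied from the constant summary. The method names are
# hypothetical helpers for illustration only; they are not part of the
# PragmaticSegmenter API. String#gsub is an assumed stand-in for Rule/Text.

# Mirrors NoSpaceBetweenSentencesRule and NoSpaceBetweenSentencesDigitRule:
# a period squeezed between a lowercase letter (or digit) and an uppercase
# letter gets a space after it.
def add_missing_sentence_space(str)
  str.gsub(/(?<=[a-z])\.(?=[A-Z])/, '. ')
     .gsub(/(?<=\d)\.(?=[A-Z])/, '. ')
end

# Mirrors TableOfContentsRule: dot leaders followed by a page number
# become a single carriage return.
def strip_toc_leaders(str)
  str.gsub(/\.{5,}\s*\d+-*\d*/, "\r")
end

# Mirrors DoubleNewLineRule: a blank line becomes a single carriage return.
def mark_paragraph_breaks(str)
  str.gsub(/\n\n/, "\r")
end

p add_missing_sentence_space("Hello world.Today is 1999.Then it ended.")
# => "Hello world. Today is 1999. Then it ended."
p strip_toc_leaders("Introduction.......... 3")
# => "Introduction\r"
p mark_paragraph_breaks("First paragraph.\n\nSecond paragraph.")
# => "First paragraph.\rSecond paragraph."
```

Several rules replace with "\r" rather than "\n", which suggests the carriage return serves as an internal break marker; that reading is an inference from the replacement strings, not something this page states.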
Instance Attribute Summary
- #doc_type ⇒ Object (readonly)
  Returns the value of attribute doc_type.
- #text ⇒ Object (readonly)
  Returns the value of attribute text.
Instance Method Summary
- #clean ⇒ Object
  Clean text of unwanted formatting.
- #initialize(text:, **args) ⇒ Cleaner (constructor)
  A new instance of Cleaner.
Constructor Details
#initialize(text:, **args) ⇒ Cleaner
Returns a new instance of Cleaner.
# File 'lib/pragmatic_segmenter/cleaner.rb', line 82
def initialize(text:, **args)
  @text = Text.new(text.dup)
  @doc_type = args[:doc_type]
end
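The constructor wraps a copy of the input (`text.dup`) rather than the input itself, so in-place edits made during cleaning should not leak back to the caller's string. A minimal sketch of that behaviour follows. `SketchCleaner` is a hypothetical stand-in that copies only the constructor's shape, and a plain `String` replaces the library's `Text` class, whose definition is not shown on this page.

```ruby
# SketchCleaner is a hypothetical stand-in, not the library class. It mirrors
# the constructor shown above: a required text: keyword, optional extra
# keywords collected in **args, and a defensive dup of the input.
# Assumption: a plain String is used where the library wraps the copy in Text.
class SketchCleaner
  attr_reader :text, :doc_type

  def initialize(text:, **args)
    @text = text.dup
    @doc_type = args[:doc_type]
  end
end

original = "Line one\nline two"
cleaner = SketchCleaner.new(text: original, doc_type: 'pdf')

# Mutate the cleaner's private copy in place.
cleaner.text.gsub!("\n", ' ')

p original          # => "Line one\nline two" (caller's string is unchanged)
p cleaner.text      # => "Line one line two"
p cleaner.doc_type  # => "pdf"
```

Because `doc_type` is read from `**args`, omitting it leaves the attribute as `nil` instead of raising an error.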
Instance Attribute Details
#doc_type ⇒ Object (readonly)
Returns the value of attribute doc_type.
# File 'lib/pragmatic_segmenter/cleaner.rb', line 81
def doc_type
  @doc_type
end
#text ⇒ Object (readonly)
Returns the value of attribute text.
# File 'lib/pragmatic_segmenter/cleaner.rb', line 81
def text
  @text
end
Instance Method Details
#clean ⇒ Object
Clean text of unwanted formatting
Example:
>> text = "This is a sentence\ncut off in the middle because pdf."
>> PragmaticSegmenter::Cleaner.new(text: text).clean
=> "This is a sentence cut off in the middle because pdf."
Arguments:
  text: (String) *required
  language: (String) *optional (two-letter ISO 639-1 code, e.g. 'en')
  doc_type: (String) *optional (e.g. 'pdf')
# File 'lib/pragmatic_segmenter/cleaner.rb', line 101
def clean
  return unless text
  @clean_text = remove_all_newlines(text)
  replace_double_newlines(@clean_text)
  replace_newlines(@clean_text)
  replace_escaped_newlines(@clean_text)
  @clean_text.apply(HtmlRules::All)
  replace_punctuation_in_brackets(@clean_text)
  @clean_text.apply(InlineFormattingRule)
  clean_quotations(@clean_text)
  clean_table_of_contents(@clean_text)
  check_for_no_space_in_between_sentences(@clean_text)
  clean_consecutive_characters(@clean_text)
end
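The method body reads as an ordered pipeline: each step rewrites the same text object in place, so earlier rules shape what later rules see. The sketch below illustrates that idea with four rules from the constant summary, applied in the same relative order as the steps above. `SketchRule`, `SketchText`, and `sketch_clean` are hypothetical names. The real `Rule` and `Text` classes, and the exact rules each private helper applies, are not shown on this page and may differ.

```ruby
# Hypothetical stand-ins for illustration; not the library's classes.
# Assumption: a rule is a pattern/replacement pair, and #apply runs each
# rule in order with gsub! on the receiver.
SketchRule = Struct.new(:pattern, :replacement)

class SketchText < String
  def apply(*rules)
    rules.flatten.each { |rule| gsub!(rule.pattern, rule.replacement) }
    self
  end
end

# Patterns and replacements copied from the constant summary.
DOUBLE_NEWLINE    = SketchRule.new(/\n\n/, "\r")
ESCAPED_NEWLINE   = SketchRule.new(/\\n/, "\n")
QUOTATIONS_FIRST  = SketchRule.new(/''/, '"')
QUOTATIONS_SECOND = SketchRule.new(/``/, '"')
NO_SPACE          = SketchRule.new(/(?<=[a-z])\.(?=[A-Z])/, '. ')

# Newline handling first, then quotation marks, then sentence spacing,
# following the relative order of the steps in #clean.
def sketch_clean(raw)
  SketchText.new(raw.dup).apply(
    DOUBLE_NEWLINE, ESCAPED_NEWLINE,
    QUOTATIONS_FIRST, QUOTATIONS_SECOND,
    NO_SPACE
  )
end

p sketch_clean("He said ``hi'' to me.She left.\n\nNext line\\nends here.")
# => "He said \"hi\" to me. She left.\rNext line\nends here."
```

Order matters here: the literal backslash-n is turned into a real newline only after the double-newline rule has run, so it survives as "\n" in this sketch instead of being merged into a "\r" break.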