Class: PragmaticSegmenter::Cleaner
- Inherits:
- Object (ancestry: Object, PragmaticSegmenter::Cleaner)
- Includes:
- Rules
- Defined in:
- lib/pragmatic_segmenter/cleaner.rb
Overview
An opinionated class that removes errant newlines, XHTML markup, inline formatting, and similar artifacts from text.
Direct Known Subclasses
Languages::Common::Cleaner, Languages::Deutsch::Cleaner, Languages::English::Cleaner, Languages::Japanese::Cleaner, Languages::Spanish::Cleaner
Constant Summary
Constants included from Rules
Rules::AbbreviationsWithMultiplePeriodsAndEmailRule, Rules::ConsecutiveForwardSlashRule, Rules::ConsecutivePeriodsRule, Rules::DoubleNewLineRule, Rules::DoubleNewLineWithSpaceRule, Rules::EscapedCarriageReturnRule, Rules::EscapedNewLineRule, Rules::ExtraWhiteSpaceRule, Rules::GeoLocationRule, Rules::InlineFormattingRule, Rules::NEWLINE_IN_MIDDLE_OF_SENTENCE_REGEX, Rules::NO_SPACE_BETWEEN_SENTENCES_DIGIT_REGEX, Rules::NO_SPACE_BETWEEN_SENTENCES_REGEX, Rules::NewLineFollowedByBulletRule, Rules::NewLineFollowedByPeriodRule, Rules::NewLineInMiddleOfWordRule, Rules::NoSpaceBetweenSentencesDigitRule, Rules::NoSpaceBetweenSentencesRule, Rules::PDF_NewLineInMiddleOfSentenceNoSpacesRule, Rules::PDF_NewLineInMiddleOfSentenceRule, Rules::QuestionMarkInQuotationRule, Rules::QuotationsFirstRule, Rules::QuotationsSecondRule, Rules::ReplaceNewlineWithCarriageReturnRule, Rules::SingleNewLineRule, Rules::SubSingleQuoteRule, Rules::TableOfContentsRule, Rules::TypoEscapedCarriageReturnRule, Rules::TypoEscapedNewLineRule, Rules::URL_EMAIL_KEYWORDS
Instance Attribute Summary
- #doc_type ⇒ Object (readonly)
  Returns the value of attribute doc_type.
- #text ⇒ Object (readonly)
  Returns the value of attribute text.
Instance Method Summary
- #clean ⇒ Object
  Clean text of unwanted formatting.
- #initialize(text:, doc_type: nil, language: Languages::Common, **args) ⇒ Cleaner (constructor)
  A new instance of Cleaner.
Constructor Details
#initialize(text:, doc_type: nil, language: Languages::Common, **args) ⇒ Cleaner
Returns a new instance of Cleaner.
# File 'lib/pragmatic_segmenter/cleaner.rb', line 10
def initialize(text:, doc_type: nil, language: Languages::Common, **args)
  @text = Text.new(text.dup)
  @doc_type = doc_type
  @language = language
end
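The constructor stores a copy of its input (`text.dup`), so later in-place edits should not reach the caller's string. The sketch below illustrates that one point with a stand-in class. `SketchCleaner` and `squeeze_spaces!` are hypothetical names invented for this example and are not part of PragmaticSegmenter.

```ruby
# Hypothetical stand-in for the Cleaner constructor; NOT the library's class.
class SketchCleaner
  attr_reader :text, :doc_type

  def initialize(text:, doc_type: nil, **_args)
    @text = text.dup       # copy, so in-place edits never touch the caller's string
    @doc_type = doc_type
  end

  # Hypothetical in-place cleanup step: collapse runs of spaces.
  def squeeze_spaces!
    @text.gsub!(/ {2,}/, ' ')
    self
  end
end

original = "Too  many   spaces."
cleaner  = SketchCleaner.new(text: original, doc_type: 'pdf').squeeze_spaces!

cleaner.text  # => "Too many spaces."
original      # => "Too  many   spaces." (unchanged)
```

Without the `dup`, `gsub!` would have rewritten `original` as well, which is usually surprising for a caller that only asked for a cleaned copy.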
Instance Attribute Details
#doc_type ⇒ Object (readonly)
Returns the value of attribute doc_type.
# File 'lib/pragmatic_segmenter/cleaner.rb', line 9
def doc_type
  @doc_type
end
#text ⇒ Object (readonly)
Returns the value of attribute text.
# File 'lib/pragmatic_segmenter/cleaner.rb', line 9
def text
  @text
end
Instance Method Details
#clean ⇒ Object
Clean text of unwanted formatting
Example:
>> text = "This is a sentence\ncut off in the middle because pdf."
>> PragmaticSegmenter::Cleaner.new(text: text).clean
=> "This is a sentence cut off in the middle because pdf."
Arguments:
text: (String) *required
language: (Module) *optional
(a language module, e.g. Languages::English; the constructor signature defaults to Languages::Common)
doc_type: (String) *optional
(e.g. 'pdf')
# File 'lib/pragmatic_segmenter/cleaner.rb', line 30
def clean
  return unless text
  @clean_text = remove_all_newlines(text)
  replace_double_newlines(@clean_text)
  replace_newlines(@clean_text)
  replace_escaped_newlines(@clean_text)
  @clean_text.apply(HTMLRules::All)
  replace_punctuation_in_brackets(@clean_text)
  @clean_text.apply(InlineFormattingRule)
  clean_quotations(@clean_text)
  clean_table_of_contents(@clean_text)
  check_for_no_space_in_between_sentences(@clean_text)
  clean_consecutive_characters(@clean_text)
end
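The method above reads as an ordered pipeline: each step rewrites the same text object in place, and calls such as `apply(InlineFormattingRule)` suggest that a rule pairs a pattern with a replacement. A minimal sketch of that pattern follows, assuming a rule is a simple pattern/replacement pair. `SketchRule`, `SketchText`, `sketch_clean`, and both regexes are hypothetical simplifications for illustration, not the library's `Rules` constants or its actual implementation.

```ruby
# Hypothetical rule type: a pattern and the string that replaces each match.
SketchRule = Struct.new(:pattern, :replacement)

# Hypothetical text wrapper that applies rules in order, in place.
class SketchText < String
  def apply(*rules)
    rules.flatten.each { |rule| gsub!(rule.pattern, rule.replacement) }
    self
  end
end

# Simplified stand-ins for rules like the ones listed under Rules.
NEWLINE_MID_SENTENCE = SketchRule.new(/(?<=\S)\n(?=[a-z])/, ' ')
EXTRA_WHITESPACE     = SketchRule.new(/ {2,}/, ' ')

# Mirrors the shape of #clean: bail out on nil, copy the input, run rules in order.
def sketch_clean(raw)
  return unless raw
  SketchText.new(raw.dup).apply(NEWLINE_MID_SENTENCE, EXTRA_WHITESPACE)
end

cleaned = sketch_clean("This is a sentence\ncut off in the middle because pdf.")
# => "This is a sentence cut off in the middle because pdf."
```

Order matters in a pipeline like this: joining broken lines first can create double spaces, so the whitespace rule runs afterwards.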