Class: PragmaticSegmenter::Cleaner
- Inherits: Object
  - Object
    - PragmaticSegmenter::Cleaner
- Includes: Rules
- Defined in: lib/pragmatic_segmenter/cleaner.rb, lib/pragmatic_segmenter/cleaner/rules.rb
Overview
This is an opinionated class that removes errant newlines, XHTML, inline formatting, etc.
Direct Known Subclasses
Defined Under Namespace
Modules: Rules
Constant Summary
Constants included from Rules
Rules::ConsecutiveForwardSlashRule, Rules::ConsecutivePeriodsRule, Rules::DoubleNewLineRule, Rules::DoubleNewLineWithSpaceRule, Rules::EscapedCarriageReturnRule, Rules::EscapedNewLineRule, Rules::InlineFormattingRule, Rules::NEWLINE_IN_MIDDLE_OF_SENTENCE_REGEX, Rules::NO_SPACE_BETWEEN_SENTENCES_DIGIT_REGEX, Rules::NO_SPACE_BETWEEN_SENTENCES_REGEX, Rules::NewLineFollowedByBulletRule, Rules::NewLineFollowedByPeriodRule, Rules::NewLineInMiddleOfWordRule, Rules::NoSpaceBetweenSentencesDigitRule, Rules::NoSpaceBetweenSentencesRule, Rules::QuotationsFirstRule, Rules::QuotationsSecondRule, Rules::ReplaceNewlineWithCarriageReturnRule, Rules::TableOfContentsRule, Rules::TypoEscapedCarriageReturnRule, Rules::TypoEscapedNewLineRule, Rules::URL_EMAIL_KEYWORDS
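The constants above are applied to the text through calls such as `@text.apply(InlineFormattingRule)`. As a rough illustration of that idea, the sketch below pairs a pattern with a replacement and applies rules in order. `SketchRule`, `SketchText`, and both patterns are hypothetical stand-ins chosen for this example; they are not taken from the library's rule definitions.

```ruby
# Hypothetical stand-in for a cleaning rule: a pattern plus its replacement.
SketchRule = Struct.new(:pattern, :replacement)

# Hypothetical String subclass that applies one or more rules in order.
class SketchText < String
  def apply(*rules)
    rules.flatten.reduce(self) do |text, rule|
      SketchText.new(text.gsub(rule.pattern, rule.replacement))
    end
  end
end

# Assumed patterns, chosen only for illustration.
SKETCH_CONSECUTIVE_PERIODS = SketchRule.new(/\.{5,}/, ' ')
SKETCH_DOUBLE_NEWLINE      = SketchRule.new(/\n\n/, "\r")

# A table-of-contents style line with a run of leader dots.
toc_line = SketchText.new("Chapter 1..........12")
cleaned_toc = toc_line.apply(SKETCH_CONSECUTIVE_PERIODS)

# Two paragraphs separated by a blank line.
paragraphs = SketchText.new("First paragraph.\n\nSecond paragraph.")
cleaned_paragraphs = paragraphs.apply(SKETCH_DOUBLE_NEWLINE)
```

Keeping each rule as a small value object makes it possible to pass several rules to a single `apply` call and have them run in the order given.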
Instance Attribute Summary
- #doc_type ⇒ Object (readonly): Returns the value of attribute doc_type.
- #text ⇒ Object (readonly): Returns the value of attribute text.
Instance Method Summary
- #clean ⇒ Object: Clean text of unwanted formatting.
- #initialize(text:, doc_type: nil, language: Languages::Common) ⇒ Cleaner (constructor): A new instance of Cleaner.
Constructor Details
#initialize(text:, doc_type: nil, language: Languages::Common) ⇒ Cleaner
Returns a new instance of Cleaner.
# File 'lib/pragmatic_segmenter/cleaner.rb', line 11
def initialize(text:, doc_type: nil, language: Languages::Common)
  @text = Text.new(text)
  @doc_type = doc_type
  @language = language
end
Instance Attribute Details
#doc_type ⇒ Object (readonly)
Returns the value of attribute doc_type.
# File 'lib/pragmatic_segmenter/cleaner.rb', line 10
def doc_type
  @doc_type
end
#text ⇒ Object (readonly)
Returns the value of attribute text.
# File 'lib/pragmatic_segmenter/cleaner.rb', line 10
def text
  @text
end
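The constructor signature and the two read-only attributes imply a few behaviours that are easy to miss: `text:` is a required keyword, `doc_type` and `text` cannot be reassigned from outside, and `language` is stored without a reader. The sketch below mirrors only that shape. `SketchCleaner` is a hypothetical stand-in, it stores the raw string instead of wrapping it in `Text`, and `:common` is a placeholder for the `Languages::Common` default.

```ruby
# Hypothetical stand-in that mirrors only the constructor signature and readers.
class SketchCleaner
  attr_reader :text, :doc_type

  def initialize(text:, doc_type: nil, language: :common)
    @text = text
    @doc_type = doc_type
    @language = language # stored, but no reader is exposed for it
  end
end

cleaner = SketchCleaner.new(text: "Some text.", doc_type: 'pdf')

# Omitting the required text: keyword raises ArgumentError.
missing_text_error =
  begin
    SketchCleaner.new(doc_type: 'pdf')
  rescue ArgumentError => e
    e.message
  end
```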
Instance Method Details
#clean ⇒ Object
Clean text of unwanted formatting.
Example:
>> text = "This is a sentence\ncut off in the middle because pdf."
>> PragmaticSegmenter::Cleaner.new(text: text).clean
=> "This is a sentence cut off in the middle because pdf."
Arguments:
text: (String) *required
language: *optional
(defaults to Languages::Common, as in the constructor signature)
doc_type: (String) *optional
(e.g. 'pdf')
# File 'lib/pragmatic_segmenter/cleaner.rb', line 31
def clean
  return unless text
  remove_all_newlines
  replace_double_newlines
  replace_newlines
  replace_escaped_newlines
  @text.apply(HTML::All)
  replace_punctuation_in_brackets
  @text.apply(InlineFormattingRule)
  clean_quotations
  clean_table_of_contents
  check_for_no_space_in_between_sentences
  clean_consecutive_characters
end
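The method body is a nil guard followed by a fixed sequence of cleaning steps. A minimal sketch of that structure follows, assuming each step can be reduced to a pattern and a replacement. `SKETCH_STEPS` and `sketch_clean` are hypothetical, and the four steps do not correspond one-to-one to the helper methods called in #clean.

```ruby
# Hypothetical pipeline: each step is a [pattern, replacement] pair, applied in order.
SKETCH_STEPS = [
  [/\n\n/, "\r"],      # paragraph breaks first...
  [/\n/, ' '],         # ...then any remaining single newlines
  [/<\/?[^>]+>/, ''],  # strip simple tags
  [/ {2,}/, ' ']       # collapse runs of spaces
].freeze

def sketch_clean(text)
  return unless text # same nil guard as the first line of #clean

  SKETCH_STEPS.reduce(text) do |acc, (pattern, replacement)|
    acc.gsub(pattern, replacement)
  end
end

raw = "A <b>bold</b> claim\nsplit by a line break.\n\nNext paragraph."
result = sketch_clean(raw)
# result == "A bold claim split by a line break.\rNext paragraph."
```

Order matters in a pipeline like this one: if the single-newline step ran before the double-newline step, the paragraph break in `raw` would collapse into spaces and could no longer be told apart from a mid-sentence line break.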