Class: PragmaticSegmenter::Cleaner

Inherits:
Object
Includes:
Rules
Defined in:
lib/pragmatic_segmenter/cleaner.rb

Overview

An opinionated class that removes errant newlines, XHTML markup, inline formatting, and similar artifacts from text.

Constant Summary

Constants included from Rules

Rules::AbbreviationsWithMultiplePeriodsAndEmailRule, Rules::ConsecutiveForwardSlashRule, Rules::ConsecutivePeriodsRule, Rules::DoubleNewLineRule, Rules::DoubleNewLineWithSpaceRule, Rules::EscapedCarriageReturnRule, Rules::EscapedNewLineRule, Rules::ExtraWhiteSpaceRule, Rules::GeoLocationRule, Rules::InlineFormattingRule, Rules::NEWLINE_IN_MIDDLE_OF_SENTENCE_REGEX, Rules::NO_SPACE_BETWEEN_SENTENCES_DIGIT_REGEX, Rules::NO_SPACE_BETWEEN_SENTENCES_REGEX, Rules::NewLineFollowedByBulletRule, Rules::NewLineFollowedByPeriodRule, Rules::NewLineInMiddleOfWordRule, Rules::NoSpaceBetweenSentencesDigitRule, Rules::NoSpaceBetweenSentencesRule, Rules::PDF_NewLineInMiddleOfSentenceNoSpacesRule, Rules::PDF_NewLineInMiddleOfSentenceRule, Rules::QuestionMarkInQuotationRule, Rules::QuotationsFirstRule, Rules::QuotationsSecondRule, Rules::ReplaceNewlineWithCarriageReturnRule, Rules::SingleNewLineRule, Rules::SubSingleQuoteRule, Rules::TableOfContentsRule, Rules::TypoEscapedCarriageReturnRule, Rules::TypoEscapedNewLineRule, Rules::URL_EMAIL_KEYWORDS

Instance Attribute Summary

Instance Method Summary

Constructor Details

#initialize(text:, doc_type: nil, language: Languages::Common, **args) ⇒ Cleaner

Returns a new instance of Cleaner.

# File 'lib/pragmatic_segmenter/cleaner.rb', line 10

def initialize(text:, doc_type: nil, language: Languages::Common, **args)
  @text = Text.new(text.dup)
  @doc_type = doc_type
  @language = language
end
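
The constructor stores a copy of the input (`text.dup`) rather than the caller's string, which matters if the cleaning steps modify their receiver in place. A minimal sketch of that defensive-copy pattern follows. `SketchCleaner` is a hypothetical stand-in, not part of the gem, and its single substitution only stands in for the real rules:

```ruby
# Hypothetical stand-in for Cleaner; it illustrates only the text.dup pattern.
class SketchCleaner
  attr_reader :text

  def initialize(text:)
    # Copy the input so in-place edits never touch the caller's string.
    @text = text.dup
  end

  def clean
    # gsub! mutates @text in place (and returns nil when nothing matched),
    # so return @text explicitly.
    @text.gsub!(/\n/, ' ')
    @text
  end
end

original = "line one\nline two"
cleaned = SketchCleaner.new(text: original).clean
```

In this sketch `cleaned` holds the joined line while `original` keeps its newline, so callers can reuse their input after cleaning.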

Instance Attribute Details

#doc_type ⇒ Object (readonly)

Returns the value of attribute doc_type.

# File 'lib/pragmatic_segmenter/cleaner.rb', line 9

def doc_type
  @doc_type
end

#text ⇒ Object (readonly)

Returns the value of attribute text.

# File 'lib/pragmatic_segmenter/cleaner.rb', line 9

def text
  @text
end

Instance Method Details

#clean ⇒ Object

Clean text of unwanted formatting

Example:

>> text = "This is a sentence\ncut off in the middle because pdf."
>> PragmaticSegmenter::Cleaner.new(text: text).clean
=> "This is a sentence cut off in the middle because pdf."

Arguments:

text:       (String)  *required
language:   (Object)  *optional
            (a language definition; defaults to Languages::Common)
doc_type:   (String)  *optional
            (e.g. 'pdf')

# File 'lib/pragmatic_segmenter/cleaner.rb', line 30

def clean
  return unless text
  @clean_text = remove_all_newlines(text)
  replace_double_newlines(@clean_text)
  replace_newlines(@clean_text)
  replace_escaped_newlines(@clean_text)
  @clean_text.apply(HTMLRules::All)
  replace_punctuation_in_brackets(@clean_text)
  @clean_text.apply(InlineFormattingRule)
  clean_quotations(@clean_text)
  clean_table_of_contents(@clean_text)
  check_for_no_space_in_between_sentences(@clean_text)
  clean_consecutive_characters(@clean_text)
end
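
Several steps in `clean` call `apply` with a rule or a group of rules. A rough sketch of how such a pipeline can work, assuming each rule is a pattern/replacement pair applied in order with in-place substitution. `SketchRule`, `SketchText`, and the two rules below are hypothetical illustrations, not the gem's own definitions:

```ruby
# Hypothetical rule type: a regex plus the string that replaces each match.
SketchRule = Struct.new(:pattern, :replacement)

# Hypothetical String subclass whose apply method runs rules in sequence.
class SketchText < String
  # Applies each rule in order, mutating the receiver, and returns self.
  def apply(*rules)
    rules.flatten.each { |rule| gsub!(rule.pattern, rule.replacement) }
    self
  end
end

# Illustrative rules only; the real patterns live in the Rules module.
join_newline = SketchRule.new(/\n/, ' ')
squeeze_spaces = SketchRule.new(/ {2,}/, ' ')

text = SketchText.new("A sentence\ncut  in two.")
result = text.apply(join_newline, squeeze_spaces)
```

Order matters in a pipeline like this: the second rule sees the output of the first, which is why `clean` runs its newline handling before the later formatting steps.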