Class: PragmaticSegmenter::Cleaner

Inherits:
Object
Includes:
Rules
Defined in:
lib/pragmatic_segmenter/cleaner.rb

Overview

An opinionated class that removes errant newlines, XHTML, inline formatting, and similar artifacts from text.

Constant Summary

URL_EMAIL_KEYWORDS =
['@', 'http', '.com', 'net', 'www', '//']
NO_SPACE_BETWEEN_SENTENCES_REGEX =
/(?<=[a-z])\.(?=[A-Z])/
NO_SPACE_BETWEEN_SENTENCES_DIGIT_REGEX =
/(?<=\d)\.(?=[A-Z])/
NewLineInMiddleOfWordRule =
Rule.new(/\n(?=[a-zA-Z]{1,2}\n)/, '')
NEWLINE_IN_MIDDLE_OF_SENTENCE_REGEX =
/(?<=\s)\n(?=([a-z]|\())/
PDF_NewLineInMiddleOfSentenceRule =
Rule.new(/(?<=[^\n]\s)\n(?=\S)/, '')
PDF_NewLineInMiddleOfSentenceNoSpacesRule =
Rule.new(/\n(?=[a-z])/, ' ')
InlineFormattingRule =
Rule.new(/\{b\^&gt;\d*&lt;b\^\}|\{b\^>\d*<b\^\}/, '')
DoubleNewLineWithSpaceRule =
Rule.new(/\n \n/, "\r")
DoubleNewLineRule =
Rule.new(/\n\n/, "\r")
NewLineFollowedByBulletRule =
Rule.new(/\n(?=•)/, "\r")
NewLineFollowedByPeriodRule =
Rule.new(/\n(?=\.(\s|\n))/, '')
TableOfContentsRule =
Rule.new(/\.{5,}\s*\d+-*\d*/, "\r")
ConsecutivePeriodsRule =
Rule.new(/\.{5,}/, ' ')
ConsecutiveForwardSlashRule =
Rule.new(/\/{3}/, '')
NoSpaceBetweenSentencesRule =
Rule.new(NO_SPACE_BETWEEN_SENTENCES_REGEX, '. ')
NoSpaceBetweenSentencesDigitRule =
Rule.new(NO_SPACE_BETWEEN_SENTENCES_DIGIT_REGEX, '. ')
EscapedCarriageReturnRule =
Rule.new(/\\r/, "\r")
TypoEscapedCarriageReturnRule =
Rule.new(/\\\ r/, "\r")
EscapedNewLineRule =
Rule.new(/\\n/, "\n")
TypoEscapedNewLineRule =
Rule.new(/\\\ n/, "\n")
ReplaceNewlineWithCarriageReturnRule =
Rule.new(/\n/, "\r")
QuotationsFirstRule =
Rule.new(/''/, '"')
QuotationsSecondRule =
Rule.new(/``/, '"')
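Each constant above pairs a pattern with a replacement string. As a rough illustration of what a few of them do, the sketch below applies the same patterns with plain `String#gsub`. `SketchRule` and `apply_rules` are hypothetical stand-ins introduced here for illustration; they are not the library's `Rule` class or its `apply` method, which are defined elsewhere and may behave differently.

```ruby
# Hypothetical stand-in for the library's Rule: a pattern plus its replacement.
SketchRule = Struct.new(:pattern, :replacement)

# Patterns and replacements copied from the constant summary above.
NO_SPACE_BETWEEN_SENTENCES = SketchRule.new(/(?<=[a-z])\.(?=[A-Z])/, '. ')
TABLE_OF_CONTENTS          = SketchRule.new(/\.{5,}\s*\d+-*\d*/, "\r")
QUOTATIONS_FIRST           = SketchRule.new(/''/, '"')
QUOTATIONS_SECOND          = SketchRule.new(/``/, '"')

# Hypothetical helper: apply each rule in order to a copy of the input.
def apply_rules(text, *rules)
  rules.reduce(text.dup) { |acc, rule| acc.gsub(rule.pattern, rule.replacement) }
end

# A period wedged between a lowercase and an uppercase letter gains a space.
apply_rules("It works.Next sentence.", NO_SPACE_BETWEEN_SENTENCES)
# => "It works. Next sentence."

# TeX-style quote pairs become straight double quotes.
apply_rules("``quoted''", QUOTATIONS_FIRST, QUOTATIONS_SECOND)
# => "\"quoted\""

# A dotted leader followed by a page number collapses to a carriage return.
apply_rules("Introduction.......... 12", TABLE_OF_CONTENTS)
# => "Introduction\r"
```

Several rules replace with "\r" rather than "\n", so a carriage return appears to act as the marker for a paragraph-level break in the cleaned text.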

Constants included from Rules

Rules::AbbreviationsWithMultiplePeriodsAndEmailRule, Rules::ExtraWhiteSpaceRule, Rules::GeoLocationRule, Rules::QuestionMarkInQuotationRule, Rules::SingleNewLineRule, Rules::SubSingleQuoteRule

Constructor Details

#initialize(text:, **args) ⇒ Cleaner

Returns a new instance of Cleaner.



# File 'lib/pragmatic_segmenter/cleaner.rb', line 82

def initialize(text:, **args)
  @text = Text.new(text.dup)
  @doc_type = args[:doc_type]
end
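The constructor takes `text:` as a required keyword and reads `doc_type` from the remaining options. The sketch below mirrors that shape with a hypothetical `SketchCleaner` class (not the library class, and without the library's `Text` wrapper) to show one likely effect of `text.dup`: the caller's original string is left untouched.

```ruby
# Hypothetical miniature of the constructor's shape, for illustration only.
class SketchCleaner
  attr_reader :text, :doc_type

  def initialize(text:, **args)
    @text = text.dup            # work on a copy so the caller's string is not mutated
    @doc_type = args[:doc_type] # nil when the option is omitted
  end
end

original = "Line one\nline two"
cleaner = SketchCleaner.new(text: original, doc_type: 'pdf')
cleaner.text << " (edited)"     # mutates only the copy held by the cleaner

original          # => "Line one\nline two"
cleaner.doc_type  # => "pdf"
```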

Instance Attribute Details

#doc_type ⇒ Object (readonly)

Returns the value of attribute doc_type.



# File 'lib/pragmatic_segmenter/cleaner.rb', line 81

def doc_type
  @doc_type
end

#text ⇒ Object (readonly)

Returns the value of attribute text.



# File 'lib/pragmatic_segmenter/cleaner.rb', line 81

def text
  @text
end

Instance Method Details

#clean ⇒ Object

Clean text of unwanted formatting

Example:

>> text = "This is a sentence\ncut off in the middle because pdf."
>> PragmaticSegmenter::Cleaner.new(text: text).clean
=> "This is a sentence cut off in the middle because pdf."

Arguments:

text:       (String)  *required
language:   (String)  *optional
            (two-letter ISO 639-1 code e.g. 'en')
doc_type:   (String)  *optional
            (e.g. 'pdf')


# File 'lib/pragmatic_segmenter/cleaner.rb', line 101

def clean
  return unless text
  @clean_text = remove_all_newlines(text)
  replace_double_newlines(@clean_text)
  replace_newlines(@clean_text)
  replace_escaped_newlines(@clean_text)
  @clean_text.apply(HtmlRules::All)
  replace_punctuation_in_brackets(@clean_text)
  @clean_text.apply(InlineFormattingRule)
  clean_quotations(@clean_text)
  clean_table_of_contents(@clean_text)
  check_for_no_space_in_between_sentences(@clean_text)
  clean_consecutive_characters(@clean_text)
end
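The helper methods called by `clean` are not shown on this page, so their exact behavior is not documented here. As a minimal sketch of the general idea (an ordered pass of pattern replacements), the block below chains a handful of patterns from the constant summary with `String#gsub`. `SKETCH_STEPS` and `sketch_clean` are hypothetical names, and the step order and rule selection are assumptions; the real helpers may apply more rules or apply them conditionally, for example depending on `doc_type`.

```ruby
# Hypothetical, reduced pipeline: ordered [pattern, replacement] pairs
# copied from the constant summary. Order and selection are assumptions.
SKETCH_STEPS = [
  [/\n(?=[a-z])/, ' '],            # PDF_NewLineInMiddleOfSentenceNoSpacesRule
  [/\n\n/, "\r"],                  # DoubleNewLineRule
  [/\n/, "\r"],                    # ReplaceNewlineWithCarriageReturnRule
  [/''/, '"'],                     # QuotationsFirstRule
  [/``/, '"'],                     # QuotationsSecondRule
  [/(?<=[a-z])\.(?=[A-Z])/, '. ']  # NoSpaceBetweenSentencesRule
].freeze

def sketch_clean(text)
  return unless text # mirror the guard clause in #clean
  SKETCH_STEPS.reduce(text.dup) do |acc, (pattern, replacement)|
    acc.gsub(pattern, replacement)
  end
end

# A newline before a lowercase letter is treated as a mid-sentence break.
sketch_clean("This is a sentence\ncut off in the middle because pdf.")
# => "This is a sentence cut off in the middle because pdf."

# A blank line between paragraphs becomes a single carriage return.
sketch_clean("First paragraph.\n\nSecond paragraph.")
# => "First paragraph.\rSecond paragraph."

sketch_clean(nil)
# => nil
```

The first call reproduces the output of the documented example using only the PDF newline pattern, which suggests (but does not confirm) that this kind of rule is what joins the split sentence.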