Class: PragmaticSegmenter::Cleaner

Inherits: Object

Includes: Rules

Defined in: lib/pragmatic_segmenter/cleaner.rb

Overview

An opinionated class that removes errant newlines, XHTML markup, inline formatting markers, and similar artifacts from text.

Direct Known Subclasses

Constant Summary

URL_EMAIL_KEYWORDS =

['@', 'http', '.com', 'net', 'www', '//']

NO_SPACE_BETWEEN_SENTENCES_REGEX = Rubular: rubular.com/r/6dt98uI76u

/(?<=[a-z])\.(?=[A-Z])/

NO_SPACE_BETWEEN_SENTENCES_DIGIT_REGEX = Rubular: rubular.com/r/l6KN6rH5XE

/(?<=\d)\.(?=[A-Z])/

NewLineInMiddleOfWordRule = Rubular: rubular.com/r/V57WnM9Zut

Rule.new(/\n(?=[a-zA-Z]{1,2}\n)/, '')
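As a rough illustration of what this pattern targets, the sketch below applies the same regular expression with plain `String#gsub`. The helper name `rejoin_split_word` is hypothetical, and `gsub` is only a stand-in for however `Rule` objects are applied inside the library.

```ruby
# Hypothetical helper (not part of the library): removes a newline when the
# following line holds only one or two letters, which usually means a word
# was split across lines.
def rejoin_split_word(str)
  str.gsub(/\n(?=[a-zA-Z]{1,2}\n)/, '')
end

puts rejoin_split_word("conversati\non\nnext line").inspect
# => "conversation\nnext line"
```

The second newline survives because the line after it is longer than two letters.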

NEWLINE_IN_MIDDLE_OF_SENTENCE_REGEX = Rubular: rubular.com/r/3GiRiP2IbD

/(?<=\s)\n(?=([a-z]|\())/

PDF_NewLineInMiddleOfSentenceRule = Rubular: rubular.com/r/UZAVcwqck8

Rule.new(/(?<=[^\n]\s)\n(?=\S)/, '')

PDF_NewLineInMiddleOfSentenceNoSpacesRule = Rubular: rubular.com/r/eaNwGavmdo

Rule.new(/\n(?=[a-z])/, ' ')
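The two PDF patterns above appear to complement each other: one drops a newline when whitespace already precedes it, the other turns a bare newline before a lowercase letter into a space. A minimal sketch, assuming the rules are applied in that order and using `String#gsub` in place of the library's `Rule` mechanics (the helper name `join_pdf_lines` is hypothetical):

```ruby
# Hypothetical helper (not part of the library) that chains the two PDF
# patterns listed above using plain gsub.
def join_pdf_lines(str)
  str.gsub(/(?<=[^\n]\s)\n(?=\S)/, '') # whitespace already precedes the break: drop the newline
     .gsub(/\n(?=[a-z])/, ' ')         # no whitespace before the break: newline becomes a space
end

puts join_pdf_lines("This is a sentence\ncut off in the middle because pdf.")
# => This is a sentence cut off in the middle because pdf.
```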

InlineFormattingRule = Rubular: rubular.com/r/bAJrhyLNeZ

Rule.new(/\{b\^&gt;\d*&lt;b\^\}|\{b\^>\d*<b\^\}/, '')

DoubleNewLineWithSpaceRule = Rubular: rubular.com/r/dMxp5MixFS

Rule.new(/\n \n/, "\r")

DoubleNewLineRule = Rubular: rubular.com/r/H6HOJeA8bq

Rule.new(/\n\n/, "\r")

NewLineFollowedByBulletRule = Rubular: rubular.com/r/Gn18aAnLdZ

Rule.new(/\n(?=•)/, "\r")
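The three rules above all map a paragraph-like break to a single carriage return. The following sketch shows the combined effect with plain `String#gsub`; the helper name `mark_paragraph_breaks` and the order of application are assumptions made for illustration.

```ruby
# Hypothetical helper (not part of the library): collapses blank-line breaks
# and newline-before-bullet breaks into a single "\r" marker.
def mark_paragraph_breaks(str)
  str.gsub(/\n \n/, "\r")   # blank line that contains a single space
     .gsub(/\n\n/, "\r")    # ordinary blank line
     .gsub(/\n(?=•)/, "\r") # newline directly before a bullet
end

puts mark_paragraph_breaks("Para one.\n\nPara two.\n \nPara three.\n• item").inspect
# => "Para one.\rPara two.\rPara three.\r• item"
```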

NewLineFollowedByPeriodRule = Rubular: rubular.com/r/FseyMiiYFT

Rule.new(/\n(?=\.(\s|\n))/, '')

TableOfContentsRule = Rubular: rubular.com/r/8mc1ArOIGy

Rule.new(/\.{5,}\s*\d+-*\d*/, "\r")

ConsecutivePeriodsRule = Rubular: rubular.com/r/DwNSuZrNtk

Rule.new(/\.{5,}/, ' ')
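To make the difference between these two patterns concrete, the sketch below runs both over short strings. It assumes the table-of-contents pattern is tried first, since it is the more specific of the two; `strip_dot_leaders` is a hypothetical name and `gsub` stands in for the library's rule application.

```ruby
# Hypothetical helper (not part of the library): handles dot leaders.
# A run of 5+ periods followed by a page number or page range becomes "\r";
# any remaining run of 5+ periods becomes a single space.
def strip_dot_leaders(str)
  str.gsub(/\.{5,}\s*\d+-*\d*/, "\r")
     .gsub(/\.{5,}/, ' ')
end

puts strip_dot_leaders("Chapter 1.......... 12").inspect
# => "Chapter 1\r"
puts strip_dot_leaders("Wait.....what").inspect
# => "Wait what"
```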

ConsecutiveForwardSlashRule = Rubular: rubular.com/r/IQ4TPfsbd8

Rule.new(/\/{3}/, '')

NoSpaceBetweenSentencesRule = Rubular: rubular.com/r/6dt98uI76u

Rule.new(NO_SPACE_BETWEEN_SENTENCES_REGEX, '. ')

NoSpaceBetweenSentencesDigitRule = Rubular: rubular.com/r/l6KN6rH5XE

Rule.new(NO_SPACE_BETWEEN_SENTENCES_DIGIT_REGEX, '. ')
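A short sketch of how these two rules might be used together follows. It processes the text word by word and skips any word containing one of the `URL_EMAIL_KEYWORDS` entries, so that URLs and e-mail addresses are not broken apart; that guard is an assumption about how the keyword list is meant to be used, and both helper names are hypothetical.

```ruby
# Hypothetical helpers (not part of the library).

# True when the word contains any of the URL_EMAIL_KEYWORDS entries.
def url_or_email_word?(word)
  ['@', 'http', '.com', 'net', 'www', '//'].any? { |keyword| word.include?(keyword) }
end

# Inserts a space after a period that sits directly between a lowercase
# letter or digit and an uppercase letter, leaving URL-like words alone.
# Note: split/join also collapses runs of whitespace into single spaces.
def add_missing_sentence_space(str)
  str.split(' ').map do |word|
    next word if url_or_email_word?(word)

    word.gsub(/(?<=[a-z])\.(?=[A-Z])/, '. ')
        .gsub(/(?<=\d)\.(?=[A-Z])/, '. ')
  end.join(' ')
end

puts add_missing_sentence_space("Hello world.Today is 2024.It rains. See www.Example.com")
# => Hello world. Today is 2024. It rains. See www.Example.com
```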

EscapedCarriageReturnRule =

Rule.new(/\\r/, "\r")

TypoEscapedCarriageReturnRule =

Rule.new(/\\\ r/, "\r")

EscapedNewLineRule =

Rule.new(/\\n/, "\n")

TypoEscapedNewLineRule =

Rule.new(/\\\ n/, "\n")
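The four escape rules above match a literal backslash followed by `r` or `n`, with the "typo" variants allowing a stray space after the backslash. A minimal sketch using plain `String#gsub` (the helper name `unescape_line_breaks` is hypothetical):

```ruby
# Hypothetical helper (not part of the library): converts literal backslash
# sequences into real control characters, including the variants with a
# space between the backslash and the letter.
def unescape_line_breaks(str)
  str.gsub(/\\r/, "\r")
     .gsub(/\\\ r/, "\r")
     .gsub(/\\n/, "\n")
     .gsub(/\\\ n/, "\n")
end

# Single quotes keep the backslashes literal in the input.
puts unescape_line_breaks('one\ntwo\ nthree').inspect
# => "one\ntwo\nthree"
```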

ReplaceNewlineWithCarriageReturnRule =

Rule.new(/\n/, "\r")

QuotationsFirstRule =

Rule.new(/''/, '"')

QuotationsSecondRule =

Rule.new(/``/, '"')
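Both quotation rules rewrite TeX-style paired quote marks as a plain double quote. A small sketch with `String#gsub`, where `normalize_quotes` is a hypothetical helper name:

```ruby
# Hypothetical helper (not part of the library): replaces '' and `` with ".
def normalize_quotes(str)
  str.gsub(/''/, '"').gsub(/``/, '"')
end

puts normalize_quotes("``Hello,'' she said.")
# => "Hello," she said.
```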

Constants included from Rules

Rules::AbbreviationsWithMultiplePeriodsAndEmailRule, Rules::ExtraWhiteSpaceRule, Rules::GeoLocationRule, Rules::QuestionMarkInQuotationRule, Rules::SingleNewLineRule, Rules::SubSingleQuoteRule

Instance Attribute Summary

#doc_type ⇒ Object readonly

Returns the value of attribute doc_type.
#text ⇒ Object readonly

Returns the value of attribute text.

Instance Method Summary

#clean ⇒ Object

Clean text of unwanted formatting.
#initialize(text:, **args) ⇒ Cleaner constructor

A new instance of Cleaner.

Constructor Details

#initialize(text:, **args) ⇒ `Cleaner`

Returns a new instance of Cleaner.

# File 'lib/pragmatic_segmenter/cleaner.rb', line 82

def initialize(text:, **args)
  @text = Text.new(text.dup)
  @doc_type = args[:doc_type]
end

Instance Attribute Details

#doc_type ⇒ `Object` (readonly)

Returns the value of attribute doc_type.




# File 'lib/pragmatic_segmenter/cleaner.rb', line 81

def doc_type
  @doc_type
end

#text ⇒ `Object` (readonly)

Returns the value of attribute text.




# File 'lib/pragmatic_segmenter/cleaner.rb', line 81

def text
  @text
end

Instance Method Details

#clean ⇒ `Object`

Clean text of unwanted formatting

Example:

>> text = "This is a sentence\ncut off in the middle because pdf."
>> PragmaticSegmenter::Cleaner.new(text: text).clean
=> "This is a sentence cut off in the middle because pdf."

Arguments (passed to the constructor):

text:       (String)  *required
language:   (String)  *optional
            (two-letter ISO 639-1 code e.g. 'en')
doc_type:   (String)  *optional
            (e.g. 'pdf')

# File 'lib/pragmatic_segmenter/cleaner.rb', line 101

def clean
  return unless text
  @clean_text = remove_all_newlines(text)
  replace_double_newlines(@clean_text)
  replace_newlines(@clean_text)
  replace_escaped_newlines(@clean_text)
  @clean_text.apply(HtmlRules::All)
  replace_punctuation_in_brackets(@clean_text)
  @clean_text.apply(InlineFormattingRule)
  clean_quotations(@clean_text)
  clean_table_of_contents(@clean_text)
  check_for_no_space_in_between_sentences(@clean_text)
  clean_consecutive_characters(@clean_text)
end
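The method above is essentially an ordered pipeline of substitutions. The sketch below imitates that shape with a few of the patterns from the constant summary, applied in sequence through `String#gsub`. It is not the library's implementation: `apply_rule_pairs` and `sketch_clean` are hypothetical names, the HTML and bracket steps are omitted, and only a subset of rules is included.

```ruby
# Hypothetical helpers (not part of the library).

# Applies [pattern, replacement] pairs to the string, one after another.
def apply_rule_pairs(str, pairs)
  pairs.reduce(str) { |acc, (pattern, replacement)| acc.gsub(pattern, replacement) }
end

# A reduced cleaning pipeline, loosely following the order used by #clean.
def sketch_clean(str)
  apply_rule_pairs(str, [
    [/\n \n/, "\r"],                                     # DoubleNewLineWithSpaceRule
    [/\n\n/, "\r"],                                      # DoubleNewLineRule
    [/\{b\^&gt;\d*&lt;b\^\}|\{b\^>\d*<b\^\}/, ''],       # InlineFormattingRule
    [/''/, '"'],                                         # QuotationsFirstRule
    [/``/, '"'],                                         # QuotationsSecondRule
    [/\.{5,}\s*\d+-*\d*/, "\r"],                         # TableOfContentsRule
    [/(?<=[a-z])\.(?=[A-Z])/, '. '],                     # NoSpaceBetweenSentencesRule
    [/\/{3}/, '']                                        # ConsecutiveForwardSlashRule
  ])
end

puts sketch_clean("``Title''\n\nFirst point.Second point{b^>12<b^}.").inspect
# => "\"Title\"\rFirst point. Second point."
```

Keeping the rules as data makes the order explicit, which matters here because later patterns operate on the output of earlier ones.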