Top Level Namespace

Defined Under Namespace

Modules: Rblines

Constant Summary

TOKENIZER =

This regular expression matches one token at a time. A token is either a run of one or more characters that are neither parentheses nor whitespace (spaces, tabs, and line breaks), or a single parenthesis or punctuation mark (.?!-). In both cases, any whitespace that follows is included in the token. Breaking it down further:

  • ( and ) indicate a capturing group

  • (?: ) is a non-capturing group, meaning it matches the pattern but doesn’t capture the matched text

  • [^()\s]+ matches one or more characters that are not parentheses or whitespace characters

  • | indicates an alternative pattern

  • [().?!-] matches a single character that is a parenthesis or punctuation mark (.?!-)

  • \s* matches zero or more whitespace characters (spaces, tabs, or line breaks) that follow the previous pattern.

/((?:[^()\s]+|[().?!-])\s*)/
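
The breakdown above can be tried out with a short sketch. The sample sentence is made up for illustration; the pattern is copied from the constant. In this sample the punctuation marks are consumed by the first alternative, because [^()\s] excludes only parentheses and whitespace, so only the parentheses fall through to [().?!-].

```ruby
# Pattern copied from the TOKENIZER constant.
TOKENIZER = /((?:[^()\s]+|[().?!-])\s*)/

# Hypothetical sample input, not taken from the library.
sample = "Wait (really)? Yes - done."

# With a capturing group, String#scan returns one sub-array per match.
raw = sample.scan(TOKENIZER)

# Flattening turns the nested result into a plain list of tokens.
tokens = raw.flatten

p raw.first  # ["Wait "]
p tokens     # ["Wait ", "(", "really", ")", "? ", "Yes ", "- ", "done."]
```

Each token keeps its trailing whitespace, which is why "Wait " and "? " end with a space while "(" and "really" do not.
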
PARAGRAPH_PATTERN =

This pattern matches one or more newline characters (\n), together with any spaces that follow each newline. It is used to split the text into paragraphs.

  • (?:\n *) is a non-capturing group that matches a \n followed by zero or more spaces.

  • ((?:\n *)+) is the previous non-capturing group repeated one or more times, wrapped in a capturing group.

/((?:\n *)+)/
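
As a rough illustration of how this pattern behaves with String#split, the sketch below uses a made-up input string. Because the pattern contains a capturing group, split also returns the matched separators, which may be why the result is stripped and filtered afterwards.

```ruby
# Pattern copied from the PARAGRAPH_PATTERN constant.
PARAGRAPH_PATTERN = /((?:\n *)+)/

# Hypothetical input: a whitespace-only line between the first two paragraphs.
text = "Hello\n  \nWorld\nThis is a test"

# String#split includes captured separators in its result.
parts = text.split(PARAGRAPH_PATTERN)
p parts  # ["Hello", "\n  \n", "World", "\n", "This is a test"]

# Stripping each part and dropping the empty ones leaves only the paragraphs.
paragraphs = parts.map(&:strip).reject(&:empty?)
p paragraphs  # ["Hello", "World", "This is a test"]
```
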
SPACE_PATTERN =
/(\s+)/

Instance Method Summary

Instance Method Details

#concatenate_paragraphs_and_add_chr182(text) ⇒ String

Splits the text into paragraphs and joins them with a ‘¶’ character (code point 182) between paragraphs. For example, if the text is “Hello\nWorld\nThis is a test”, the result will be: “Hello¶World¶This is a test”

Parameters:

  • text (String)

    the text to split

Returns:

  • (String)

    a string with paragraphs separated by ‘¶’



# File 'lib/rblines/redlines.rb', line 53

def concatenate_paragraphs_and_add_chr182(text)
  # Join with the pilcrow (code point 182) so the result matches the documented output.
  split_paragraphs(text).join("¶")
end
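
A minimal sketch of the documented behaviour follows. It assumes the separator passed to join is the ‘¶’ character named in the description; the method bodies are restated here only so the snippet runs on its own.

```ruby
PARAGRAPH_PATTERN = /((?:\n *)+)/

def split_paragraphs(text)
  text.split(PARAGRAPH_PATTERN).map(&:strip).reject(&:empty?)
end

# Assumption: paragraphs are joined with "¶" (code point 182), as the description states.
def concatenate_paragraphs_and_add_chr182(text)
  split_paragraphs(text).join("¶")
end

result = concatenate_paragraphs_and_add_chr182("Hello\nWorld\nThis is a test")
puts result  # Hello¶World¶This is a test

# The separator is the character with code point 182.
puts "¶".ord  # 182
```
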

#split_paragraphs(text) ⇒ Array<String>

Splits a string into a list of paragraphs. Paragraphs are separated by one or more \n characters. For example, if the text is “Hello\nWorld\nThis is a test”, the result will be:

[“Hello”, “World”, “This is a test”]

Parameters:

  • text (String)

    the text to split

Returns:

  • (Array<String>)

    a list of paragraphs



# File 'lib/rblines/redlines.rb', line 41

def split_paragraphs(text)
  text.split(PARAGRAPH_PATTERN)
    .map(&:strip)
    .reject(&:empty?)
end
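
The following sketch exercises a few edge cases with made-up inputs: leading and trailing newlines, consecutive blank lines, indented paragraphs, and an empty string. The method is restated so the snippet is self-contained.

```ruby
PARAGRAPH_PATTERN = /((?:\n *)+)/

def split_paragraphs(text)
  text.split(PARAGRAPH_PATTERN)
    .map(&:strip)
    .reject(&:empty?)
end

# Blank lines and surrounding whitespace do not produce empty paragraphs.
messy = "\n\nFirst line\n\n\n   Second line  \n"
p split_paragraphs(messy)  # ["First line", "Second line"]

# An empty string yields an empty list.
p split_paragraphs("")     # []
```
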

#tokenize_text(text) ⇒ Array<String>

Tokenizes the text based on the TOKENIZER pattern.

Parameters:

  • text (String)

    the text to be tokenized

Returns:

  • (Array<String>)

    an array of tokens, each including any whitespace that follows it



# File 'lib/rblines/redlines.rb', line 31

def tokenize_text(text)
  text.scan(TOKENIZER).flatten
end
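
One property worth noting, sketched below with made-up inputs: because every token carries its trailing whitespace, joining the tokens reproduces the input when the text does not begin with whitespace. Leading whitespace is not part of any match and is therefore dropped.

```ruby
TOKENIZER = /((?:[^()\s]+|[().?!-])\s*)/

def tokenize_text(text)
  text.scan(TOKENIZER).flatten
end

text = "Hello world (test)."
tokens = tokenize_text(text)
p tokens               # ["Hello ", "world ", "(", "test", ")", "."]

# Joining the tokens gives back the original string.
p tokens.join == text  # true

# Leading whitespace is skipped by scan, so it does not survive a round trip.
p tokenize_text("  hi")  # ["hi"]
```
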