Top Level Namespace

Defined Under Namespace

Modules: Rblines

Constant Summary

TOKENIZER = This regular expression matches either a run of characters that are neither parentheses nor whitespace (spaces, tabs, line breaks), or a single parenthesis or punctuation mark (.?!-), together with any whitespace that follows. Breaking it down further:

`( )` is a capturing group.
`(?: )` is a non-capturing group: it groups the pattern without capturing the matched text.
`[^()\s]+` matches one or more characters that are not parentheses or whitespace.
`|` separates the two alternatives.
`[().?!-]` matches a single parenthesis or punctuation mark (.?!-).
`\s*` matches zero or more whitespace characters (spaces, tabs, or line breaks) that follow either alternative.

/((?:[^()\s]+|[().?!-])\s*)/
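To make the breakdown above concrete, here is a minimal sketch that applies the pattern to a few short strings. The sample strings and the local variable names are illustrative assumptions; only the pattern itself is taken from this page.

```ruby
# Pattern copied from the constant listing on this page.
TOKENIZER = /((?:[^()\s]+|[().?!-])\s*)/

# A run of ordinary characters is captured together with its trailing whitespace.
word = TOKENIZER.match("done.  next")[1]
p word   # => "done.  "

# An opening parenthesis is captured on its own.
open_paren = TOKENIZER.match("(note) here")[1]
p open_paren   # => "("

# A closing parenthesis also takes the whitespace that follows it.
close_paren = TOKENIZER.match(") here")[1]
p close_paren   # => ") "
```

Note that the period in "done." stays attached to the word: the first alternative excludes only parentheses and whitespace, so it consumes punctuation as well.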

PARAGRAPH_PATTERN = This pattern matches one or more newline characters (`\n`), along with any spaces that follow each of them. It is used to split the text into paragraphs. `(?:\n *)` is a non-capturing group that matches a `\n` followed by zero or more spaces. `((?:\n *)+)` captures that group repeated one or more times.

/((?:\n *)+)/
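Because the whole pattern sits inside a capturing group, `String#split` returns the matched separators alongside the text pieces. The sketch below illustrates this on a made-up sample string; the variable names are assumptions for illustration.

```ruby
# Pattern copied from the constant listing on this page.
PARAGRAPH_PATTERN = /((?:\n *)+)/

text = "First\n  \n\nSecond\nThird"

# With a capturing group in the pattern, split keeps each separator
# as its own element between the pieces of text.
parts = text.split(PARAGRAPH_PATTERN)
p parts   # => ["First", "\n  \n\n", "Second", "\n", "Third"]

# The separators consist only of whitespace, so stripping every element
# and dropping the empty ones leaves just the paragraph text.
paragraphs = parts.map(&:strip).reject(&:empty?)
p paragraphs   # => ["First", "Second", "Third"]
```

The first separator shows the repetition at work: three newlines with spaces between them are consumed as a single match.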

SPACE_PATTERN =

/(\s+)/

Instance Method Summary

#concatenate_paragraphs_and_add_chr182(text) ⇒ String

Split paragraphs and concatenate them.
#split_paragraphs(text) ⇒ Array<String>

Splits a string into a list of paragraphs.
#tokenize_text(text) ⇒ Array<String>

Tokenizes the text based on the TOKENIZER pattern.

Instance Method Details

#concatenate_paragraphs_and_add_chr182(text) ⇒ `String`

Splits the text into paragraphs and joins them with " ¶ " (the pilcrow, character code 182, with a space on each side). For example, if the text is "Hello\nWorld\nThis is a test", the result will be: "Hello ¶ World ¶ This is a test"

Parameters:

text (String) —

the text to split

Returns:

(String) —

a string with paragraphs separated by ‘¶’



# File 'lib/rblines/redlines.rb', line 53

def concatenate_paragraphs_and_add_chr182(text)
  split_paragraphs(text).join(" ¶ ")
end
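A short usage sketch, with the two method bodies and the pattern reproduced from this page so that the snippet runs on its own. The input string is an assumption chosen to include a blank line and a trailing newline.

```ruby
# Definitions reproduced from this page so the example is self-contained.
PARAGRAPH_PATTERN = /((?:\n *)+)/

def split_paragraphs(text)
  text.split(PARAGRAPH_PATTERN)
    .map(&:strip)
    .reject(&:empty?)
end

def concatenate_paragraphs_and_add_chr182(text)
  split_paragraphs(text).join(" ¶ ")
end

# Hypothetical input: a single newline, a blank line, and a trailing newline.
result = concatenate_paragraphs_and_add_chr182("Hello\nWorld\n\nThis is a test\n")
puts result   # prints: Hello ¶ World ¶ This is a test
```

Runs of newlines collapse into a single marker, and the trailing newline produces no marker at the end because empty pieces are rejected before joining.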

#split_paragraphs(text) ⇒ `Array<String>`

Splits a string into a list of paragraphs. One or more `\n` characters separate the paragraphs. For example, if the text is "Hello\nWorld\nThis is a test", the result will be:

["Hello", "World", "This is a test"]

Parameters:

text (String) —

the text to split

Returns:

(Array<String>) —

a list of paragraphs

# File 'lib/rblines/redlines.rb', line 41

def split_paragraphs(text)
  text.split(PARAGRAPH_PATTERN)
    .map(&:strip)
    .reject(&:empty?)
end

#tokenize_text(text) ⇒ `Array<String>`

Tokenizes the text based on the TOKENIZER pattern.

Parameters:

text (String) —

the text to be tokenized

Returns:

(Array<String>) —

an array of tokenized words



# File 'lib/rblines/redlines.rb', line 31

def tokenize_text(text)
  text.scan(TOKENIZER).flatten
end
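The following sketch shows what the method returns for a sample sentence. The method body and pattern are reproduced from this page; the sample text is an assumption. Since every token carries its own trailing whitespace, joining the tokens reproduces the input, except for leading whitespace, which no token can start with.

```ruby
# Definitions reproduced from this page so the example is self-contained.
TOKENIZER = /((?:[^()\s]+|[().?!-])\s*)/

def tokenize_text(text)
  text.scan(TOKENIZER).flatten
end

text = "Wait (really?) - yes!"
tokens = tokenize_text(text)
p tokens   # => ["Wait ", "(", "really?", ") ", "- ", "yes!"]

# Joining the tokens gives back the original text.
p tokens.join == text   # => true

# Leading whitespace is skipped, because neither alternative can start on it.
indented = tokenize_text("  indented")
p indented   # => ["indented"]
```

`String#scan` returns one single-element array per match when the pattern has one capturing group, which is why the method calls `flatten`.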
