Class: WordsCounted::Tokeniser

Inherits:
Object
Defined in:
lib/words_counted/tokeniser.rb

Constant Summary

TOKEN_REGEXP = /[\p{Alpha}\-']+/

The default tokenisation strategy.
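The default pattern treats hyphens and apostrophes as token characters, so contractions and hyphenated compounds survive as single tokens. A quick sketch of its behaviour on a plain Ruby string (the constant is reproduced here so the snippet stands alone):

```ruby
# The default pattern, reproduced from the constant above.
TOKEN_REGEXP = /[\p{Alpha}\-']+/

# Hyphens and apostrophes count as part of a token, so contractions
# and compound words are not split.
tokens = "I'm a self-taught developer".scan(TOKEN_REGEXP).map(&:downcase)
# => ["i'm", "a", "self-taught", "developer"]
```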

Instance Method Summary

Constructor Details

#initialize(input) ⇒ Tokeniser

Initialises state with the string to be tokenised.

Parameters:

  • input (String)

    The string to tokenise



# File 'lib/words_counted/tokeniser.rb', line 21

def initialize(input)
  @input = input
end

Instance Method Details

#tokenise(pattern: TOKEN_REGEXP, exclude: nil) ⇒ Array

Converts a string into an array of tokens using a regular expression. If no pattern is provided, the default Tokeniser::TOKEN_REGEXP is used.

Use exclude to remove tokens from the final list. exclude can be a string, a regular expression, a lambda, a symbol, or an array of one or more of those types. This allows for powerful and flexible tokenisation strategies.

If a symbol is passed, it must name a predicate method.

Examples:

Default tokenisation

WordsCounted::Tokeniser.new("Hello World").tokenise
# => ['hello', 'world']

With pattern

WordsCounted::Tokeniser.new("Hello-Mohamad").tokenise(pattern: /[^-]+/)
# => ['hello', 'mohamad']

With exclude as a string

WordsCounted::Tokeniser.new("Hello Sami").tokenise(exclude: "hello")
# => ['sami']

With exclude as a regexp

WordsCounted::Tokeniser.new("Hello Dani").tokenise(exclude: /hello/i)
# => ['dani']

With exclude as a lambda

WordsCounted::Tokeniser.new("Goodbye Sami").tokenise(
  exclude: ->(token) { token.length > 6 }
)
# => ['sami']

With exclude as a symbol

WordsCounted::Tokeniser.new("Hello محمد").tokenise(exclude: :ascii_only?)
# => ['محمد']

With exclude as an array of strings

WordsCounted::Tokeniser.new("Goodbye Sami and hello Dani").tokenise(
  exclude: ["goodbye hello"]
)
# => ['sami', 'and', 'dani']

With exclude as an array of regular expressions

WordsCounted::Tokeniser.new("Goodbye and hello Dani").tokenise(
  exclude: [/goodbye/i, /and/i]
)
# => ['hello', 'dani']

With exclude as an array of lambdas

t = WordsCounted::Tokeniser.new("Special Agent 007")
t.tokenise(
  exclude: [
    ->(t) { t.to_i.odd? },
    ->(t) { t.length > 5 }
  ]
)
# => ['agent']

With exclude as a mixed array

t = WordsCounted::Tokeniser.new("Hello! اسماءنا هي محمد، كارولينا، سامي، وداني")
t.tokenise(
  exclude: [
    :ascii_only?,
    /محمد/,
    ->(t) { t.length > 6 },
    "و"
  ]
)
# => ["هي", "سامي", "وداني"]

Parameters:

  • pattern (Regexp) (defaults to: TOKEN_REGEXP)

    The regular expression used to extract tokens from the input

  • exclude (Array<String, Regexp, Lambda, Symbol>, String, Regexp, Lambda, Symbol, nil) (defaults to: nil)

    The filter to apply

Returns:

  • (Array)

    The array of filtered tokens



# File 'lib/words_counted/tokeniser.rb', line 97

def tokenise(pattern: TOKEN_REGEXP, exclude: nil)
  filter_proc = filter_to_proc(exclude)
  @input.scan(pattern).map(&:downcase).reject { |token| filter_proc.call(token) }
end
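The body above delegates to a private filter_to_proc helper that normalises whatever was passed as exclude into a single proc. A minimal sketch of how such a normalisation could work, assuming the case-based dispatch the parameter docs describe (the names and details here are an illustration, not the gem's actual implementation):

```ruby
# Hypothetical sketch: normalise an exclude filter into one proc that
# returns truthy for tokens to reject. Not the gem's real code.
def filter_to_proc(filter)
  case filter
  when Array
    # Recursively normalise each element; reject a token if any matches.
    procs = filter.map { |f| filter_to_proc(f) }
    ->(token) { procs.any? { |p| p.call(token) } }
  when Regexp
    ->(token) { token.match?(filter) }
  when String
    # A string filter is itself split into words; those words are excluded.
    excluded = filter.split.map(&:downcase)
    ->(token) { excluded.include?(token) }
  when Proc
    filter
  when Symbol
    # A symbol names a predicate method called on each token.
    ->(token) { token.public_send(filter) }
  when nil
    # No filter: keep every token.
    ->(_token) { false }
  else
    raise ArgumentError, "Invalid filter: #{filter.inspect}"
  end
end
```

Recursing through the Array branch is what makes mixed arrays like `[:ascii_only?, /محمد/, "و"]` work: each element is normalised independently and a token is dropped if any of the resulting procs matches it.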