Module: PROIEL::Tokenization

Defined in:
lib/proiel/tokenization.rb

Constant Summary collapse

WORD_PATTERN =
/([^[\u{E000}-\u{F8FF}][[:word:]]]+)/.freeze

Class Method Summary collapse

Class Method Details

.is_splitable?(form) ⇒ true, false

Tests if a token form is splitable. Any form with more than one character is splitable.

Parameters:

  • form (String, nil)

    token form to Tests

Returns:

  • (true, false)

Raises:

  • (ArgumentError)


56
57
58
59
60
# File 'lib/proiel/tokenization.rb', line 56

def self.is_splitable?(form)
  raise ArgumentError, 'invalid form' unless form.is_a?(String) or form.nil?

  form and form.length > 1
end

.load_patterns(filename) ⇒ Hash

Loads tokenization patterns from a configuration file.

The configuration file should be a JSON file. The keys should be language tags and the values tokenization patterns.

The method can be called multiple times. On the first invocation patterns will be loaded, on subsequent invocations patterns will be updated. Only patterns for languages that are defined in the configuration file will be updated, other patterns will remain unchanged.

Parameters:

  • filename (String)

    name of tokenization pattern file

Returns:

  • (Hash)

    loaded patterns

Raises:

  • (ArgumentError)


22
23
24
25
26
27
28
29
30
31
# File 'lib/proiel/tokenization.rb', line 22

def self.load_patterns(filename)
  raise ArgumentError, 'invalid filename' unless filename.is_a?(String)

  patterns = JSON.parse(File.read(filename))

  regexes = patterns.map { |l, p| [l, make_regex(p)] }.to_h

  @@regexes ||= {}
  @@regexes.merge!(regexes)
end

.make_regex(pattern) ⇒ Regexp

Makes a regular expression from a pattern given in the configuration file.

The regular expression is to avoid partial matches. Multi-line matches are allowed in case characters that are interpreted as line separators occur in the data.

Parameters:

  • pattern (String)

    tokenization pattern

Returns:

  • (Regexp)

Raises:

  • (ArgumentError)


43
44
45
46
47
# File 'lib/proiel/tokenization.rb', line 43

def self.make_regex(pattern)
  raise ArgumentError, 'invalid pattern' unless pattern.is_a?(String)

  Regexp.new("^#{pattern}$", Regexp::MULTILINE)
end

.split_form(language_tag, form) ⇒ Array<String>

Splits a token form using the tokenization patterns that apply for a the specified language. Tokenization patterns must already have been loaded.

Parameters:

  • language_tag (String)

    ISO 639-3 tag for the language whose patterns should be used to split the token form

  • form (String)

    token form to split

Returns:

  • (Array<String>)

Raises:

  • (ArgumentError)


74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
# File 'lib/proiel/tokenization.rb', line 74

def self.split_form(language_tag, form)
  raise ArgumentError, 'invalid language tag' unless language_tag.is_a?(String)
  raise ArgumentError, 'invalid form' unless form.is_a?(String)

  if form[WORD_PATTERN]
    # Split on any non-word character like a space or punctuation
    form.split(WORD_PATTERN)
  elsif @@regexes.key?(language_tag) and form[@@regexes[language_tag]]
    # Apply language-specific pattern
    form.match(@@regexes[language_tag]).captures
  elsif form == ''
    ['']
  else
    # Give up and split by character
    form.split(/()/)
  end
end