Module: TextStat::DictionaryManager

Included in:
Main
Defined in:
lib/textstat/dictionary_manager.rb

Overview

Dictionary management with high-performance caching

This module handles loading and caching of language-specific dictionaries used for identifying difficult words. Each dictionary is read from disk once and then served from memory. In the example timings below, a cached call takes about 0.0013s versus about 0.047s for the first, disk-backed call, roughly a 36x speedup.

Examples:

Performance optimization

# First call loads dictionary from disk
TextStat.difficult_words(text, 'en_us')  # ~0.047s

# Subsequent calls use cached dictionary
TextStat.difficult_words(text, 'en_us')  # ~0.0013s (36x faster!)

# Check cache status
TextStat::DictionaryManager.cache_size        # => 1
TextStat::DictionaryManager.cached_languages  # => ['en_us']

Multi-language support

TextStat.difficult_words(english_text, 'en_us')
TextStat.difficult_words(spanish_text, 'es')
TextStat.difficult_words(french_text, 'fr')
TextStat::DictionaryManager.cache_size  # => 3
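
The timings quoted above will vary with the machine and the dictionary size. The following is a minimal sketch for measuring the cold and warm paths yourself. It assumes a stand-in cache and a generated temporary word list; DemoDictionaryCache and demo_cold_vs_warm are hypothetical names used for illustration and are not part of TextStat.

```ruby
require 'benchmark'
require 'set'
require 'tempfile'

# Hypothetical stand-in for the caching pattern; not the gem's own module.
module DemoDictionaryCache
  @cache = {}

  class << self
    # Read the word list from disk only on the first request for a path.
    def load(path)
      @cache[path] ||= Set.new(File.foreach(path, chomp: true))
    end

    def size
      @cache.size
    end
  end
end

# Write a throwaway word list, then time the first and second load.
def demo_cold_vs_warm(word_count = 50_000)
  Tempfile.create(['demo_dictionary', '.txt']) do |file|
    word_count.times { |i| file.puts("word#{i}") }
    file.flush

    cold = Benchmark.realtime { DemoDictionaryCache.load(file.path) }
    warm = Benchmark.realtime { DemoDictionaryCache.load(file.path) }
    { cold: cold, warm: warm }
  end
end

timings = demo_cold_vs_warm
puts format('cold: %.6fs, warm: %.6fs', timings[:cold], timings[:warm])
```

The warm call only performs a hash lookup, so the ratio between the two numbers grows with the size of the word list.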

Author:

  • Jakub Polak

Since:

  • 1.0.0

Class Attribute Details

.dictionary_cache ⇒ Object

Since:

  • 1.0.0



# File 'lib/textstat/dictionary_manager.rb', line 34

def dictionary_cache
  @dictionary_cache
end

Class Method Details

.cache_size ⇒ Integer

Get number of cached dictionaries

Examples:

TextStat::DictionaryManager.cache_size  # => 3

Returns:

  • (Integer)

    number of dictionaries currently in cache

Since:

  • 1.0.0



# File 'lib/textstat/dictionary_manager.rb', line 107

def cache_size
  @dictionary_cache.size
end

.cached_languages ⇒ Array<String>

Get list of cached languages

Examples:

TextStat::DictionaryManager.cached_languages  # => ['en_us', 'es', 'fr']

Returns:

  • (Array<String>)

    array of language codes currently in cache

Since:

  • 1.0.0



# File 'lib/textstat/dictionary_manager.rb', line 98

def cached_languages
  @dictionary_cache.keys
end

.clear_cache ⇒ Hash

Clear all cached dictionaries

Removes all dictionaries from memory cache. Useful for memory management in long-running applications or when switching between different sets of languages.

Returns:

  • (Hash)

    empty cache hash

Since:

  • 1.0.0



# File 'lib/textstat/dictionary_manager.rb', line 89

def clear_cache
  @dictionary_cache.clear
end
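
Because the listing above delegates to Hash#clear, the value it returns is the cache hash itself, emptied in place, rather than a new Hash. A short sketch of that standard Hash behavior, using a plain Hash as a stand-in for the module's cache (inspect_clear_result is a hypothetical helper name):

```ruby
require 'set'

# Returns [is_empty, is_same_object] for the value Hash#clear hands back.
def inspect_clear_result(cache)
  returned = cache.clear
  [returned.empty?, returned.equal?(cache)]
end

cache = { 'en_us' => Set['the', 'and'], 'es' => Set['el', 'y'] }
p inspect_clear_result(cache)  # prints [true, true]
```

Any code holding a reference obtained through dictionary_cache therefore sees the emptied cache as well.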

.dictionary_path ⇒ String

Get path to dictionary files

Examples:

TextStat::DictionaryManager.dictionary_path
# => "/path/to/gem/lib/dictionaries"

Returns:

  • (String)

    absolute path to dictionary directory

Since:

  • 1.0.0



# File 'lib/textstat/dictionary_manager.rb', line 117

def dictionary_path
  @dictionary_path ||= File.join(TextStat::GEM_PATH, 'lib', 'dictionaries')
end

.dictionary_path=(path) ⇒ String

Set dictionary path

Parameters:

  • path (String)

    path to dictionary directory

Returns:

  • (String)

    the newly assigned path

Since:

  • 1.0.0



# File 'lib/textstat/dictionary_manager.rb', line 40

def dictionary_path=(path)
  @dictionary_path = path
end
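
Judging from the listings on this page, the setter only assigns the path and does not touch the cache, so a language that is already cached keeps its old word set until the cache is cleared. The sketch below reproduces that interaction with a hypothetical stand-in module (DemoManager and demo_path_switch are illustration names, and the word lists are made up); it is not the gem's own code.

```ruby
require 'set'
require 'tmpdir'

# Hypothetical stand-in that mirrors the documented path and cache logic.
module DemoManager
  @cache = {}

  class << self
    attr_writer :path

    def load(language)
      return @cache[language] if @cache[language]

      file = File.join(@path, "#{language}.txt")
      words = Set.new
      File.foreach(file, chomp: true) { |line| words << line } if File.exist?(file)
      @cache[language] = words
    end

    def clear
      @cache.clear
    end
  end
end

def demo_path_switch
  Dir.mktmpdir do |first|
    Dir.mktmpdir do |second|
      File.write(File.join(first, 'en_us.txt'), "cat\n")
      File.write(File.join(second, 'en_us.txt'), "dog\n")

      DemoManager.clear
      DemoManager.path = first
      before = DemoManager.load('en_us').to_a  # read from the first directory

      DemoManager.path = second
      stale = DemoManager.load('en_us').to_a   # still the cached first-directory set

      DemoManager.clear
      fresh = DemoManager.load('en_us').to_a   # now read from the second directory
      [before, stale, fresh]
    end
  end
end

p demo_path_switch  # prints [["cat"], ["cat"], ["dog"]]
```

If the same holds for the real module, calling clear_cache after changing dictionary_path is the safe order of operations.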

.load_dictionary(language) ⇒ Set

Load dictionary with automatic caching

Loads a language-specific dictionary from disk and caches it in memory, so later calls for the same language skip the file read entirely. The file is read line by line with File.foreach rather than loaded into memory in one piece. If no dictionary file exists for the given language code, an empty Set is returned and cached.

Examples:

dict = TextStat::DictionaryManager.load_dictionary('en_us')
dict.include?('hello')  # => true
dict.include?('comprehensive')  # => false

Parameters:

  • language (String)

    language code (e.g., 'en_us', 'es', 'fr')

Returns:

  • (Set)

    set of easy words for the specified language

See Also:

  • #supported_languages

Since:

  • 1.0.0



# File 'lib/textstat/dictionary_manager.rb', line 58

def load_dictionary(language)
  # Return cached dictionary if available
  return @dictionary_cache[language] if @dictionary_cache[language]

  # Load dictionary from file
  dictionary_file = File.join(dictionary_path, "#{language}.txt")
  easy_words = Set.new

  if File.exist?(dictionary_file)
    # Use foreach for streaming - efficient and memory-friendly for large files
    File.foreach(dictionary_file, chomp: true) do |line|
      easy_words << line
    end
  end

  # Cache the loaded dictionary
  @dictionary_cache[language] = easy_words
  easy_words
end
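
The read step in the listing above can be tried in isolation. The sketch below repeats it in a hypothetical helper (read_word_set, with a made-up word list) to show two consequences of the code as listed: duplicate lines collapse into one Set entry, and a missing file yields an empty Set instead of raising an error.

```ruby
require 'set'
require 'tmpdir'

# Hypothetical helper repeating the read step from the listing above.
def read_word_set(path)
  words = Set.new
  File.foreach(path, chomp: true) { |line| words << line } if File.exist?(path)
  words
end

def demo_read_word_set
  Dir.mktmpdir do |dir|
    present = File.join(dir, 'en_us.txt')
    File.write(present, "hello\nworld\nhello\n")

    # The second path does not exist on disk.
    [read_word_set(present), read_word_set(File.join(dir, 'xx.txt'))]
  end
end

found, missing = demo_read_word_set
p found.to_a      # prints ["hello", "world"]
p missing.empty?  # prints true
```

An empty Set is truthy in Ruby, so the cache guard at the top of load_dictionary treats it as a hit on later calls. A mistyped language code therefore fails silently: every word is treated as absent from the easy-word list.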

Instance Method Details

#difficult_words(text, language = 'en_us', return_words = false) ⇒ Integer, Set

Count difficult words in text

Identifies words that are considered difficult based on:

  1. Not being in the language’s easy words dictionary

  2. Having more than one syllable

This method reuses the cached dictionary and the cached hyphenator, so repeated calls for the same language reload neither.
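
The two criteria can be sketched without the gem. The example below is an illustration under stated assumptions: a crude vowel-group counter (rough_syllables) stands in for the hyphenator, and DEMO_EASY_WORDS is a made-up easy-word list. Real results depend on the shipped dictionaries and hyphenation patterns.

```ruby
require 'set'

# Assumption: a crude vowel-group counter replaces the gem's hyphenator.
def rough_syllables(word)
  [word.scan(/[aeiouy]+/).size, 1].max
end

# Hypothetical easy-word list for the demonstration only.
DEMO_EASY_WORDS = Set['this', 'is', 'a', 'simple', 'test'].freeze

def demo_difficult_words(text)
  text.downcase.split.each_with_object(Set.new) do |word, difficult|
    next if DEMO_EASY_WORDS.include?(word)           # criterion 1: not an easy word
    difficult << word if rough_syllables(word) > 1   # criterion 2: multi-syllable
  end
end

p demo_difficult_words('This is a simple but comprehensive test').to_a
# prints ["comprehensive"]
```

Here "but" is not in the easy-word list, yet it is skipped because it has only one syllable; a word must satisfy both criteria to count.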

Examples:

Count difficult words

TextStat.difficult_words("This is a comprehensive analysis")  # => 2

Get list of difficult words

words = TextStat.difficult_words("comprehensive analysis", 'en_us', true)
words.to_a  # => ["comprehensive", "analysis"]

Multi-language support

TextStat.difficult_words(spanish_text, 'es')  # Spanish dictionary
TextStat.difficult_words(french_text, 'fr')   # French dictionary

Parameters:

  • text (String)

    the text to analyze

  • language (String) (defaults to: 'en_us')

    language code for dictionary selection

  • return_words (Boolean) (defaults to: false)

    whether to return the set of difficult words instead of the count

Returns:

  • (Integer, Set)

    number of difficult words or set of difficult words

Since:

  • 1.0.0



# File 'lib/textstat/dictionary_manager.rb', line 144

def difficult_words(text, language = 'en_us', return_words = false)
  easy_words = DictionaryManager.load_dictionary(language)

  # Clean and split text once
  text_list = text.downcase.gsub(/[^0-9a-z ]/i, '').split
  return return_words ? Set.new : 0 if text_list.empty?

  # Get cached hyphenator for syllable counting
  hyphenator = BasicStats.get_hyphenator(language)
  diff_words_set = Set.new

  # Process each word once
  text_list.each do |word|
    next if easy_words.include?(word)

    # Count syllables inline using cached hyphenator
    word_hyphenated = hyphenator.visualise(word)
    syllables = word_hyphenated.count('-') + 1
    diff_words_set.add(word) if syllables > 1
  end

  return_words ? diff_words_set : diff_words_set.length
end
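
Two details of the listing above are easy to check in isolation: the cleaning expression and the hyphen-based syllable count. The sketch below wraps each in a hypothetical helper (demo_tokenize and syllables_from_hyphenated). The hyphenated string is hand-written for illustration and is not actual hyphenator output.

```ruby
# Same cleaning expression as in the listing above.
def demo_tokenize(text)
  text.downcase.gsub(/[^0-9a-z ]/i, '').split
end

# Same syllable estimate as in the listing: hyphen count plus one.
def syllables_from_hyphenated(hyphenated)
  hyphenated.count('-') + 1
end

p demo_tokenize("Hello, World! It's 2024.")  # prints ["hello", "world", "its", "2024"]
p demo_tokenize('El niño comió')             # prints ["el", "nio", "comi"]
p demo_tokenize("line one\nline two")        # prints ["line", "oneline", "two"]

p syllables_from_hyphenated('com-pre-hen-sive')  # prints 4
```

The expression keeps only ASCII letters, digits, and spaces. Accented letters are dropped before the dictionary lookup, and newlines or tabs are deleted rather than treated as separators, which joins the words on either side. Replacing line breaks with spaces before calling the method avoids the second effect.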