Module: TextStat::BasicStats

Included in:
Main
Defined in:
lib/textstat/basic_stats.rb

Overview

Basic text statistics calculations

This module provides fundamental text analysis methods such as counting characters, words, syllables, and sentences. These statistics form the foundation for more advanced readability calculations.

Examples:

Basic usage

text = "Hello world! This is a test."
TextStat.char_count(text)          # => 23
TextStat.lexicon_count(text)       # => 6
TextStat.syllable_count(text)      # => 6
TextStat.sentence_count(text)      # => 2

Since:

  • 1.0.0

Constant Summary

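NON_ALPHA_REGEX = /[^a-zA-Z\s]/.freeze

Matches any character that is not an ASCII letter or whitespace. Frozen regex constant to avoid recompilation overhead.

Since:

  • 1.0.0

SENTENCE_BOUNDARY_REGEX = /[.?!]['\\)\]]*[ |\n][A-Z]/.freeze

Matches a sentence boundary: terminal punctuation (., ?, or !), any run of ', \, ), or ] characters, one separator character, then an uppercase ASCII letter. The separator class [ |\n] accepts a space, a newline, or a literal |.

Since:

  • 1.0.0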
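To make the two patterns concrete, the following sketch applies local copies of them to a sample string. The patterns are held in local variables so the snippet runs without the gem, and the sample text is made up for illustration.

```ruby
# Local copies of the two documented patterns, so this runs standalone.
non_alpha = /[^a-zA-Z\s]/
boundary  = /[.?!]['\\)\]]*[ |\n][A-Z]/

sample = "It's done. (Really!) Next: 3 items"

# Everything except ASCII letters and whitespace is removed;
# digits go too, which leaves a double space before "items".
stripped = sample.gsub(non_alpha, '')

# Each match is terminal punctuation, optional closers, one separator,
# and one capital letter. ". (" does not match because "(" is not a capital.
boundaries = sample.scan(boundary)

puts stripped      # Its done Really Next  items
p boundaries       # ["!) N"]
```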

Class Attribute Details

.hyphenator_cache ⇒ Object

Since:

  • 1.0.0



# File 'lib/textstat/basic_stats.rb', line 25

def hyphenator_cache
  @hyphenator_cache
end

Class Method Details

.clear_hyphenator_cache ⇒ Hash

Clear all cached hyphenators

Returns:

  • (Hash)

    empty cache

Since:

  • 1.0.0



# File 'lib/textstat/basic_stats.rb', line 40

def clear_hyphenator_cache
  @hyphenator_cache.clear
end

.get_hyphenator(language) ⇒ Text::Hyphen

Get or create a cached Text::Hyphen instance for the specified language

Parameters:

  • language (String)

    language code

Returns:

  • (Text::Hyphen)

    cached hyphenator instance

Since:

  • 1.0.0



# File 'lib/textstat/basic_stats.rb', line 32

def get_hyphenator(language)
  @hyphenator_cache[language] ||= Text::Hyphen.new(language: language, left: 0, right: 0)
end
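The `||=` lookup above is a memoization pattern: build an object the first time a key is requested, then reuse it. A minimal sketch of that pattern follows. It assumes the cache is a Hash (its initialization is not shown in this listing) and uses a hypothetical Struct in place of Text::Hyphen so it runs without the gem.

```ruby
# Hypothetical stand-in for Text::Hyphen, used only to keep the sketch standalone.
fake_hyphenator = Struct.new(:language)

cache = {}
build_count = 0

# Same shape as get_hyphenator: build on first request, reuse afterwards.
fetch = lambda do |language|
  cache[language] ||= begin
    build_count += 1
    fake_hyphenator.new(language)
  end
end

first  = fetch.call('en_us')
second = fetch.call('en_us')
fetch.call('fr')

puts first.equal?(second)  # true: the same object is returned both times
puts build_count           # 2: one build per distinct language
```

Because the cache is keyed by language code alone, the construction options are fixed for every caller of a given language.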

Instance Method Details

#avg_letter_per_word(text) ⇒ Float

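Calculate average characters per word, rounded to two decimal places. The total comes from char_count, so punctuation and digits are included along with letters.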

Examples:

TextStat.avg_letter_per_word("hello world")  # => 5.0

Parameters:

  • text (String)

    the text to analyze

Returns:

  • (Float)

    average number of letters per word

Since:

  • 1.0.0



# File 'lib/textstat/basic_stats.rb', line 147

def avg_letter_per_word(text)
  letters_per_word = char_count(text).to_f / lexicon_count(text)
  letters_per_word.round(2)
rescue ZeroDivisionError
  0.0
end
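The listing converts the numerator with to_f before dividing. In Ruby, Float division by zero returns NaN or Infinity instead of raising ZeroDivisionError (only Integer division raises it), so checking the denominator up front may be a more direct way to express the zero-word case. The sketch below shows that guard; safe_average is a hypothetical helper, not part of TextStat.

```ruby
# Hypothetical helper (not part of TextStat): average with an explicit zero guard.
def safe_average(total, count, digits = 2)
  return 0.0 if count.zero?

  (total.to_f / count).round(digits)
end

puts safe_average(10, 2)         # 5.0
puts safe_average(0, 0)          # 0.0
puts((0.0 / 0).nan?)             # true: Float division by zero does not raise
puts((1.0 / 0) == Float::INFINITY) # true
```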

#avg_sentence_length(text) ⇒ Float

Calculate average sentence length, in words per sentence, rounded to one decimal place

Examples:

TextStat.avg_sentence_length("Hello world! How are you?")  # => 2.5

Parameters:

  • text (String)

    the text to analyze

Returns:

  • (Float)

    average number of words per sentence

Since:

  • 1.0.0



# File 'lib/textstat/basic_stats.rb', line 118

def avg_sentence_length(text)
  asl = lexicon_count(text).to_f / sentence_count(text)
  asl.round(1)
rescue ZeroDivisionError
  0.0
end

#avg_sentence_per_word(text) ⇒ Float

Calculate average sentences per word

Examples:

TextStat.avg_sentence_per_word("Hello world! How are you?")  # => 0.4

Parameters:

  • text (String)

    the text to analyze

Returns:

  • (Float)

    average number of sentences per word

Since:

  • 1.0.0



# File 'lib/textstat/basic_stats.rb', line 160

def avg_sentence_per_word(text)
  sentence_per_word = sentence_count(text).to_f / lexicon_count(text)
  sentence_per_word.round(2)
rescue ZeroDivisionError
  0.0
end

#avg_syllables_per_word(text, language = 'en_us') ⇒ Float

Calculate average syllables per word

Examples:

TextStat.avg_syllables_per_word("beautiful morning")  # => 2.5

Parameters:

  • text (String)

    the text to analyze

  • language (String) (defaults to: 'en_us')

    language code for hyphenation dictionary

Returns:

  • (Float)

    average number of syllables per word

Since:

  • 1.0.0



# File 'lib/textstat/basic_stats.rb', line 132

def avg_syllables_per_word(text, language = 'en_us')
  syllable = syllable_count(text, language)
  words = lexicon_count(text)
  syllables_per_word = syllable.to_f / words
  syllables_per_word.round(1)
rescue ZeroDivisionError
  0.0
end

#char_count(text, ignore_spaces = true) ⇒ Integer

Count characters in text

Examples:

TextStat.char_count("Hello world!")        # => 11
TextStat.char_count("Hello world!", false) # => 12

Parameters:

  • text (String)

    the text to analyze

  • ignore_spaces (Boolean) (defaults to: true)

    whether to ignore spaces in counting

Returns:

  • (Integer)

    number of characters

Since:

  • 1.0.0



# File 'lib/textstat/basic_stats.rb', line 52

def char_count(text, ignore_spaces = true)
  text = text.delete(' ') if ignore_spaces
  text.length
end
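One detail worth seeing in action: String#delete(' ') removes only the space character, so tabs and newlines still count even when ignore_spaces is true. The sketch below copies the listing into a standalone method to show this; the inputs are illustrative.

```ruby
# Standalone copy of the listing above, so the sketch runs without the gem.
def char_count(text, ignore_spaces = true)
  text = text.delete(' ') if ignore_spaces
  text.length
end

puts char_count("a b\tc\nd")         # 6: the tab and newline are still counted
puts char_count("a b\tc\nd", false)  # 7
puts char_count("Hi, you!")          # 7: punctuation is counted
```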

#lexicon_count(text, remove_punctuation = true) ⇒ Integer

Count words (lexicons) in text

Examples:

TextStat.lexicon_count("Hello, world!")       # => 2
TextStat.lexicon_count("Hello, world!", false) # => 2

Parameters:

  • text (String)

    the text to analyze

  • remove_punctuation (Boolean) (defaults to: true)

    whether to remove punctuation before counting

Returns:

  • (Integer)

    number of words

Since:

  • 1.0.0



# File 'lib/textstat/basic_stats.rb', line 65

def lexicon_count(text, remove_punctuation = true)
  text = text.gsub(NON_ALPHA_REGEX, '').squeeze(' ') if remove_punctuation
  text.split.count
end
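The documented example returns 2 either way, which hides what remove_punctuation changes. The sketch below copies the listing into a standalone method (with the pattern inlined) and uses inputs where the flag matters: free-standing punctuation and digit-only tokens disappear when it is true.

```ruby
# Standalone copy of the listing above, with NON_ALPHA_REGEX inlined.
def lexicon_count(text, remove_punctuation = true)
  text = text.gsub(/[^a-zA-Z\s]/, '').squeeze(' ') if remove_punctuation
  text.split.count
end

puts lexicon_count('well - known fact')         # 3: the lone "-" is removed
puts lexicon_count('well - known fact', false)  # 4: "-" counts as a token
puts lexicon_count('Room 101 is open')          # 3: digits are stripped too
puts lexicon_count("don't stop")                # 2: "don't" becomes "dont"
```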

#polysyllab_count(text, language = 'en_us') ⇒ Integer

Count polysyllabic words (3+ syllables)

Optimized to count syllables for all words in one pass using a cached hyphenator.

Examples:

TextStat.polysyllab_count("beautiful complicated")  # => 2

Parameters:

  • text (String)

    the text to analyze

  • language (String) (defaults to: 'en_us')

    language code for hyphenation dictionary

Returns:

  • (Integer)

    number of polysyllabic words

Since:

  • 1.0.0



# File 'lib/textstat/basic_stats.rb', line 176

def polysyllab_count(text, language = 'en_us')
  return 0 if text.empty?

  # Clean and split text once
  cleaned_text = text.downcase.gsub(NON_ALPHA_REGEX, '').squeeze(' ')
  words = cleaned_text.split
  return 0 if words.empty?

  # Use cached hyphenator for better performance
  hyphenator = BasicStats.get_hyphenator(language)
  count = 0
  words.each do |word|
    next if word.empty?

    word_hyphenated = hyphenator.visualise(word)
    syllables = word_hyphenated.count('-') + 1
    count += 1 if syllables >= 3
  end
  count
end
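The counting rule in the loop above (hyphenation points plus one, threshold of three) can be tried without the text-hyphen gem. The sketch below substitutes a stub object whose visualise method returns preset strings; the hyphenations are hand-written for illustration and are not real dictionary output.

```ruby
# Hand-written hyphenations for illustration only (an assumption, not dictionary output).
breaks = {
  'beautiful'   => 'beau-ti-ful',
  'complicated' => 'com-pli-cat-ed',
  'sunny'       => 'sun-ny'
}

# Stub standing in for Text::Hyphen; unknown words come back unhyphenated.
stub = Object.new
stub.define_singleton_method(:visualise) { |word| breaks.fetch(word, word) }

words = 'Beautiful, complicated sunny day!'.downcase.gsub(/[^a-zA-Z\s]/, '').split

# Syllables per word = number of hyphens + 1.
syllables = words.map { |word| stub.visualise(word).count('-') + 1 }
polysyllabic = syllables.count { |n| n >= 3 }

p syllables        # [3, 4, 2, 1]
puts polysyllabic  # 2
```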

#sentence_count(text) ⇒ Integer

Count sentences in text

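Counts sentence boundaries and adds one. A boundary is a terminal punctuation mark (., !, or ?), optionally followed by closing quotes or brackets, then a space or newline, then an uppercase ASCII letter. An abbreviation followed by a capitalized word, such as "Dr. Smith", is therefore counted as a boundary, and empty text yields 1.

Examples:

TextStat.sentence_count("Hello world! How are you?")  # => 2
TextStat.sentence_count("Dr. Smith went to the U.S.A.") # => 2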

Parameters:

  • text (String)

    the text to analyze

Returns:

  • (Integer)

    number of sentences

Since:

  • 1.0.0



# File 'lib/textstat/basic_stats.rb', line 108

def sentence_count(text)
  text.scan(SENTENCE_BOUNDARY_REGEX).map(&:strip).count + 1
end
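The boundary rule has a few edge cases that are easy to check directly. The sketch below is a standalone copy of the method with the pattern inlined; map(&:strip) is left out because it does not change the count.

```ruby
# Standalone copy of the listing above, with SENTENCE_BOUNDARY_REGEX inlined.
def sentence_count(text)
  text.scan(/[.?!]['\\)\]]*[ |\n][A-Z]/).count + 1
end

puts sentence_count('Hello world! How are you?')     # 2
puts sentence_count('Dr. Smith went to the U.S.A.')  # 2: "Dr. S" matches the pattern
puts sentence_count('it rained. then it stopped.')   # 1: no capital after the period
puts sentence_count('')                              # 1: zero boundaries plus one
```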

#syllable_count(text, language = 'en_us') ⇒ Integer

Count syllables in text using hyphenation

Uses the text-hyphen library to estimate syllables: each word is hyphenated, and its syllable count is taken as the number of hyphenation points plus one. Supports 22 languages including English, Spanish, French, German, and more. Hyphenator instances are cached for performance.

Examples:

TextStat.syllable_count("beautiful")          # => 3
TextStat.syllable_count("hello", "en_us")      # => 2
TextStat.syllable_count("bonjour", "fr")       # => 2

Parameters:

  • text (String)

    the text to analyze

  • language (String) (defaults to: 'en_us')

    language code for hyphenation dictionary

Returns:

  • (Integer)

    number of syllables

See Also:

  • DictionaryManager.supported_languages

Since:

  • 1.0.0



# File 'lib/textstat/basic_stats.rb', line 84

def syllable_count(text, language = 'en_us')
  return 0 if text.empty?

  text = text.downcase
  text.gsub(NON_ALPHA_REGEX, '').squeeze(' ') # NOTE: not assigned back (matches original behavior)
  hyphenator = BasicStats.get_hyphenator(language)
  count = 0
  text.split.each do |word|
    word_hyphenated = hyphenator.visualise(word)
    count += word_hyphenated.count('-') + 1
  end
  count
end
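The NOTE in the listing refers to the fact that String#gsub returns a new string and leaves its receiver untouched. The sketch below shows what that means for the words that reach the hyphenator: they keep their punctuation. Whether this changes the syllable total depends on how the hyphenator treats punctuation, which the sketch does not model.

```ruby
text = 'Hello, World!'.downcase

# As in the listing: the result of gsub is discarded, so text is unchanged.
text.gsub(/[^a-zA-Z\s]/, '').squeeze(' ')
unassigned = text.split

# For comparison: assigning the result strips punctuation before splitting.
cleaned = text.gsub(/[^a-zA-Z\s]/, '').squeeze(' ').split

p unassigned  # ["hello,", "world!"]
p cleaned     # ["hello", "world"]
```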