Module: TextStat::BasicStats
- Included in: Main
- Defined in: lib/textstat/basic_stats.rb
Overview
Basic text statistics calculations
This module provides fundamental text analysis methods such as counting characters, words, syllables, and sentences. These statistics form the foundation for more advanced readability calculations.
Constant Summary
Frozen regex constants to avoid recompilation overhead.
- NON_ALPHA_REGEX = /[^a-zA-Z\s]/.freeze
- SENTENCE_BOUNDARY_REGEX = /[.?!]['\\)\]]*[ |\n][A-Z]/.freeze
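To make the two patterns concrete, here is a minimal sketch that applies them to a sample string. The constants are copied from the summary above; the sample text and variable names are illustrative assumptions, not part of the library.

```ruby
# Constants copied from the Constant Summary above.
NON_ALPHA_REGEX = /[^a-zA-Z\s]/.freeze
SENTENCE_BOUNDARY_REGEX = /[.?!]['\\)\]]*[ |\n][A-Z]/.freeze

# Illustrative input (an assumption, not taken from the library).
sample = "Hello, world! It's 9 a.m. Now go."

# Everything except ASCII letters and whitespace is removed,
# so digits and punctuation disappear but spacing is untouched.
stripped = sample.gsub(NON_ALPHA_REGEX, '')

# Each match covers closing punctuation, one separator character,
# and the capital letter that starts the next sentence.
boundaries = sample.scan(SENTENCE_BOUNDARY_REGEX)

puts stripped.inspect    # "Hello world Its  am Now go"
puts boundaries.inspect  # ["! I", ". N"]
```

Note that removing the digit leaves two adjacent spaces, which is why the methods below call `squeeze(' ')` after stripping.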
Class Attribute Summary
- .hyphenator_cache ⇒ Object
Class Method Summary
- .clear_hyphenator_cache ⇒ Hash
  Clear all cached hyphenators.
- .get_hyphenator(language) ⇒ Text::Hyphen
  Get or create a cached Text::Hyphen instance for the specified language.
Instance Method Summary
- #avg_letter_per_word(text) ⇒ Float
  Calculate average letters per word.
- #avg_sentence_length(text) ⇒ Float
  Calculate average sentence length.
- #avg_sentence_per_word(text) ⇒ Float
  Calculate average sentences per word.
- #avg_syllables_per_word(text, language = 'en_us') ⇒ Float
  Calculate average syllables per word.
- #char_count(text, ignore_spaces = true) ⇒ Integer
  Count characters in text.
- #lexicon_count(text, remove_punctuation = true) ⇒ Integer
  Count words (lexicons) in text.
- #polysyllab_count(text, language = 'en_us') ⇒ Integer
  Count polysyllabic words (3+ syllables).
- #sentence_count(text) ⇒ Integer
  Count sentences in text.
- #syllable_count(text, language = 'en_us') ⇒ Integer
  Count syllables in text using hyphenation.
Class Attribute Details
.hyphenator_cache ⇒ Object
    # File 'lib/textstat/basic_stats.rb', line 25
    def hyphenator_cache
      @hyphenator_cache
    end
Class Method Details
.clear_hyphenator_cache ⇒ Hash
Clear all cached hyphenators.

    # File 'lib/textstat/basic_stats.rb', line 40
    def clear_hyphenator_cache
      @hyphenator_cache.clear
    end
.get_hyphenator(language) ⇒ Text::Hyphen
Get or create a cached Text::Hyphen instance for the specified language. The instance is built on the first request for a language (with left: 0 and right: 0) and reused on every later request.

    # File 'lib/textstat/basic_stats.rb', line 32
    def get_hyphenator(language)
      @hyphenator_cache[language] ||= Text::Hyphen.new(language: language, left: 0, right: 0)
    end
Instance Method Details
#avg_letter_per_word(text) ⇒ Float
Calculate average letters per word: the character count divided by the word count, rounded to two decimal places.

    # File 'lib/textstat/basic_stats.rb', line 147
    def avg_letter_per_word(text)
      letters_per_word = char_count(text).to_f / lexicon_count(text)
      letters_per_word.round(2)
    rescue ZeroDivisionError
      0.0
    end
#avg_sentence_length(text) ⇒ Float
Calculate average sentence length: the word count divided by the sentence count, rounded to one decimal place.

    # File 'lib/textstat/basic_stats.rb', line 118
    def avg_sentence_length(text)
      asl = lexicon_count(text).to_f / sentence_count(text)
      asl.round(1)
    rescue ZeroDivisionError
      0.0
    end
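As a worked example of the ratio, the following sketch re-implements the two counting helpers from this page as top-level methods (an assumption made so the snippet runs without the gem) and applies them to illustrative sample strings.

```ruby
# Constants and helper bodies copied from the listings on this page.
NON_ALPHA_REGEX = /[^a-zA-Z\s]/.freeze
SENTENCE_BOUNDARY_REGEX = /[.?!]['\\)\]]*[ |\n][A-Z]/.freeze

def lexicon_count(text)
  text.gsub(NON_ALPHA_REGEX, '').squeeze(' ').split.count
end

def sentence_count(text)
  text.scan(SENTENCE_BOUNDARY_REGEX).count + 1
end

# Words per sentence, rounded to one decimal place.
def avg_sentence_length(text)
  (lexicon_count(text).to_f / sentence_count(text)).round(1)
end

short_text = 'One two three. Four five.'
long_text  = 'One two three four. Five six seven. Eight nine ten.'

short_avg = avg_sentence_length(short_text) # 5 words / 2 sentences
long_avg  = avg_sentence_length(long_text)  # 10 words / 3 sentences

puts short_avg # 2.5
puts long_avg  # 3.3
```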
#avg_sentence_per_word(text) ⇒ Float
Calculate average sentences per word: the sentence count divided by the word count, rounded to two decimal places.

    # File 'lib/textstat/basic_stats.rb', line 160
    def avg_sentence_per_word(text)
      sentence_per_word = sentence_count(text).to_f / lexicon_count(text)
      sentence_per_word.round(2)
    rescue ZeroDivisionError
      0.0
    end
#avg_syllables_per_word(text, language = 'en_us') ⇒ Float
Calculate average syllables per word: the syllable count divided by the word count, rounded to one decimal place.

    # File 'lib/textstat/basic_stats.rb', line 132
    def avg_syllables_per_word(text, language = 'en_us')
      syllable = syllable_count(text, language)
      words = lexicon_count(text)
      syllables_per_word = syllable.to_f / words
      syllables_per_word.round(1)
    rescue ZeroDivisionError
      0.0
    end
#char_count(text, ignore_spaces = true) ⇒ Integer
Count characters in text. By default, space characters are excluded from the count; pass ignore_spaces = false to include them.

    # File 'lib/textstat/basic_stats.rb', line 52
    def char_count(text, ignore_spaces = true)
      text = text.delete(' ') if ignore_spaces
      text.length
    end
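A quick way to see what `ignore_spaces` does and does not cover is to run the method body on a string that mixes spaces with other whitespace. The method below is copied from the listing and defined at top level so it runs without the gem; the sample string is an illustrative assumption.

```ruby
# Method body copied from the listing above.
def char_count(text, ignore_spaces = true)
  text = text.delete(' ') if ignore_spaces
  text.length
end

# Six characters: "a", space, "b", tab, "c", newline.
sample = "a b\tc\n"

without_spaces = char_count(sample)        # only the space is dropped
with_spaces    = char_count(sample, false) # every character is counted

puts without_spaces # 5
puts with_spaces    # 6
```

Only the literal space character is removed, so tabs, newlines, and punctuation still contribute to the total.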
#lexicon_count(text, remove_punctuation = true) ⇒ Integer
Count words (lexicons) in text. By default, every character that is not an ASCII letter or whitespace is removed before the text is split on whitespace.

    # File 'lib/textstat/basic_stats.rb', line 65
    def lexicon_count(text, remove_punctuation = true)
      text = text.gsub(NON_ALPHA_REGEX, '').squeeze(' ') if remove_punctuation
      text.split.count
    end
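The effect of `remove_punctuation` is easiest to see on text that contains standalone digits and symbols. This sketch copies the constant and method body from this page as top-level definitions; the sample sentence is an illustrative assumption.

```ruby
# Constant and method body copied from the listings on this page.
NON_ALPHA_REGEX = /[^a-zA-Z\s]/.freeze

def lexicon_count(text, remove_punctuation = true)
  text = text.gsub(NON_ALPHA_REGEX, '').squeeze(' ') if remove_punctuation
  text.split.count
end

sample = "It's 3 p.m. - time to go!"

# Stripping leaves "Its pm time to go": the digit and the dash vanish.
stripped_count = lexicon_count(sample)

# Without stripping, "3" and "-" survive as separate tokens.
raw_count = lexicon_count(sample, false)

puts stripped_count # 5
puts raw_count      # 7
```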
#polysyllab_count(text, language = 'en_us') ⇒ Integer
Count polysyllabic words (3+ syllables).
Optimized to count syllables for all words in one pass using a cached hyphenator.

    # File 'lib/textstat/basic_stats.rb', line 176
    def polysyllab_count(text, language = 'en_us')
      return 0 if text.empty?

      # Clean and split text once
      cleaned_text = text.downcase.gsub(NON_ALPHA_REGEX, '').squeeze(' ')
      words = cleaned_text.split
      return 0 if words.empty?

      # Use cached hyphenator for better performance
      hyphenator = BasicStats.get_hyphenator(language)
      count = 0
      words.each do |word|
        next if word.empty?

        word_hyphenated = hyphenator.visualise(word)
        syllables = word_hyphenated.count('-') + 1
        count += 1 if syllables >= 3
      end
      count
    end
#sentence_count(text) ⇒ Integer
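Count sentences in text.
Identifies sentence boundaries as a punctuation mark (., ?, or !), optionally followed by closing characters (', \, ), or ]), then a space or newline, then a capital letter. The result is the number of boundaries found plus one, so text with no boundaries, including an empty string, counts as one sentence.

    # File 'lib/textstat/basic_stats.rb', line 108
    def sentence_count(text)
      text.scan(SENTENCE_BOUNDARY_REGEX).map(&:strip).count + 1
    end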
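The boundary heuristic has a few edge cases worth seeing in action. The sketch below copies the constant from this page and defines the method at top level; it omits `map(&:strip)` because stripping the matches does not change how many there are. The sample strings are illustrative assumptions.

```ruby
# Constant copied from the Constant Summary on this page.
SENTENCE_BOUNDARY_REGEX = /[.?!]['\\)\]]*[ |\n][A-Z]/.freeze

# Boundaries found plus one; map(&:strip) is left out because it
# does not affect the number of matches.
def sentence_count(text)
  text.scan(SENTENCE_BOUNDARY_REGEX).count + 1
end

plain = sentence_count('One. Two! Three?')            # two boundaries
empty = sentence_count('')                            # no boundaries
mixed = sentence_count('Dr. Smith arrived. he left.') # one boundary

puts plain # 3
puts empty # 1
puts mixed # 2
```

In the last case the abbreviation "Dr." is followed by a capital letter and is counted as a boundary, while the real break before the lowercase "he" is not.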
#syllable_count(text, language = 'en_us') ⇒ Integer
Count syllables in text using hyphenation.
Uses the text-hyphen library for accurate syllable counting across different languages. Supports 22 languages including English, Spanish, French, German, and more. Hyphenator instances are cached for performance.

    # File 'lib/textstat/basic_stats.rb', line 84
    def syllable_count(text, language = 'en_us')
      return 0 if text.empty?

      text = text.downcase
      text.gsub(NON_ALPHA_REGEX, '').squeeze(' ') # NOTE: not assigned back (matches original behavior)
      hyphenator = BasicStats.get_hyphenator(language)
      count = 0
      text.split.each do |word|
        word_hyphenated = hyphenator.visualise(word)
        count += word_hyphenated.count('-') + 1
      end
      count
    end
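Both `syllable_count` and `polysyllab_count` rely on the same idea: hyphenate a word, then count the hyphens and add one. The sketch below illustrates that arithmetic with `StubHyphenator`, a hypothetical stand-in whose fixed lookup table is an assumption; real break points come from Text::Hyphen and may differ.

```ruby
# Hypothetical stand-in for Text::Hyphen#visualise with a fixed table.
# Real hyphenation points come from the text-hyphen gem and may differ.
class StubHyphenator
  BREAKS = {
    'banana' => 'ba-na-na',
    'is'     => 'is',
    'yellow' => 'yel-low'
  }.freeze

  # Returns the word with hyphens at its break points,
  # or the word unchanged when it is not in the table.
  def visualise(word)
    BREAKS.fetch(word, word)
  end
end

# Hyphens mark breaks between syllables, so syllables = hyphens + 1.
def syllables_in(word, hyphenator)
  hyphenator.visualise(word).count('-') + 1
end

hyphenator = StubHyphenator.new
words = 'Banana is yellow'.downcase.split

per_word     = words.map { |word| syllables_in(word, hyphenator) }
total        = per_word.sum                  # what syllable_count adds up
polysyllabic = per_word.count { |n| n >= 3 } # what polysyllab_count tallies

puts per_word.inspect # [3, 1, 2]
puts total            # 6
puts polysyllabic     # 1
```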