Class: Analyzers::Utils::SpellChecker
- Inherits:
-
Object
- Object
- Analyzers::Utils::SpellChecker
- Defined in:
- lib/crypto-toolbox/analyzers/utils/spell_checker.rb
Instance Method Summary collapse
-
#human_language?(str) ⇒ Boolean
Check whether a given string seems to be part of a human language using the given dictionary.
- #human_phrase?(string) ⇒ Boolean
- #human_word?(str) ⇒ Boolean
-
#initialize(dict_lang = "en_US") ⇒ SpellChecker
constructor
A new instance of SpellChecker.
-
#known_words(str) ⇒ Object
NOTE: About spelling error rates and language detection:.
- #suggest(str) ⇒ Object
Constructor Details
#initialize(dict_lang = "en_US") ⇒ SpellChecker
Returns a new instance of SpellChecker.
8 9 10 11 |
# File 'lib/crypto-toolbox/analyzers/utils/spell_checker.rb', line 8 def initialize(dict_lang="en_US") @dict = FFI::Hunspell.dict(dict_lang) # @dict2 = FFI::Aspell::Speller.new(dict_lang) end |
Instance Method Details
#human_language?(str) ⇒ Boolean
Check whether a given string seems to be part of a human language using the given dictionary
NOTE: Using shell instead of hunspell ffi causes lots of escaping errors, even with shellwords.escape errors = Float(‘echo ’#Shellwords.escape(str)‘ |hunspell -l |wc -l `.split.first)
51 52 53 54 55 56 57 58 59 60 61 62 |
# File 'lib/crypto-toolbox/analyzers/utils/spell_checker.rb', line 51 def human_language?(str) #NOTE should be reject 1char numbers or all 1 char symbols words = str.split(" ").reject{|w| (w.length < 2 || w =~ /^[0-9]+$/) } word_amount = words.length errors = words.map{|e| check?(e) }.count{|e| e == false} error_rate = errors.to_f/word_amount report_error_rate(str,error_rate) if ENV["DEBUG_ANALYSIS"] error_rate_sufficient?(error_rate) end |
#human_phrase?(string) ⇒ Boolean
38 39 40 |
# File 'lib/crypto-toolbox/analyzers/utils/spell_checker.rb', line 38 def human_phrase?(string) string.split(" ").all?{|part| human_word?(part)} end |
#human_word?(str) ⇒ Boolean
34 35 36 |
# File 'lib/crypto-toolbox/analyzers/utils/spell_checker.rb', line 34 def human_word?(str) check?(str) end |
#known_words(str) ⇒ Object
NOTE: About spelling error rates and language detection:
missing punctuation support may lead to > 2% errors on valid texts, thus we use a high value . invalid decryptions tend to have spell error rates > 70 Some statistics about it: > summary(invalids)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.6000 1.0000 1.0000 0.9878 1.0000 1.0000
> summary(cut(invalids,10))
(0.6,0.64] (0.64,0.68] (0.68,0.72] (0.72,0.76] (0.76,0.8] (0.8,0.84]
8 13 9 534 1319 2809
(0.84,0.88] (0.88,0.92] (0.92,0.96] (0.96,1]
10581 46598 198477 1440651
NOTE: There is ony caveat: Short messages with < 5 words may have 33 or 50% error rates if numbers or single char words are taken into account
30 31 32 |
# File 'lib/crypto-toolbox/analyzers/utils/spell_checker.rb', line 30 def known_words(str) words = str.split(" ").select{|w| check?(w) } end |
#suggest(str) ⇒ Object
42 43 44 |
# File 'lib/crypto-toolbox/analyzers/utils/spell_checker.rb', line 42 def suggest(str) @dict.suggest(str) end |