Class: Analyzers::Utils::AsciiLanguageDetector
- Inherits: Object
- Defined in: lib/crypto-toolbox/analyzers/utils/ascii_language_detector.rb
Constant Summary
- ASCII_BASE_RANGE =
(32..127).freeze
- ASCII_BLACKLIST =
[40,41,42,43,47,60,61,62,91,92,93,94,95,96,35,59].freeze
- ASCII_WHITELIST =
[10]
- ASCII_CHARACTERS =
10 == "\n" is now allowed!
( ASCII_BASE_RANGE.to_a + ASCII_WHITELIST - ASCII_BLACKLIST ).to_ary.freeze
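The allowed set can be reproduced in isolation; a quick sketch of how the constants above combine (plain Ruby, no gem code required):

```ruby
# Rebuild the allowed-character set exactly as the constants combine:
# printable ASCII, plus the whitelisted newline, minus blacklisted punctuation.
base_range = (32..127).freeze
blacklist  = [40, 41, 42, 43, 47, 60, 61, 62, 91, 92, 93, 94, 95, 96, 35, 59].freeze
whitelist  = [10].freeze

characters = (base_range.to_a + whitelist - blacklist).freeze

characters.include?(10)      # => true  (newline is whitelisted)
characters.include?('A'.ord) # => true  (plain letter)
characters.include?('('.ord) # => false (40 is blacklisted)
```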
Instance Method Summary
- #ascii_lingual?(buf) ⇒ Boolean
- #ascii_lingual_byte?(byte) ⇒ Boolean
NOTE: This is the output of the benchmark script contained in this gem; see benchmarks/language_detection.rb. It compares many ways of filtering bytes to check if only “plain” language characters are contained.
- #ascii_lingual_bytes ⇒ Object
- #ascii_lingual_bytes?(bytes) ⇒ Boolean
- #ascii_lingual_chars ⇒ Object
Instance Method Details
#ascii_lingual?(buf) ⇒ Boolean
# File 'lib/crypto-toolbox/analyzers/utils/ascii_language_detector.rb', line 46

def ascii_lingual?(buf)
  ascii_lingual_bytes?(buf.bytes)
end
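A minimal standalone sketch of this detection path, with the class's constants and helpers inlined (illustrative; not the gem's actual loading code):

```ruby
# Standalone sketch: same constants and logic as the class above, inlined.
ASCII_BASE_RANGE = (32..127).freeze
ASCII_BLACKLIST  = [40, 41, 42, 43, 47, 60, 61, 62, 91, 92, 93, 94, 95, 96, 35, 59].freeze
ASCII_WHITELIST  = [10].freeze

def ascii_lingual_byte?(byte)
  (ASCII_BASE_RANGE.cover?(byte) && !ASCII_BLACKLIST.include?(byte)) ||
    ASCII_WHITELIST.bsearch { |i| i == byte }
end

def ascii_lingual?(buf)
  buf.bytes.all? { |b| ascii_lingual_byte?(b) }
end

ascii_lingual?("Hello world!\n") # => true  (letters, space, '!' and newline all pass)
ascii_lingual?("a = b; # cmt")   # => false ('=', ';' and '#' are blacklisted)
```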
#ascii_lingual_byte?(byte) ⇒ Boolean
NOTE: This is the output of the benchmark script contained in this gem;
see: benchmarks/language_detection.rb
It compares many ways of filtering bytes to check if only “plain” language characters are contained. Result:
Comparison:
ascii_range_check: 1773.5 i/s <- use range.cover? and then blacklist.include?
ascii_lingual_byte?: 1494.8 i/s - 1.19x slower <- now uses range.cover? internally
ascii_lingual_bytes?: 1459.2 i/s - 1.22x slower <- see prev. but gets the entire byte array
ascii_lingual?: 1420.1 i/s - 1.25x slower <- see prev. but works on crypt buffers
ascii_lingual_and_human_language: 1413.6 i/s - 1.25x slower <- use human_language?, but apply 0 < byte < 127 first
ascii_shift_check: 634.4 i/s - 2.80x slower <- uses & (1 << 5).zero? but has to do slow additional checks
ascii_whitelist.bsearch?: 483.8 i/s - 3.50x slower <- whitelist lookup using bsearch
hunspell.human_language?: 212.3 i/s - 8.35x slower <- use human_language?
ascii_whitelist.include?: 90.2 i/s - 19.67x slower <- use (whitelist - blacklist).include?
hunspell_human_language_without_dict: 0.2 i/s - 10013.62x slower <- instantiating the dict seems to be very, very slow…
NOTE:
Normally the shift solution would be the fastest, but we have to convert back and forth,
thus the range.cover? check still seems to be the best solution. It is also more readable.
(We need the chr.downcase.ord conversion to support upper-case letters.)
byte < 127 && !(byte.chr.downcase.ord & (1 << 5)).zero?
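To illustrate why the downcase round-trip is needed, here is a small demo of the bit-5 test; the method name mirrors the benchmark entry above, but the exact benchmark implementation may differ:

```ruby
# Bit 5 (value 32) is set for lowercase ASCII letters, digits and the
# punctuation in 32..63, but clear for uppercase letters and control bytes.
# The chr.downcase.ord round-trip maps 'A'..'Z' into the bit-5-set range.
def ascii_shift_check(byte)
  byte < 127 && !(byte.chr.downcase.ord & (1 << 5)).zero?
end

ascii_shift_check('a'.ord) # => true  (97 & 32 != 0)
ascii_shift_check('A'.ord) # => true  (downcased to 97 first)
ascii_shift_check('@'.ord) # => false (64 & 32 == 0)
```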
# File 'lib/crypto-toolbox/analyzers/utils/ascii_language_detector.rb', line 33

def ascii_lingual_byte?(byte)
  # check how fast bsearch is; if range.cover? is no longer needed we can nicely add 10 to the array
  (ascii_base_range.cover?(byte) && !ascii_blacklist.include?(byte)) ||
    ascii_whitelist.bsearch { |i| i == byte }
end
#ascii_lingual_bytes ⇒ Object
# File 'lib/crypto-toolbox/analyzers/utils/ascii_language_detector.rb', line 50

def ascii_lingual_bytes
  ascii_whitelist.to_ary
end
#ascii_lingual_bytes?(bytes) ⇒ Boolean
# File 'lib/crypto-toolbox/analyzers/utils/ascii_language_detector.rb', line 38

def ascii_lingual_bytes?(bytes)
  bytes.all? { |b| ascii_lingual_byte?(b) }
end
#ascii_lingual_chars ⇒ Object
# File 'lib/crypto-toolbox/analyzers/utils/ascii_language_detector.rb', line 42

def ascii_lingual_chars
  ASCII_CHARACTERS
end