Class: Analyzers::Utils::AsciiLanguageDetector
- Inherits: Object
- Defined in: lib/crypto-toolbox/analyzers/utils/ascii_language_detector.rb
Constant Summary
- ASCII_BASE_RANGE =
(32..127).freeze
- ASCII_BLACKLIST =
[40,41,42,43,47,60,61,62,91,92,93,94,95,96,35,59].freeze
- ASCII_WHITELIST =
[10]
- ASCII_CHARACTERS =
10 == "\n" is now allowed!
( ASCII_BASE_RANGE.to_a + ASCII_WHITELIST - ASCII_BLACKLIST ).to_ary.freeze
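The allowed set can be reproduced in isolation; a quick sketch of how the constants above combine (plain Ruby, no gem code required):

```ruby
# Rebuild the allowed-character set exactly as the constants combine:
# printable ASCII, plus the whitelisted newline, minus blacklisted punctuation.
base_range = (32..127).freeze
blacklist  = [40, 41, 42, 43, 47, 60, 61, 62, 91, 92, 93, 94, 95, 96, 35, 59].freeze
whitelist  = [10].freeze

characters = (base_range.to_a + whitelist - blacklist).freeze

characters.include?(10)      # => true  (newline is whitelisted)
characters.include?('A'.ord) # => true  (plain letter)
characters.include?('('.ord) # => false (40 is blacklisted)
```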
Instance Method Summary
- #ascii_lingual?(buf) ⇒ Boolean
- #ascii_lingual_byte?(byte) ⇒ Boolean
NOTE: This is the output of the benchmark script contained in this gem; see benchmarks/language_detection.rb. It compares many ways of filtering bytes to check if only “plain” language characters are contained.
- #ascii_lingual_bytes ⇒ Object
- #ascii_lingual_bytes?(bytes) ⇒ Boolean
- #ascii_lingual_chars ⇒ Object
Instance Method Details
#ascii_lingual?(buf) ⇒ Boolean
# File 'lib/crypto-toolbox/analyzers/utils/ascii_language_detector.rb', line 46

def ascii_lingual?(buf)
  ascii_lingual_bytes?(buf.bytes)
end
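A minimal standalone sketch of this detection path, with the class's constants and helpers inlined (illustrative; not the gem's actual loading code):

```ruby
# Standalone sketch: same constants and logic as the class above, inlined.
ASCII_BASE_RANGE = (32..127).freeze
ASCII_BLACKLIST  = [40, 41, 42, 43, 47, 60, 61, 62, 91, 92, 93, 94, 95, 96, 35, 59].freeze
ASCII_WHITELIST  = [10].freeze

def ascii_lingual_byte?(byte)
  (ASCII_BASE_RANGE.cover?(byte) && !ASCII_BLACKLIST.include?(byte)) ||
    ASCII_WHITELIST.bsearch { |i| i == byte }
end

def ascii_lingual?(buf)
  buf.bytes.all? { |b| ascii_lingual_byte?(b) }
end

ascii_lingual?("Hello world!\n") # => true  (letters, space, '!' and newline all pass)
ascii_lingual?("a = b; # cmt")   # => false ('=', ';' and '#' are blacklisted)
```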
#ascii_lingual_byte?(byte) ⇒ Boolean
NOTE: This is the output of the benchmark script contained in this gem;
see: benchmarks/language_detection.rb
It compares many ways of filtering bytes to check if only “plain” language characters are contained. Result:
Comparison:
ascii_range_check: 1773.5 i/s <- use range.cover? and then blacklist.include?
ascii_lingual_byte?: 1494.8 i/s - 1.19x slower <- now uses range.cover? internally
ascii_lingual_bytes?: 1459.2 i/s - 1.22x slower <- see prev. but gets the entire byte array
ascii_lingual?: 1420.1 i/s - 1.25x slower <- see prev. but works on crypt buffers
ascii_lingual_and_human_language: 1413.6 i/s - 1.25x slower <- use human_language?, but apply 0 < byte < 127 first
ascii_shift_check: 634.4 i/s - 2.80x slower <- uses & (1 << 5).zero? but has to do slow additional checks
ascii_whitelist.bsearch?: 483.8 i/s - 3.50x slower <- whitelist lookup using bsearch
hunspell.human_language?: 212.3 i/s - 8.35x slower <- use human_language?
ascii_whitelist.include?: 90.2 i/s - 19.67x slower <- use (whitelist - blacklist).include?
hunspell_human_language_without_dict: 0.2 i/s - 10013.62x slower <- instantiating the dict seems to be very, very slow…
NOTE:
Normally the shift solution would be the fastest, but we have to convert back and forth,
thus the range.cover? check still seems to be the best solution. It is also more readable.
(We need the chr.downcase.ord conversion to support upper-case letters.)
byte < 127 && !(byte.chr.downcase.ord & (1 << 5)).zero?
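To illustrate why the downcase round-trip is needed, here is a small demo of the bit-5 test; the method name mirrors the benchmark entry above, but the exact benchmark implementation may differ:

```ruby
# Bit 5 (value 32) is set for lowercase ASCII letters, digits and the
# punctuation in 32..63, but clear for uppercase letters and control bytes.
# The chr.downcase.ord round-trip maps 'A'..'Z' into the bit-5-set range.
def ascii_shift_check(byte)
  byte < 127 && !(byte.chr.downcase.ord & (1 << 5)).zero?
end

ascii_shift_check('a'.ord) # => true  (97 & 32 != 0)
ascii_shift_check('A'.ord) # => true  (downcased to 97 first)
ascii_shift_check('@'.ord) # => false (64 & 32 == 0)
```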
# File 'lib/crypto-toolbox/analyzers/utils/ascii_language_detector.rb', line 33

def ascii_lingual_byte?(byte)
  # check how fast bsearch is; if range.cover? is no longer needed we can nicely add 10 to the array
  (ascii_base_range.cover?(byte) && !ascii_blacklist.include?(byte)) ||
    ascii_whitelist.bsearch { |i| i == byte }
end
#ascii_lingual_bytes ⇒ Object
# File 'lib/crypto-toolbox/analyzers/utils/ascii_language_detector.rb', line 50

def ascii_lingual_bytes
  ascii_whitelist.to_ary
end
#ascii_lingual_bytes?(bytes) ⇒ Boolean
# File 'lib/crypto-toolbox/analyzers/utils/ascii_language_detector.rb', line 38

def ascii_lingual_bytes?(bytes)
  bytes.all? { |b| ascii_lingual_byte?(b) }
end
#ascii_lingual_chars ⇒ Object
# File 'lib/crypto-toolbox/analyzers/utils/ascii_language_detector.rb', line 42

def ascii_lingual_chars
  ASCII_CHARACTERS
end