Class: UTF8Utils::Char

Inherits:
Array
  • Object
Defined in:
lib/utf8_utils/char.rb

Instance Method Details

#expected_length ⇒ Object

Given the first byte, how many bytes long should this character be?



# File 'lib/utf8_utils/char.rb', line 6

def expected_length
  (first.continuations rescue 0) + 1
end
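
The `continuations` method belongs to the byte class, which is not shown on this page. As a rough sketch of the idea, assuming the standard UTF-8 leading-byte ranges, the expected length can be derived from the first byte alone. The helper names `continuations_for` and `expected_length_for` are hypothetical and are not part of UTF8Utils.

```ruby
# Hypothetical stand-in for the byte's continuation count (not library code).
# Returns how many continuation bytes should follow the given leading byte.
def continuations_for(byte)
  case byte
  when 0xC2..0xDF then 1  # 110xxxxx leads a two-byte character
  when 0xE0..0xEF then 2  # 1110xxxx leads a three-byte character
  when 0xF0..0xF4 then 3  # 11110xxx leads a four-byte character
  else 0                  # ASCII, continuation bytes, and invalid leaders
  end
end

# Mirrors the shape of expected_length: continuations plus the leading byte.
def expected_length_for(byte)
  continuations_for(byte) + 1
end

expected_length_for(0x61)  # 1, ASCII "a"
expected_length_for(0xC3)  # 2, first byte of "é"
expected_length_for(0xE2)  # 3, first byte of "€"
```

In the method above, the `rescue 0` modifier appears to cover the case where `first` is `nil` (an empty character), so the expected length falls back to 1.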

#invalid? ⇒ Boolean

Is the character invalid?

Returns:

  • (Boolean)


# File 'lib/utf8_utils/char.rb', line 11

def invalid?
  !valid?
end

#tidy ⇒ Object

Attempt to rescue a valid UTF-8 character from a malformed character. It first attempts to convert the leading byte from CP1251. If that isn't possible, it prepends a valid leading byte (194 or 195) and treats the original byte as the last byte of a two-byte character. Much of the logic here is taken from ActiveSupport; the difference is that this works on Ruby 1.8.6 through 1.9.1.



# File 'lib/utf8_utils/char.rb', line 20

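def tidy
  return self if valid?
  byte = first.to_i
  if UTF8Utils::CP1251.key? byte
    # The byte has an entry in the CP1251 table: use the mapped value.
    self.class.new [UTF8Utils::CP1251[byte]]
  elsif byte < 192
    # Prepend leading byte 194 (0xC2) and keep the byte as the continuation.
    self.class.new [194, byte]
  else
    # Prepend leading byte 195 (0xC3) and shift the byte down by 64.
    self.class.new [195, byte - 64]
  end
end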
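
To illustrate the fallback arithmetic in the last two branches, here is a small standalone sketch. The helper name `two_byte_fallback` is hypothetical, and the sketch assumes the input byte is in the range 128..255, where this mapping yields the UTF-8 encoding of the codepoint with the same number as the byte.

```ruby
# Hypothetical helper (not part of UTF8Utils): reproduces only the
# arithmetic of the final two branches of tidy for a single integer byte.
def two_byte_fallback(byte)
  if byte < 192
    [194, byte]       # 0xC2 followed by the byte unchanged
  else
    [195, byte - 64]  # 0xC3 followed by the byte shifted into 0x80..0xBF
  end
end

# 0xE9 alone is not valid UTF-8; the fallback yields the bytes 0xC3 0xA9.
fixed = two_byte_fallback(0xE9)            # [195, 169]
fixed.pack("C*").unpack("U*")              # [233], i.e. U+00E9 ("é")

# 0xA9 is a bare continuation byte; the fallback yields 0xC2 0xA9.
two_byte_fallback(0xA9)                    # [194, 169], i.e. U+00A9 ("©")
```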

#to_codepoint ⇒ Object

Get the Unicode codepoint of the character from the bytes.



# File 'lib/utf8_utils/char.rb', line 37

def to_codepoint
  flatten.map {|b| b.to_i }.pack("C*").unpack("U*")[0]
end
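
As a minimal sketch of the same `pack`/`unpack` technique on a plain array of integers (the real method first converts each byte object with `to_i`; the variable names here are illustrative only):

```ruby
# The three bytes of the UTF-8 encoding of "€" (U+20AC).
bytes = [0xE2, 0x82, 0xAC]

# pack("C*") builds a raw byte string; unpack("U*") decodes it as UTF-8
# and returns an array of codepoints. The first element is the codepoint.
codepoint = bytes.pack("C*").unpack("U*")[0]

codepoint  # 8364, which is 0x20AC
```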

#to_s ⇒ Object

Get a multibyte character from the bytes.



# File 'lib/utf8_utils/char.rb', line 33

def to_s
  flatten.map {|b| b.to_i }.pack("C*").unpack("U*").pack("U*")
end
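
The round trip through codepoints can be tried on a plain array of integers. This is only a sketch of the technique; the variable names are illustrative and not part of the library.

```ruby
# The two bytes of the UTF-8 encoding of "é" (U+00E9).
bytes = [0xC3, 0xA9]

# Raw bytes -> codepoints -> string built from those codepoints.
string = bytes.pack("C*").unpack("U*").pack("U*")

# The resulting string holds the same two bytes as the input.
string.unpack("C*")  # [195, 169]
```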

#valid? ⇒ Boolean

Is the character valid?

Returns:

  • (Boolean)


# File 'lib/utf8_utils/char.rb', line 41

def valid?
  return false if length != expected_length
  each_with_index do |byte, index|
    return false if byte.invalid?
    return false if index == 0 and byte.continuation?
    return false if index > 0 and !byte.continuation?
  end
  true
end
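
The structural checks above can be sketched on plain integers. This is an approximation under stated assumptions: `continuation_byte?` and `well_formed?` are hypothetical names, the expected length is passed in rather than derived from the first byte, and the per-byte `invalid?` check is omitted because the byte class is not shown on this page.

```ruby
# Hypothetical helper: a continuation byte has the bit pattern 10xxxxxx.
def continuation_byte?(byte)
  (byte & 0xC0) == 0x80
end

# Hypothetical helper mirroring the shape of valid?: the length must match,
# the first byte must not be a continuation, and every later byte must be.
def well_formed?(bytes, expected_length)
  return false if bytes.length != expected_length
  bytes.each_with_index do |byte, index|
    return false if index == 0 && continuation_byte?(byte)
    return false if index > 0 && !continuation_byte?(byte)
  end
  true
end

well_formed?([0xC3, 0xA9], 2)  # true:  leader followed by a continuation
well_formed?([0xC3, 0x28], 2)  # false: second byte is not a continuation
well_formed?([0xA9], 1)        # false: starts with a continuation byte
well_formed?([0xC3], 2)        # false: shorter than expected
```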