Class: UTF8Utils::Char

Inherits: Array (UTF8Utils::Char < Array < Object)

Defined in: lib/utf8_utils/char.rb

Instance Method Summary

#expected_length ⇒ Object

Given the first byte, how many bytes long should this character be?
#invalid? ⇒ Boolean

Is the character invalid?
#tidy ⇒ Object

Attempt to rescue a valid UTF-8 character from a malformed character.
#to_codepoint ⇒ Object

Get the Unicode codepoint that the bytes decode to.
#to_s ⇒ Object

Get a multibyte character from the bytes.
#valid? ⇒ Boolean

Is the character valid?

Instance Method Details

#expected_length ⇒ `Object`

Given the first byte, how many bytes long should this character be?



# File 'lib/utf8_utils/char.rb', line 6

def expected_length
  (first.continuations rescue 0) + 1
end
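
The listing above relies on `first.continuations`, which belongs to the byte class and is not shown on this page. As a rough sketch of the idea, assuming it follows the standard UTF-8 leading-byte bit patterns, the same calculation can be written against plain integers. The helper `expected_length_for` is hypothetical and not part of UTF8Utils.

```ruby
# Hypothetical helper, not part of UTF8Utils: the number of bytes a UTF-8
# character should occupy, judged from the bit pattern of its leading byte.
def expected_length_for(first_byte)
  case first_byte
  when 0xC0..0xDF then 2 # 110xxxxx: one continuation byte follows
  when 0xE0..0xEF then 3 # 1110xxxx: two continuation bytes follow
  when 0xF0..0xF7 then 4 # 11110xxx: three continuation bytes follow
  else 1                 # ASCII, or a byte that cannot lead a sequence
  end
end

# "a", U+00E9, U+20AC and U+1F600 occupy 1, 2, 3 and 4 bytes respectively.
sample = "a\u00E9\u20AC\u{1F600}"
sample.each_char.map { |c| expected_length_for(c.bytes.first) } # => [1, 2, 3, 4]
```

The `else 1` branch plays the same role as the `rescue 0` in the original: when no continuation count can be determined, the character is treated as a single byte.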

#invalid? ⇒ `Boolean`

Is the character invalid?

Returns:

(Boolean)



# File 'lib/utf8_utils/char.rb', line 11

def invalid?
  !valid?
end

#tidy ⇒ `Object`

Attempt to rescue a valid UTF-8 character from a malformed character. A character that is already valid is returned unchanged. Otherwise the method first tries to convert the first byte from CP1251 using the `UTF8Utils::CP1251` table. If the byte has no entry there, it prepends a valid leading byte (194 for bytes below 192, 195 otherwise), treating the byte as the last byte of a two-byte character. Much of the logic is taken from ActiveSupport; the difference is that this version works on Ruby 1.8.6 through 1.9.1.

# File 'lib/utf8_utils/char.rb', line 20

def tidy
  return self if valid?
  byte = first.to_i
  if UTF8Utils::CP1251.key? byte
    self.class.new [UTF8Utils::CP1251[byte]]
  elsif byte < 192
    self.class.new [194, byte]
  else
    self.class.new [195, byte - 64]
  end
end
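
The two fallback branches use the same arithmetic that converts a single ISO-8859-1 byte in the range 128..255 to UTF-8. The sketch below illustrates only those branches. The `UTF8Utils::CP1251` lookup is left out because the table is not shown on this page, and `latin1_fallback` is a hypothetical name, not a library method.

```ruby
# Hypothetical stand-in for the two fallback branches of #tidy.
# Only meaningful for bytes in 128..255.
def latin1_fallback(byte)
  if byte < 192
    [194, byte]      # U+0080..U+00BF encode as 0xC2 followed by the byte itself
  else
    [195, byte - 64] # U+00C0..U+00FF encode as 0xC3 followed by the byte minus 0x40
  end
end

# 0xE9 is a lone byte that is not valid UTF-8 on its own.
repaired = latin1_fallback(0xE9)                     # => [195, 169]
repaired.pack("C*").force_encoding("UTF-8")          # the two-byte UTF-8 form of U+00E9
```

On a Ruby with encoding support (1.9 or later), the result agrees with transcoding the byte from ISO-8859-1 to UTF-8.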

#to_codepoint ⇒ `Object`

Get the Unicode codepoint that the bytes decode to. Only the first decoded codepoint is returned.



# File 'lib/utf8_utils/char.rb', line 37

def to_codepoint
  flatten.map {|b| b.to_i }.pack("C*").unpack("U*")[0]
end
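
To illustrate the `pack`/`unpack` chain without the library, here is a small sketch using a plain nested array in place of a `UTF8Utils::Char`. The nesting mirrors what `#tidy` produces when it wraps a table entry in an array, which is presumably why the method calls `flatten` first. The byte values are illustrative.

```ruby
# A plain array standing in for a Char; the inner array imitates the
# nesting that #tidy can introduce.
bytes = [[226, 130, 172]]              # UTF-8 encoding of U+20AC (euro sign)

raw = bytes.flatten.pack("C*")         # "C*" writes each integer as one raw byte
codepoint = raw.unpack("U*")[0]        # "U*" decodes the bytes as UTF-8 codepoints
codepoint                              # => 8364 (0x20AC)
```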

#to_s ⇒ `Object`

Get a multibyte character from the bytes.



# File 'lib/utf8_utils/char.rb', line 33

def to_s
  flatten.map {|b| b.to_i }.pack("C*").unpack("U*").pack("U*")
end
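
A short sketch of why the round trip through `unpack("U*")` and `pack("U*")` is useful, assuming a Ruby with encoding support (1.9 or later): packing with `"C*"` alone yields a binary string, while re-packing the decoded codepoints with `"U*"` yields a string tagged as UTF-8. A plain array stands in for the `Char` here.

```ruby
bytes = [195, 169]                             # UTF-8 bytes of U+00E9

raw  = bytes.pack("C*")                        # raw bytes, binary encoding
char = raw.unpack("U*").pack("U*")             # same bytes, tagged as UTF-8

raw.encoding  == Encoding::BINARY              # => true
char.encoding == Encoding::UTF_8               # => true
char == "\u00E9"                               # => true
```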

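#valid? ⇒ `Boolean`

Is the character valid? It must contain exactly the expected number of bytes, none of them invalid, with a leading byte first and only continuation bytes after it.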

Returns:

(Boolean)

# File 'lib/utf8_utils/char.rb', line 41

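def valid?
  # A character must contain exactly as many bytes as its first byte announces.
  return false if length != expected_length
  each_with_index do |byte, index|
    return false if byte.invalid?
    # The first byte must not be a continuation; every later byte must be one.
    return false if index == 0 && byte.continuation?
    return false if index > 0 && !byte.continuation?
  end
  true
end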
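
The byte-level predicates used above (`invalid?`, `continuation?`, `continuations`) are defined on the byte class, which this page does not show. The following is a minimal sketch of the same checks on plain integers, assuming standard UTF-8 bit patterns. All helper names are hypothetical, and the sketch does not detect overlong encodings or surrogates.

```ruby
# Hypothetical integer-based stand-ins for the byte predicates.
def continuation?(byte)
  (byte & 0xC0) == 0x80 # 10xxxxxx
end

def never_valid?(byte)
  byte == 0xC0 || byte == 0xC1 || byte >= 0xF5 # bytes that never occur in UTF-8
end

def continuations_after(byte)
  case byte
  when 0xC0..0xDF then 1
  when 0xE0..0xEF then 2
  when 0xF0..0xF7 then 3
  else 0
  end
end

# Same three checks as #valid?: length, per-byte validity, byte position.
def valid_char?(bytes)
  return false if bytes.empty?
  return false if bytes.length != continuations_after(bytes.first) + 1
  bytes.each_with_index.all? do |byte, index|
    next false if never_valid?(byte)
    index.zero? ? !continuation?(byte) : continuation?(byte)
  end
end

valid_char?([195, 169]) # => true  (leading byte plus one continuation)
valid_char?([233])      # => false (announces three bytes, has one)
valid_char?([169])      # => false (continuation byte in first position)
```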
