Class: UTF8Utils::Byte

Inherits:
Object
Defined in:
lib/utf8_utils/byte.rb

Overview

A single UTF-8 byte.

Constructor Details

#initialize(byte) ⇒ Byte



# File 'lib/utf8_utils/byte.rb', line 8

def initialize(byte)
  @byte = byte
end

Instance Attribute Details

#byte ⇒ Object (readonly)

Returns the value of attribute byte.

# File 'lib/utf8_utils/byte.rb', line 6

def byte
  @byte
end

Instance Method Details

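#codepoint_bits ⇒ Object

The bits of this byte that contribute to the code point: the byte with its leading marker bits (see #codepoint_mask) cleared.

# File 'lib/utf8_utils/byte.rb', line 81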

def codepoint_bits
  byte ^ codepoint_mask
end
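
As a worked illustration of the XOR above, the following minimal sketch decodes the two bytes of "é" (U+00E9) by hand. It does not use UTF8Utils itself; the mask literals are copied from #codepoint_mask, and the variable names are chosen here for illustration only.

```ruby
# "é" is encoded in UTF-8 as the two bytes 0xC3 0xA9.
lead, cont = "\u00E9".bytes

# XOR with the matching mask clears the marker bits and keeps the payload.
lead_bits = lead ^ 0b1100_0000   # mask for the start of a 2-byte sequence
cont_bits = cont ^ 0b1000_0000   # mask for a continuation byte

# Each continuation byte carries 6 payload bits, so shift the lead's bits by 6.
codepoint = (lead_bits << 6) | cont_bits

puts format("lead bits: %05b, continuation bits: %06b", lead_bits, cont_bits)
puts format("code point: U+%04X", codepoint)   # U+00E9
```

XOR only behaves like "clear these bits" here because every bit set in the mask is known to be set in the byte.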

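#codepoint_mask ⇒ Object

The marker bits that identify this byte's type, as determined by #leading_1_bits. XORing the byte with this mask leaves only the code point bits.

# File 'lib/utf8_utils/byte.rb', line 12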

def codepoint_mask
  case leading_1_bits
  when 0 then 0
  when 1 then 0b1000_0000
  when 2 then 0b1100_0000
  when 3 then 0b1110_0000
  when 4 then 0b1111_0000
  end
end

#continuation? ⇒ Boolean

Is this a continuation byte?

# File 'lib/utf8_utils/byte.rb', line 23

def continuation?
  leading_1_bits == 1
end

#continuations ⇒ Object

How many continuation bytes should follow this byte?

# File 'lib/utf8_utils/byte.rb', line 28

def continuations
  bits = leading_1_bits
  bits < 2 ? 0 : bits - 1
end
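
The rule can be checked against real characters. The sketch below is a standalone approximation, not the library's code: `continuations_for` is a hypothetical helper that classifies the byte by numeric thresholds (equivalent to the nibble test in #leading_1_bits) and then applies the same `bits < 2 ? 0 : bits - 1` rule.

```ruby
# Hypothetical standalone counterpart of #continuations.
def continuations_for(byte)
  ones = if    byte < 0x80 then 0   # 0xxxxxxx
         elsif byte < 0xC0 then 1   # 10xxxxxx
         elsif byte < 0xE0 then 2   # 110xxxxx
         elsif byte < 0xF0 then 3   # 1110xxxx
         else                   4   # 1111xxxx
         end
  ones < 2 ? 0 : ones - 1
end

# One character each of 1, 2, 3 and 4 bytes in UTF-8.
["a", "\u00E9", "\u20AC", "\u{1F600}"].each do |char|
  lead = char.bytes.first
  puts "U+%04X: lead 0x%02X expects %d continuation byte(s), string has %d" %
       [char.ord, lead, continuations_for(lead), char.bytesize - 1]
end
```

A continuation byte itself (one leading 1 bit) reports 0, because nothing is required to follow it.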

#invalid? ⇒ Boolean

# File 'lib/utf8_utils/byte.rb', line 33

def invalid?
  !valid?
end

#leading_1_bits ⇒ Object

From Wikipedia’s entry on UTF-8:

The UTF-8 encoding is variable-width, with each character represented by 1 to 4 bytes. Each byte has 0–4 leading consecutive 1 bits followed by a zero bit to indicate its type. N 1 bits indicates the first byte in a N-byte sequence, with the exception that zero 1 bits indicates a one-byte sequence while one 1 bit indicates a continuation byte in a multi-byte sequence (this was done for ASCII compatibility).



# File 'lib/utf8_utils/byte.rb', line 46

def leading_1_bits
  nibble = byte >> 4
  if    nibble < 0b1000 then 0 # single-byte chars
  elsif nibble < 0b1100 then 1 # continuation byte
  elsif nibble < 0b1110 then 2 # start of 2-byte char
  elsif nibble < 0b1111 then 3 # 3-byte char
  else                       4 # 4-byte char
  end
end
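
To make the classification concrete, this sketch copies the nibble test into a standalone method and applies it to the bytes of a short string. The name `leading_1_bits_of` is hypothetical and not part of UTF8Utils.

```ruby
# Hypothetical standalone copy of the nibble test above.
def leading_1_bits_of(byte)
  nibble = byte >> 4
  if    nibble < 0b1000 then 0 # single-byte char
  elsif nibble < 0b1100 then 1 # continuation byte
  elsif nibble < 0b1110 then 2 # start of 2-byte char
  elsif nibble < 0b1111 then 3 # start of 3-byte char
  else                       4 # start of 4-byte char
  end
end

# "a" (1 byte), "é" (2 bytes) and "€" (3 bytes): 61, C3 A9, E2 82 AC.
sample = "a\u00E9\u20AC"
sample.bytes.each do |b|
  puts format("0x%02X  %08b  leading 1 bits: %d", b, b, leading_1_bits_of(b))
end

counts = sample.bytes.map { |b| leading_1_bits_of(b) }
# => [0, 2, 1, 3, 1, 1]
```

Because only the high nibble is inspected, any byte from 0xF0 upward reports 4, including bytes with five or more leading 1 bits. Those are caught separately by #restricted? and #undefined?.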

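#overlong? ⇒ Boolean

Is this the start of a 2-byte sequence whose code point would be ≤ 127? Such a sequence is an overlong encoding of a character that fits in a single byte.

# File 'lib/utf8_utils/byte.rb', line 58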

def overlong?
  (192..193) === byte
end
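
A short sketch of why bytes 192 and 193 are singled out: the largest code point a sequence starting with 0xC1 can encode is 127. The example uses only core Ruby (`Array#pack`, `String#force_encoding`, `String#valid_encoding?`), not UTF8Utils, and the variable names are illustrative.

```ruby
# 0xC1 0xBF is the largest 2-byte pattern that starts with an overlong lead.
lead, cont = 0xC1, 0xBF

# Strip the marker bits and combine the payloads, as a decoder would.
decoded = ((lead ^ 0b1100_0000) << 6) | (cont ^ 0b1000_0000)
puts decoded   # 127, which already fits in the single byte 0x7F

# Ruby's own validator rejects the overlong form.
overlong = [lead, cont].pack("C*").force_encoding("UTF-8")
puts overlong.valid_encoding?   # false
```

The smallest lead byte that can start a well-formed 2-byte sequence is therefore 0xC2 (194), whose lowest code point is U+0080.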

#restricted? ⇒ Boolean

RFC 3629 prohibits bytes 245-253: they would lead 4- to 6-byte sequences encoding code points above U+10FFFF, the upper limit the RFC sets.

# File 'lib/utf8_utils/byte.rb', line 64

def restricted?
  (245..253) === byte
end
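
The boundary can be observed with core Ruby alone. In this sketch, `utf8_from` is a hypothetical helper defined here for brevity; it is not part of UTF8Utils.

```ruby
# Hypothetical helper: build a UTF-8-tagged string from raw byte values.
def utf8_from(*bytes)
  bytes.pack("C*").force_encoding(Encoding::UTF_8)
end

max_valid  = utf8_from(0xF4, 0x8F, 0xBF, 0xBF)  # U+10FFFF, the last code point
restricted = utf8_from(0xF5, 0x80, 0x80, 0x80)  # payload bits would give U+140000

puts max_valid.valid_encoding?    # true
puts restricted.valid_encoding?   # false
```

0xF4 (244) is thus the highest lead byte that can appear in well-formed UTF-8, which is why the restricted range starts at 245.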

#to_i ⇒ Object

# File 'lib/utf8_utils/byte.rb', line 68

def to_i
  byte
end

#undefined? ⇒ Boolean

Bytes 254 and 255 are not defined by the original UTF-8 spec.

# File 'lib/utf8_utils/byte.rb', line 73

def undefined?
  (254..255) === byte
end

#valid? ⇒ Boolean

# File 'lib/utf8_utils/byte.rb', line 77

def valid?
  !(overlong? or restricted? or undefined?)
end
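
Taken together, the three predicates exclude a small fixed set of byte values. The sketch below enumerates them using standalone lambdas that mirror the ranges documented above. The lambda names are illustrative, and the code does not load UTF8Utils.

```ruby
# Hypothetical standalone predicates mirroring the documented ranges.
overlong   = ->(b) { (192..193) === b }
restricted = ->(b) { (245..253) === b }
undefined  = ->(b) { (254..255) === b }
valid      = ->(b) { !(overlong.(b) || restricted.(b) || undefined.(b)) }

# Every byte value that can never appear in well-formed UTF-8.
invalid_bytes = (0..255).reject { |b| valid.(b) }

puts invalid_bytes.map { |b| format("0x%02X", b) }.join(" ")
puts invalid_bytes.size   # 13
```

Note that this check is per byte only: a byte that passes it can still be part of an ill-formed sequence, for example a lead byte that is not followed by the expected number of continuation bytes.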