Class: UTF8Utils::Byte

Inherits: Object

Defined in: lib/utf8_utils/byte.rb

Overview

A single UTF-8 byte.

Instance Attribute Summary

#byte ⇒ Object (readonly)

Returns the value of attribute byte.

Instance Method Summary

#codepoint_bits ⇒ Object
#codepoint_mask ⇒ Object
#continuation? ⇒ Boolean

Is this a continuation byte?
#continuations ⇒ Object

How many continuation bytes should follow this byte?
#initialize(byte) ⇒ Byte constructor

A new instance of Byte.
#invalid? ⇒ Boolean
#leading_1_bits ⇒ Object

Number of leading consecutive 1 bits in the byte, as described in Wikipedia’s entry on UTF-8.
#overlong? ⇒ Boolean

Start of a 2-byte sequence, but code point ≤ 127.
#restricted? ⇒ Boolean

RFC 3629 prohibits bytes 245-253, which earlier UTF-8 definitions used as the leading bytes of 4- to 6-byte sequences.
#to_i ⇒ Object
#undefined? ⇒ Boolean

Bytes 254 and 255 are not defined by the original UTF-8 spec.
#valid? ⇒ Boolean

Constructor Details

#initialize(byte) ⇒ `Byte`



# File 'lib/utf8_utils/byte.rb', line 8

def initialize(byte)
  @byte = byte
end

Instance Attribute Details

#byte ⇒ `Object` (readonly)

Returns the value of attribute byte.



# File 'lib/utf8_utils/byte.rb', line 6

def byte
  @byte
end

Instance Method Details

#codepoint_bits ⇒ `Object`



# File 'lib/utf8_utils/byte.rb', line 81

def codepoint_bits
  byte ^ codepoint_mask
end

#codepoint_mask ⇒ `Object`

# File 'lib/utf8_utils/byte.rb', line 12

def codepoint_mask
  case leading_1_bits
  when 0 then 0
  when 1 then 0b1000_0000
  when 2 then 0b1100_0000
  when 3 then 0b1110_0000
  when 4 then 0b1111_0000
  end
end
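
To illustrate how #codepoint_mask and #codepoint_bits fit together, here is a minimal standalone sketch that decodes the two bytes of "é" (0xC3 0xA9). The free functions take the byte as an argument instead of reading an instance attribute, and the `leading_1_bits` helper is an assumed implementation, since its body is not reproduced on this page.

```ruby
# Assumed helper: count the 1 bits that precede the first 0 bit.
def leading_1_bits(byte)
  format("%08b", byte)[/\A1*/].length
end

# Same table as #codepoint_mask, with the byte passed in explicitly.
def codepoint_mask(byte)
  case leading_1_bits(byte)
  when 0 then 0
  when 1 then 0b1000_0000
  when 2 then 0b1100_0000
  when 3 then 0b1110_0000
  when 4 then 0b1111_0000
  end
end

# XOR with the mask clears the marker bits and leaves the payload bits.
def codepoint_bits(byte)
  byte ^ codepoint_mask(byte)
end

lead = 0xC3 # 0b1100_0011, first byte of a 2-byte sequence
cont = 0xA9 # 0b1010_1001, continuation byte

# The leading byte contributes the high bits, the continuation byte the low six.
codepoint = (codepoint_bits(lead) << 6) | codepoint_bits(cont)
codepoint # => 233 (U+00E9)
```

Note that the `case` has no branch for five or more leading 1 bits, so the mask is `nil` for bytes 248 and above and the XOR would raise a `TypeError` for them.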

#continuation? ⇒ `Boolean`

Is this a continuation byte?



# File 'lib/utf8_utils/byte.rb', line 23

def continuation?
  leading_1_bits == 1
end

#continuations ⇒ `Object`

How many continuation bytes should follow this byte?

# File 'lib/utf8_utils/byte.rb', line 28

def continuations
  bits = leading_1_bits
  bits < 2 ? 0 : bits - 1
end
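
As a rough illustration of what the continuation count is good for, the sketch below groups a byte array into sequences. `group_sequences` is a hypothetical helper, not part of the gem, `leading_1_bits` is an assumed implementation, and the input is assumed to be well formed.

```ruby
# Assumed helper: count the 1 bits that precede the first 0 bit.
def leading_1_bits(byte)
  format("%08b", byte)[/\A1*/].length
end

# Same rule as #continuations, with the byte passed in explicitly.
def continuations(byte)
  bits = leading_1_bits(byte)
  bits < 2 ? 0 : bits - 1
end

# Hypothetical helper: each leading byte claims as many following bytes
# as it announces. No validation is performed here.
def group_sequences(bytes)
  groups = []
  i = 0
  while i < bytes.length
    n = continuations(bytes[i])
    groups << bytes[i, n + 1]
    i += n + 1
  end
  groups
end

# "A", "é", and "€" encoded as UTF-8
group_sequences([0x41, 0xC3, 0xA9, 0xE2, 0x82, 0xAC])
# => [[65], [195, 169], [226, 130, 172]]
```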

#invalid? ⇒ `Boolean`



# File 'lib/utf8_utils/byte.rb', line 33

def invalid?
  !valid?
end

#leading_1_bits ⇒ `Object`

From Wikipedia’s entry on UTF-8:

The UTF-8 encoding is variable-width, with each character represented by 1 to 4 bytes. Each byte has 0–4 leading consecutive 1 bits followed by a zero bit to indicate its type. N 1 bits indicates the first byte in a N-byte sequence, with the exception that zero 1 bits indicates a one-byte sequence while one 1 bit indicates a continuation byte in a multi-byte sequence (this was done for ASCII compatibility).
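
The method body is not reproduced on this page. One way to count the leading 1 bits described in the excerpt, written as a standalone function under that assumption, is to walk a single-bit mask down from the most significant bit:

```ruby
# Hypothetical standalone version; not the gem's implementation.
def leading_1_bits(byte)
  count = 0
  mask = 0b1000_0000
  # Stop at the first 0 bit, or once all eight bits have been checked.
  while mask > 0 && (byte & mask) != 0
    count += 1
    mask >>= 1
  end
  count
end

leading_1_bits(0x41) # => 0, a one-byte (ASCII) sequence
leading_1_bits(0xA9) # => 1, a continuation byte
leading_1_bits(0xC3) # => 2, first byte of a 2-byte sequence
leading_1_bits(0xE2) # => 3, first byte of a 3-byte sequence
leading_1_bits(0xF0) # => 4, first byte of a 4-byte sequence
```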

#overlong? ⇒ `Boolean`

Start of a 2-byte sequence, but code point ≤ 127. Only bytes 192 and 193 fit this description.

#restricted? ⇒ `Boolean`

RFC 3629 prohibits bytes 245-253, which earlier UTF-8 definitions used as the leading bytes of 4- to 6-byte sequences.
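
The bodies of #overlong? and #restricted? do not appear on this page. A plausible sketch, assuming both are plain range checks in the style of #undefined?, might look like the following; the standalone function signatures are hypothetical.

```ruby
# Assumed range check: 0b1100_0000 and 0b1100_0001 carry at most one
# payload bit, so together with the six payload bits of the continuation
# byte the code point cannot exceed 0b111_1111 (127).
def overlong?(byte)
  (192..193) === byte
end

# Assumed range check for the leading bytes disallowed by RFC 3629.
def restricted?(byte)
  (245..253) === byte
end

# Largest code point reachable from leading byte 0xC1:
# one payload bit from the lead, six from continuation byte 0xBF.
max_overlong = ((0xC1 ^ 0b1100_0000) << 6) | (0xBF ^ 0b1000_0000)
max_overlong # => 127
```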

#to_i ⇒ `Object`



# File 'lib/utf8_utils/byte.rb', line 68

def to_i
  byte
end

#undefined? ⇒ `Boolean`

Bytes 254 and 255 are not defined by the original UTF-8 spec.



# File 'lib/utf8_utils/byte.rb', line 73

def undefined?
  (254..255) === byte
end

#valid? ⇒ `Boolean`



# File 'lib/utf8_utils/byte.rb', line 77

def valid?
  !(overlong? || restricted? || undefined?)
end
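
As a usage sketch, the three checks behind #valid? can be applied to every byte of a string to locate bytes that can never appear in UTF-8. The ranges are taken from the descriptions of #overlong?, #restricted?, and #undefined?; `valid_byte?` and `invalid_byte_offsets` are hypothetical names, not part of the gem.

```ruby
# Assumed byte ranges: overlong (192-193), restricted (245-253),
# undefined (254-255).
INVALID_RANGES = [192..193, 245..253, 254..255].freeze

def valid_byte?(byte)
  INVALID_RANGES.none? { |range| range === byte }
end

# Returns the offsets of all bytes that fail the per-byte check.
def invalid_byte_offsets(string)
  string.each_byte.with_index
        .reject { |byte, _index| valid_byte?(byte) }
        .map { |_byte, index| index }
end

invalid_byte_offsets("caf\xC0\xA9 \xFF".b) # => [3, 6]
```

This per-byte check only catches bytes that are invalid in isolation. A stray continuation byte such as 0xA9 above passes, so sequence-level validation would still be needed on top of it.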
