Class: UTF8Utils::Byte
- Inherits: Object
  - Object
    - UTF8Utils::Byte
- Defined in: lib/utf8_utils/byte.rb
Overview
A single UTF-8 byte.
Instance Attribute Summary
- #byte ⇒ Object (readonly)
  Returns the value of attribute byte.
Instance Method Summary
- #codepoint_bits ⇒ Object
- #codepoint_mask ⇒ Object
- #continuation? ⇒ Boolean
  Is this a continuation byte?
- #continuations ⇒ Object
  How many continuation bytes should follow this byte?
- #initialize(byte) ⇒ Byte (constructor)
  A new instance of Byte.
- #invalid? ⇒ Boolean
- #leading_1_bits ⇒ Object
  Number of leading consecutive 1 bits, as described in Wikipedia’s entry on UTF-8.
- #overlong? ⇒ Boolean
  Start of a 2-byte sequence, but code point ≤ 127.
- #restricted? ⇒ Boolean
  RFC 3629 disallows bytes 245-253, which earlier UTF-8 definitions used as leading bytes of 4- to 6-byte sequences.
- #to_i ⇒ Object
- #undefined? ⇒ Boolean
  Bytes 254 and 255 are not defined by the original UTF-8 spec.
- #valid? ⇒ Boolean
Constructor Details
#initialize(byte) ⇒ Byte
    # File 'lib/utf8_utils/byte.rb', line 8
    def initialize(byte)
      @byte = byte
    end
Instance Attribute Details
#byte ⇒ Object (readonly)
Returns the value of attribute byte.
    # File 'lib/utf8_utils/byte.rb', line 6
    def byte
      @byte
    end
Instance Method Details
#codepoint_bits ⇒ Object
    # File 'lib/utf8_utils/byte.rb', line 81
    def codepoint_bits
      byte ^ codepoint_mask
    end
#codepoint_mask ⇒ Object
    # File 'lib/utf8_utils/byte.rb', line 12
    def codepoint_mask
      case leading_1_bits
      when 0 then 0
      when 1 then 0b1000_0000
      when 2 then 0b1100_0000
      when 3 then 0b1110_0000
      when 4 then 0b1111_0000
      end
    end
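The mask and the XOR in #codepoint_bits together strip the marker bits from a byte, leaving only the bits that belong to the code point. The following is a minimal standalone sketch of how those payload bits could be combined into a code point. The names `MASKS`, `payload_bits`, and `decode` are hypothetical and not part of UTF8Utils; `leading_1_bits` is re-implemented here as a plain function so the example runs without the library.

```ruby
# Hypothetical lookup table mirroring the documented #codepoint_mask values.
MASKS = { 0 => 0, 1 => 0b1000_0000, 2 => 0b1100_0000,
          3 => 0b1110_0000, 4 => 0b1111_0000 }.freeze

# Plain-function copy of the documented #leading_1_bits logic.
def leading_1_bits(byte)
  nibble = byte >> 4
  if    nibble < 0b1000 then 0
  elsif nibble < 0b1100 then 1
  elsif nibble < 0b1110 then 2
  elsif nibble < 0b1111 then 3
  else 4
  end
end

# XOR with the mask clears the marker bits and keeps the payload.
def payload_bits(byte)
  byte ^ MASKS.fetch(leading_1_bits(byte))
end

# Combine one leading byte and its continuation bytes into a code point.
# Every continuation byte contributes 6 payload bits.
# Assumes the input is a single well-formed sequence.
def decode(bytes)
  bytes.reduce(0) { |acc, b| (acc << 6) | payload_bits(b) }
end

bytes = "\u20AC".bytes               # EURO SIGN, encoded as E2 82 AC
puts format("U+%04X", decode(bytes)) # prints "U+20AC"
```

The sketch performs no validation, so it should only be fed sequences that have already passed checks such as #valid?.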
#continuation? ⇒ Boolean
Is this a continuation byte?
    # File 'lib/utf8_utils/byte.rb', line 23
    def continuation?
      leading_1_bits == 1
    end
#continuations ⇒ Object
How many continuation bytes should follow this byte?
    # File 'lib/utf8_utils/byte.rb', line 28
    def continuations
      bits = leading_1_bits
      bits < 2 ? 0 : bits - 1
    end
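One use for the continuation count is to group a byte stream into per-character sequences. The sketch below is an illustration only: `continuations` is re-implemented as a standalone function, and `split_sequences` is a hypothetical helper that does not exist in UTF8Utils. It assumes well-formed input.

```ruby
# Standalone copy of the documented rule: fewer than two leading 1 bits
# means no continuation bytes follow; otherwise one fewer than the count.
def continuations(byte)
  bits = case byte >> 4
         when 0b0000..0b0111 then 0
         when 0b1000..0b1011 then 1
         when 0b1100..0b1101 then 2
         when 0b1110 then 3
         else 4
         end
  bits < 2 ? 0 : bits - 1
end

# Hypothetical helper: slice a byte array into one array per character.
# Assumes well-formed input; it does not check that the following bytes
# really are continuation bytes.
def split_sequences(bytes)
  sequences = []
  i = 0
  while i < bytes.length
    len = 1 + continuations(bytes[i])
    sequences << bytes[i, len]
    i += len
  end
  sequences
end

p split_sequences("a\u00E9\u20AC".bytes)
# => [[97], [195, 169], [226, 130, 172]]
```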
#invalid? ⇒ Boolean
    # File 'lib/utf8_utils/byte.rb', line 33
    def invalid?
      !valid?
    end
#leading_1_bits ⇒ Object
From Wikipedia’s entry on UTF-8:
The UTF-8 encoding is variable-width, with each character represented by 1 to 4 bytes. Each byte has 0–4 leading consecutive 1 bits followed by a zero bit to indicate its type. N 1 bits indicates the first byte in a N-byte sequence, with the exception that zero 1 bits indicates a one-byte sequence while one 1 bit indicates a continuation byte in a multi-byte sequence (this was done for ASCII compatibility).
    # File 'lib/utf8_utils/byte.rb', line 46
    def leading_1_bits
      nibble = byte >> 4
      if nibble < 0b1000 then 0    # single-byte chars
      elsif nibble < 0b1100 then 1 # continuation byte
      elsif nibble < 0b1110 then 2 # start of 2-byte char
      elsif nibble < 0b1111 then 3 # 3-byte char
      else 4                       # 4-byte char
      end
    end
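The rule quoted from Wikipedia can be checked by hand without the library. The following is a minimal sketch using two hypothetical helpers, `leading_ones` and `byte_role`, neither of which is part of UTF8Utils. Unlike the documented method, which inspects only the high nibble and caps the result at 4, this version walks down from bit 7 and can therefore report counts above 4 for bytes such as 0xFF.

```ruby
# Hypothetical helper: count the consecutive 1 bits at the top of an
# 8-bit value by testing each bit from the most significant down.
def leading_ones(byte)
  count = 0
  7.downto(0) do |bit|
    break if byte[bit].zero?
    count += 1
  end
  count
end

# Map the count to the role described in the quoted passage.
def byte_role(byte)
  case leading_ones(byte)
  when 0 then :single
  when 1 then :continuation
  when 2..4 then :leading
  else :invalid
  end
end

# "é" (U+00E9) is encoded in UTF-8 as the two bytes C3 A9.
"\u00E9".bytes.each do |b|
  puts format("0x%02X %08b %s", b, b, byte_role(b))
end
# 0xC3 11000011 leading
# 0xA9 10101001 continuation
```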
#overlong? ⇒ Boolean
Start of a 2-byte sequence, but code point ≤ 127
    # File 'lib/utf8_utils/byte.rb', line 58
    def overlong?
      (192..193) === byte
    end
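The reason bytes 192 and 193 (0xC0 and 0xC1) are always overlong can be shown with a little arithmetic. The sketch below uses a hypothetical helper, `max_two_byte_codepoint`, which is not part of UTF8Utils. It computes the largest code point a two-byte sequence can carry for a given leading byte.

```ruby
# Largest code point a two-byte sequence can carry for a given leading byte:
# 5 payload bits from the leading byte, then 6 payload bits (all set to 1)
# from the continuation byte.
def max_two_byte_codepoint(lead)
  ((lead & 0b0001_1111) << 6) | 0b0011_1111
end

[0xC0, 0xC1, 0xC2].each do |lead|
  puts format("0x%02X -> max U+%04X", lead, max_two_byte_codepoint(lead))
end
# 0xC0 -> max U+003F
# 0xC1 -> max U+007F
# 0xC2 -> max U+00BF
```

With 0xC0 or 0xC1 as the leading byte the result never exceeds 127, so the same code point already has a one-byte encoding. 0xC2 is the first leading byte that can reach beyond that range.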
#restricted? ⇒ Boolean
RFC 3629 disallows bytes 245-253, which earlier UTF-8 definitions used as leading bytes of 4- to 6-byte sequences.
    # File 'lib/utf8_utils/byte.rb', line 64
    def restricted?
      (245..253) === byte
    end
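RFC 3629 limits UTF-8 to code points up to U+10FFFF, which is why the lowest bytes in this range can no longer start a sequence. The following sketch, using a hypothetical helper `min_four_byte_codepoint` that is not part of UTF8Utils, compares the smallest code point reachable from the leading bytes 0xF4 (244) and 0xF5 (245). It covers only four-byte leaders; bytes 248-253 began 5- and 6-byte sequences in earlier definitions, and those lengths no longer exist.

```ruby
MAX_CODEPOINT = 0x10FFFF # upper limit set by RFC 3629

# Smallest code point a four-byte sequence can carry for a given leading byte:
# 3 payload bits from the leading byte, with all 18 continuation bits zero.
def min_four_byte_codepoint(lead)
  (lead & 0b0000_0111) << 18
end

[0xF4, 0xF5].each do |lead|
  min = min_four_byte_codepoint(lead)
  puts format("0x%02X -> min U+%06X, within range: %s",
              lead, min, min <= MAX_CODEPOINT)
end
# 0xF4 -> min U+100000, within range: true
# 0xF5 -> min U+140000, within range: false
```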
#to_i ⇒ Object
    # File 'lib/utf8_utils/byte.rb', line 68
    def to_i
      byte
    end
#undefined? ⇒ Boolean
Bytes 254 and 255 are not defined by the original UTF-8 spec.
    # File 'lib/utf8_utils/byte.rb', line 73
    def undefined?
      (254..255) === byte
    end
#valid? ⇒ Boolean
    # File 'lib/utf8_utils/byte.rb', line 77
    def valid?
      !(overlong? or restricted? or undefined?)
    end
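The three range checks behind #valid? can be applied to every byte of a string to locate bytes that can never appear in UTF-8. The following is a minimal standalone sketch; `invalid_byte_reason` and `invalid_bytes` are hypothetical helpers and are not part of UTF8Utils.

```ruby
# Hypothetical helper: name the documented range a byte falls into,
# or return nil when the byte passes all three checks.
def invalid_byte_reason(byte)
  case byte
  when 192..193 then :overlong
  when 245..253 then :restricted
  when 254..255 then :undefined
  end
end

# Return [index, byte, reason] for each byte that fails a check.
def invalid_bytes(string)
  string.bytes.each_with_index.map do |byte, index|
    reason = invalid_byte_reason(byte)
    [index, byte, reason] if reason
  end.compact
end

p invalid_bytes("ok\xC0\xFFok".b)
# => [[2, 192, :overlong], [3, 255, :undefined]]
```

A per-byte check of this kind looks at each byte in isolation, so it cannot detect errors that depend on context, such as a stray continuation byte or a truncated sequence.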