Class: UnicodeString
- Inherits: String
  - Object
    - String
      - UnicodeString
- Defined in: lib/unicode_madness/unicode_string.rb
Direct Known Subclasses
Instance Method Summary
- #codepoint ⇒ Object
  Returns the UCS codepoint of this character.
- #index_to_uindex(byte_index) ⇒ Object
  Converts a byte offset to a character offset.
- #kana? ⇒ Boolean
  Returns a Boolean indicating whether this character is a hiragana or katakana character.
- #kanji? ⇒ Boolean
  Returns a Boolean indicating whether this character is a kanji character.
- #uindex(substr, uoffset = 0) ⇒ Object
  Like index, but returns a character offset instead of a byte offset.
- #uindex_to_index(char_index) ⇒ Object
  Converts a character offset to a byte offset.
- #uslice(uoffset, ulength) ⇒ Object
  Like slice, but takes a character offset and length (instead of bytes).
- #wide_latin? ⇒ Boolean
  Returns a Boolean indicating whether this character is a full-width Latin character.
Instance Method Details
#codepoint ⇒ Object
Returns the UCS codepoint of this character. The string must contain exactly one character; otherwise a RangeError is raised. Only UTF-8 is currently supported ($KCODE must start with "u"); any other encoding raises an ArgumentError.
    # File 'lib/unicode_madness/unicode_string.rb', line 22
    def codepoint
      unless $KCODE =~ /^u/i
        raise ArgumentError, "unsupported encoding (#{$KCODE})"
      end
      unless jlength == 1
        raise RangeError, "string must be exactly one character long"
      end
      case self.length
      when 1
        UCSCodepoint.new(self[0])
      when 2
        UCSCodepoint.new(
          ((self[0] & 0x1f) << 6) +
          (self[1] & 0x3f)
        )
      when 3
        UCSCodepoint.new(
          ((self[0] & 0x0f) << 12) +
          ((self[1] & 0x3f) << 6) +
          (self[2] & 0x3f)
        )
      when 4
        UCSCodepoint.new(
          ((self[0] & 0x07) << 18) +
          ((self[1] & 0x3f) << 12) +
          ((self[2] & 0x3f) << 6) +
          (self[3] & 0x3f)
        )
      end
    end
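The listing above relies on $KCODE, jlength and integer-returning String#[], which only exist in older Ruby versions. As a rough illustration of the same bit masking, here is a standalone sketch for current Ruby. The helper name utf8_codepoint is hypothetical and not part of this library, and it returns a plain Integer rather than a UCSCodepoint.

```ruby
# Hypothetical helper (not part of UnicodeString): decodes one UTF-8
# character by hand, using the same masks as the listing above.
def utf8_codepoint(char)
  unless char.encoding == Encoding::UTF_8 && char.length == 1
    raise ArgumentError, "expected exactly one UTF-8 character"
  end
  b = char.bytes
  case b.length
  when 1 then b[0]
  when 2 then ((b[0] & 0x1f) << 6) | (b[1] & 0x3f)
  when 3 then ((b[0] & 0x0f) << 12) | ((b[1] & 0x3f) << 6) | (b[2] & 0x3f)
  when 4
    ((b[0] & 0x07) << 18) | ((b[1] & 0x3f) << 12) |
      ((b[2] & 0x3f) << 6) | (b[3] & 0x3f)
  end
end

utf8_codepoint("\u3042") # => 12354 (0x3042, HIRAGANA LETTER A)
```

In current Ruby the built-in String#ord gives the same number, so the manual decoding is only useful for seeing how the lead byte's length prefix and the six payload bits of each continuation byte are combined.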
#index_to_uindex(byte_index) ⇒ Object
Converts a byte offset to a character offset. The byte offset must be greater than or equal to zero and less than or equal to the byte length of the string; otherwise a RangeError is raised. Returns nil if the offset falls in the middle of a character, or if byte_index is nil.
    # File 'lib/unicode_madness/unicode_string.rb', line 72
    def index_to_uindex(byte_index)
      return nil if byte_index.nil?
      if byte_index < 0 || byte_index > length
        raise RangeError, 'index out of range'
      end
      chars = split('')
      char_index = 0
      chars.each do |ch|
        break if byte_index == 0
        byte_index -= ch.length
        return nil if byte_index < 0
        char_index += 1
      end
      char_index
    end
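To make the "middle of a character" case concrete, the following is a minimal sketch of the same walk in current Ruby, where length counts characters and bytesize counts bytes. The name byte_to_char_index is an assumption made for illustration; it is not a method of this class.

```ruby
# Hypothetical standalone equivalent of index_to_uindex for current Ruby.
# Walks the characters, tracking how many bytes have been consumed.
def byte_to_char_index(str, byte_index)
  return nil if byte_index.nil?
  if byte_index < 0 || byte_index > str.bytesize
    raise RangeError, "index out of range"
  end
  consumed = 0
  str.each_char.with_index do |ch, i|
    return i if consumed == byte_index    # offset sits on a character boundary
    consumed += ch.bytesize
    return nil if consumed > byte_index   # offset fell inside this character
  end
  str.length                              # offset equals the total byte length
end

s = "a\u3042b"            # 1-byte, 3-byte, 1-byte characters
byte_to_char_index(s, 1)  # => 1
byte_to_char_index(s, 2)  # => nil
byte_to_char_index(s, 4)  # => 2
```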
#kana? ⇒ Boolean
Returns a Boolean indicating whether this character is a hiragana or katakana character. (This string must contain only one character.)
    # File 'lib/unicode_madness/unicode_string.rb', line 10
    def kana?
      codepoint.kana?
    end
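kana? delegates to UCSCodepoint, whose implementation is not shown on this page. As an illustration only, a check of this kind can be sketched with the Unicode Hiragana (U+3040..U+309F) and Katakana (U+30A0..U+30FF) blocks. Whether UCSCodepoint#kana? uses exactly these ranges (for example, whether it also covers half-width katakana) is an assumption, and the names below are hypothetical.

```ruby
# Hypothetical sketch: the ranges are the Unicode Hiragana and Katakana
# blocks; the library's own UCSCodepoint#kana? may draw the line differently.
HIRAGANA_BLOCK = (0x3040..0x309F)
KATAKANA_BLOCK = (0x30A0..0x30FF)

def kana_char?(char)
  unless char.length == 1
    raise RangeError, "string must be exactly one character long"
  end
  cp = char.ord
  HIRAGANA_BLOCK.cover?(cp) || KATAKANA_BLOCK.cover?(cp)
end

kana_char?("\u3042") # => true  (hiragana)
kana_char?("\u6F22") # => false (kanji)
```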
#kanji? ⇒ Boolean
Returns a Boolean indicating whether this character is a kanji character. (This string must contain only one character.)
    # File 'lib/unicode_madness/unicode_string.rb', line 4
    def kanji?
      codepoint.kanji?
    end
#uindex(substr, uoffset = 0) ⇒ Object
Like index, but returns a character offset instead of a byte offset. The starting offset is also in characters instead of bytes.
    # File 'lib/unicode_madness/unicode_string.rb', line 56
    def uindex(substr, uoffset = 0)
      offset = uindex_to_index(uoffset)
      index_to_uindex(index(substr, offset))
    end
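The method converts the character offset to bytes, searches on bytes, then converts the match position back to characters. The sketch below reproduces that round trip in current Ruby by searching binary copies of the strings. char_index_of is a hypothetical name; in current Ruby, String#index is already character-based, so this is purely illustrative.

```ruby
# Hypothetical sketch of the uindex round trip: characters -> bytes,
# byte-level search, then bytes -> characters.
def char_index_of(str, substr, char_offset = 0)
  byte_start = str[0, char_offset].bytesize        # character offset -> byte offset
  byte_pos = str.b.index(substr.b, byte_start)     # search on raw bytes
  return nil if byte_pos.nil?
  str.byteslice(0, byte_pos).length                # byte offset -> character offset
end

s = "a\u3042b\u3042"
char_index_of(s, "\u3042")    # => 1
char_index_of(s, "\u3042", 2) # => 3
```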
#uindex_to_index(char_index) ⇒ Object
Converts a character offset to a byte offset. The character offset must be greater than or equal to zero and less than or equal to the character length of the string; otherwise a RangeError is raised. Returns nil if char_index is nil.
    # File 'lib/unicode_madness/unicode_string.rb', line 92
    def uindex_to_index(char_index)
      return nil if char_index.nil?
      if char_index < 0 || char_index > jlength
        raise RangeError, 'index out of range'
      end
      chars = split('')
      byte_index = 0
      char_index.times do |i|
        byte_index += chars[i].length
      end
      byte_index
    end
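The loop sums the byte lengths of the first char_index characters. A minimal sketch of the same idea in current Ruby, under the assumption that the string's length counts characters, takes the byte size of the character prefix directly. char_to_byte_index is a hypothetical name, not part of this class.

```ruby
# Hypothetical standalone equivalent of uindex_to_index for current Ruby.
def char_to_byte_index(str, char_index)
  return nil if char_index.nil?
  if char_index < 0 || char_index > str.length
    raise RangeError, "index out of range"
  end
  # The byte offset of character N is the byte size of the first N characters.
  str[0, char_index].bytesize
end

s = "a\u3042b"           # 1-byte, 3-byte, 1-byte characters
char_to_byte_index(s, 2) # => 4
char_to_byte_index(s, 3) # => 5
```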
#uslice(uoffset, ulength) ⇒ Object
Like slice, but takes a character offset and length (instead of bytes). Negative lengths are not supported.
    # File 'lib/unicode_madness/unicode_string.rb', line 63
    def uslice(uoffset, ulength)
      offset = uindex_to_index(uoffset)
      substr = slice(offset, length)
      substr.split('')[0,ulength].join('')
    end
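For comparison, here is a small sketch of character-based slicing in current Ruby that mirrors the "skip to the offset, then take ulength characters" approach. char_slice is a hypothetical helper; current Ruby's own String#slice already works in characters, so the helper only spells out the steps.

```ruby
# Hypothetical sketch: skip char_offset characters, then keep at most
# char_length characters. Negative lengths are rejected explicitly.
def char_slice(str, char_offset, char_length)
  raise ArgumentError, "negative length" if char_length < 0
  str.each_char.drop(char_offset).first(char_length).join
end

s = "a\u3042b\u3044"
char_slice(s, 1, 2) == "\u3042b" # => true
char_slice(s, 1, 2) == s[1, 2]   # => true
```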
#wide_latin? ⇒ Boolean
Returns a Boolean indicating whether this character is a full-width Latin character. (This string must contain only one character.)
    # File 'lib/unicode_madness/unicode_string.rb', line 16
    def wide_latin?
      codepoint.wide_latin?
    end