Class: UnicodeString

Inherits:
String < Object
Defined in:
lib/unicode_madness/unicode_string.rb

Direct Known Subclasses

JapaneseString

Instance Method Details

#codepoint ⇒ Object

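Returns the UCS codepoint of this character as a UCSCodepoint. The string must contain exactly one character; otherwise a RangeError is raised. Only UTF-8 is currently supported: if $KCODE does not begin with "u", an ArgumentError is raised.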



# File 'lib/unicode_madness/unicode_string.rb', line 22

def codepoint
  unless $KCODE =~ /^u/i
    raise ArgumentError, "unsupported encoding (#{$KCODE})"
  end
  unless jlength == 1
    raise RangeError, "string must be exactly one character long"
  end
  
  # Decode according to the UTF-8 sequence length in bytes; self[n] is the n-th byte.
  case self.length
  when 1
    UCSCodepoint.new(self[0])
  when 2
    UCSCodepoint.new(
      ((self[0] & 0x1f) << 6) +
      (self[1] & 0x3f)
    )
  when 3
    UCSCodepoint.new(
      ((self[0] & 0x0f) << 12) +
      ((self[1] & 0x3f) << 6) +
      (self[2] & 0x3f)
    )
  when 4
    UCSCodepoint.new(
      ((self[0] & 0x07) << 18) +
      ((self[1] & 0x3f) << 12) +
      ((self[2] & 0x3f) << 6) +
      (self[3] & 0x3f)
    )
  end
end
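
The method above relies on `$KCODE`, `jlength`, and integer-valued `self[n]`, so it only runs on older Ruby versions. As a rough illustration of the same bit arithmetic on a current Ruby, here is a minimal sketch. `utf8_codepoint` is a hypothetical helper, not part of UnicodeString, and it returns a plain Integer rather than a UCSCodepoint.

```ruby
# Hypothetical helper (not part of UnicodeString): decode one UTF-8
# character by hand, using the same masks and shifts as #codepoint.
def utf8_codepoint(char)
  unless char.length == 1
    raise RangeError, "string must be exactly one character long"
  end

  b = char.bytes # array of Integers, one per byte
  case b.length
  when 1
    b[0]
  when 2
    ((b[0] & 0x1f) << 6) | (b[1] & 0x3f)
  when 3
    ((b[0] & 0x0f) << 12) | ((b[1] & 0x3f) << 6) | (b[2] & 0x3f)
  when 4
    ((b[0] & 0x07) << 18) | ((b[1] & 0x3f) << 12) |
      ((b[2] & 0x3f) << 6) | (b[3] & 0x3f)
  end
end
```

For well-formed UTF-8 input the result should agree with the built-in `String#ord`, which is the simpler choice when a bare Integer is all that is needed.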

#index_to_uindex(byte_index) ⇒ Object

Converts a byte offset to a character offset. The byte offset must be between zero and the byte length of the string, inclusive; otherwise a RangeError is raised. Returns nil if the offset falls in the middle of a character, or if byte_index is nil.



# File 'lib/unicode_madness/unicode_string.rb', line 72

def index_to_uindex(byte_index)
  return nil if byte_index.nil?
  if byte_index < 0 || byte_index > length
    raise RangeError, 'index out of range'
  end
  
  chars = split('')
  char_index = 0
  chars.each do |ch|
    break if byte_index == 0
    byte_index -= ch.length
    return nil if byte_index < 0
    char_index += 1
  end
  char_index
end
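
To make the walk over characters concrete, here is a minimal sketch of the same conversion for a current Ruby, where `length` counts characters and `bytesize` counts bytes. `byte_to_char_index` is a hypothetical name used only for illustration.

```ruby
# Hypothetical helper (not part of UnicodeString): convert a byte offset
# into a character offset, returning nil for offsets inside a character.
def byte_to_char_index(str, byte_index)
  return nil if byte_index.nil?
  if byte_index < 0 || byte_index > str.bytesize
    raise RangeError, "index out of range"
  end

  char_index = 0
  str.each_char do |ch|
    break if byte_index == 0
    byte_index -= ch.bytesize      # consume this character's bytes
    return nil if byte_index < 0   # the offset pointed inside this character
    char_index += 1
  end
  char_index
end
```

With the string `"a\u3042b"` (one, three, and one bytes), byte offset 4 maps to character offset 2, while byte offsets 2 and 3 fall inside the middle character and yield nil.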

#kana? ⇒ Boolean

Returns a Boolean indicating whether this character is a hiragana or katakana character. (This string must contain only one character.)

Returns:

  • (Boolean)


# File 'lib/unicode_madness/unicode_string.rb', line 10

def kana?
  codepoint.kana?
end

#kanji? ⇒ Boolean

Returns a Boolean indicating whether this character is a kanji character. (This string must contain only one character.)

Returns:

  • (Boolean)


# File 'lib/unicode_madness/unicode_string.rb', line 4

def kanji?
  codepoint.kanji?
end

#uindex(substr, uoffset = 0) ⇒ Object

Like index, but returns a character offset instead of a byte offset. The starting offset is also given in characters rather than bytes. Returns nil if substr is not found.



# File 'lib/unicode_madness/unicode_string.rb', line 56

def uindex(substr, uoffset = 0)
  offset = uindex_to_index(uoffset)
  index_to_uindex(index(substr, offset))
end
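
The distinction this method addresses can be seen on a current Ruby, where `String#index` already counts characters and a byte offset has to be requested explicitly. The sketch below uses two hypothetical helpers to show both views of the same search; it illustrates the byte/character gap only and is not how UnicodeString itself works.

```ruby
# Hypothetical helpers (not part of UnicodeString) contrasting the two
# kinds of offset for the same substring search.

# Offset counted in characters (what String#index returns on Ruby 1.9+).
def char_offset_of(str, substr)
  str.index(substr)
end

# Offset counted in bytes, found by searching binary copies of both strings.
def byte_offset_of(str, substr)
  str.b.index(substr.b)
end
```

For `"a\u3042b"`, searching for `"b"` gives character offset 2 but byte offset 4, because the middle character occupies three bytes.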

#uindex_to_index(char_index) ⇒ Object

Converts a character offset to a byte offset. The character offset must be between zero and the character length of the string, inclusive; otherwise a RangeError is raised. Returns nil if char_index is nil.



# File 'lib/unicode_madness/unicode_string.rb', line 92

def uindex_to_index(char_index)
  return nil if char_index.nil?
  if char_index < 0 || char_index > jlength
    raise RangeError, 'index out of range'
  end
  
  chars = split('')
  byte_index = 0
  char_index.times do |i|
    byte_index += chars[i].length
  end
  byte_index
end
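
The same accumulation of per-character byte lengths can be sketched for a current Ruby as follows. `char_to_byte_index` is a hypothetical helper, assuming a UTF-8 string where `length` counts characters and `bytesize` counts bytes.

```ruby
# Hypothetical helper (not part of UnicodeString): convert a character
# offset into a byte offset by summing the sizes of the preceding characters.
def char_to_byte_index(str, char_index)
  return nil if char_index.nil?
  if char_index < 0 || char_index > str.length
    raise RangeError, "index out of range"
  end

  str.each_char.first(char_index).sum(&:bytesize)
end
```

For `"a\u3042b"`, character offset 2 maps to byte offset 4, and the one-past-the-end offset 3 maps to the byte length 5.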

#uslice(uoffset, ulength) ⇒ Object

Like slice, but takes the offset and length in characters instead of bytes. Negative lengths are not supported.



# File 'lib/unicode_madness/unicode_string.rb', line 63

def uslice(uoffset, ulength)
  offset = uindex_to_index(uoffset)
  # length is the byte length of the whole string, so this takes everything from offset to the end.
  substr = slice(offset, length)
  substr.split('')[0,ulength].join('')
end
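
A character-based slice can be sketched on a current Ruby by splitting into characters first. `char_slice` is a hypothetical helper; the explicit check for a negative length is an addition of this sketch, since the documented method simply does not support that case.

```ruby
# Hypothetical helper (not part of UnicodeString): slice by character
# offset and character count.
def char_slice(str, char_offset, char_length)
  if char_offset < 0 || char_offset > str.length
    raise RangeError, "index out of range"
  end
  # Assumption of this sketch: reject negative lengths explicitly.
  raise ArgumentError, "negative length" if char_length < 0

  str.chars[char_offset, char_length].join
end
```

On Ruby 1.9 and later, `str[char_offset, char_length]` gives the same substring for in-range arguments, since indexing is already character-based there.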

#wide_latin? ⇒ Boolean

Returns a Boolean indicating whether this character is a full-width Latin character. (This string must contain only one character.)

Returns:

  • (Boolean)


# File 'lib/unicode_madness/unicode_string.rb', line 16

def wide_latin?
  codepoint.wide_latin?
end
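
The three predicates kana?, kanji?, and wide_latin? all delegate to the codepoint object, whose definition is not shown here. As a rough idea of what such checks can look like, the sketch below tests a character's codepoint against standard Unicode ranges. The helper names and the exact ranges are assumptions of this sketch and may differ from what UCSCodepoint actually uses.

```ruby
# Hypothetical helpers (not part of UnicodeString or UCSCodepoint).
# The ranges are standard Unicode blocks, chosen for illustration.
def codepoint_in?(char, *ranges)
  unless char.length == 1
    raise RangeError, "string must be exactly one character long"
  end
  cp = char.ord
  ranges.any? { |range| range.cover?(cp) }
end

# Hiragana (U+3040..U+309F) and Katakana (U+30A0..U+30FF) blocks.
def kana_char?(char)
  codepoint_in?(char, 0x3040..0x309F, 0x30A0..0x30FF)
end

# CJK Unified Ideographs block (U+4E00..U+9FFF) only; extension blocks omitted.
def kanji_char?(char)
  codepoint_in?(char, 0x4E00..0x9FFF)
end

# Full-width Latin capital and small letters (U+FF21..U+FF3A, U+FF41..U+FF5A).
def wide_latin_char?(char)
  codepoint_in?(char, 0xFF21..0xFF3A, 0xFF41..0xFF5A)
end
```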