Module: UnicodeUtils

Defined in:
lib/unicode_utils.rb,
lib/unicode_utils/gc.rb,
lib/unicode_utils/nfc.rb,
lib/unicode_utils/nfd.rb,
lib/unicode_utils/sid.rb,
lib/unicode_utils/grep.rb,
lib/unicode_utils/nfkc.rb,
lib/unicode_utils/nfkd.rb,
lib/unicode_utils/debug.rb,
lib/unicode_utils/upcase.rb,
lib/unicode_utils/version.rb,
lib/unicode_utils/casefold.rb,
lib/unicode_utils/downcase.rb,
lib/unicode_utils/char_name.rb,
lib/unicode_utils/char_type.rb,
lib/unicode_utils/codepoint.rb,
lib/unicode_utils/each_word.rb,
lib/unicode_utils/titlecase.rb,
lib/unicode_utils/name_alias.rb,
lib/unicode_utils/read_cdata.rb,
lib/unicode_utils/cased_char_q.rb,
lib/unicode_utils/name_aliases.rb,
lib/unicode_utils/display_width.rb,
lib/unicode_utils/each_grapheme.rb,
lib/unicode_utils/simple_upcase.rb,
lib/unicode_utils/graphic_char_q.rb,
lib/unicode_utils/code_point_type.rb,
lib/unicode_utils/combining_class.rb,
lib/unicode_utils/jamo_short_name.rb,
lib/unicode_utils/simple_casefold.rb,
lib/unicode_utils/simple_downcase.rb,
lib/unicode_utils/east_asian_width.rb,
lib/unicode_utils/general_category.rb,
lib/unicode_utils/lowercase_char_q.rb,
lib/unicode_utils/titlecase_char_q.rb,
lib/unicode_utils/uppercase_char_q.rb,
lib/unicode_utils/char_display_width.rb,
lib/unicode_utils/conditional_casing.rb,
lib/unicode_utils/soft_dotted_char_q.rb,
lib/unicode_utils/white_space_char_q.rb,
lib/unicode_utils/case_ignorable_char_q.rb,
lib/unicode_utils/canonical_decomposition.rb,
lib/unicode_utils/canonical_equivalents_q.rb,
lib/unicode_utils/default_ignorable_char_q.rb,
lib/unicode_utils/compatibility_decomposition.rb,
lib/unicode_utils/hangul_syllable_decomposition.rb

Overview

This version of UnicodeUtils implements algorithms as defined by version 6.2.0 of the Unicode standard. Each public method is declared as a module_function of the UnicodeUtils module and defined in a separate file under the unicode_utils directory.

As a convenience, the toplevel unicode_utils file loads all methods (needs lots of memory!). Also as a convenience for irb usage, the file unicode_utils/u assigns the UnicodeUtils module to the toplevel U constant and loads all methods:

$ irb -r unicode_utils/u
irb(main):001:0> U.grep /angstrom/
=> [#<U+212B "Å" ANGSTROM SIGN utf8:e2,84,ab>]
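
Outside of irb, the same loading choices look roughly like this (a minimal sketch; the upcase call mirrors the example documented further below):

# load only a single method file, keeping the memory footprint small
require "unicode_utils/upcase"
UnicodeUtils.upcase("weiß") # => "WEISS"

# or load every method at once
require "unicode_utils"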

If a method takes a character as argument (usually named char), that argument can be an Integer, a String (in which case only the first code point counts) or any other object that responds to ord by returning an integer.
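
For example, the following calls all address the same character (a sketch, assuming unicode_utils/char_name has been required):

UnicodeUtils.char_name("A")   # => "LATIN CAPITAL LETTER A"
UnicodeUtils.char_name("Abc") # => "LATIN CAPITAL LETTER A" (only the first code point counts)
UnicodeUtils.char_name(0x41)  # => "LATIN CAPITAL LETTER A"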

All methods are non-destructive; string return values are in the same encoding as the strings passed as arguments, which must be in one of the Unicode encodings.

High-level methods are:

UnicodeUtils.upcase

full conversion to uppercase

UnicodeUtils.downcase

full conversion to lowercase

UnicodeUtils.titlecase

full conversion to titlecase

UnicodeUtils.casefold

case folding (case insensitive string comparison)

UnicodeUtils.nfd

Normalization Form D

UnicodeUtils.nfc

Normalization Form C

UnicodeUtils.nfkd

Normalization Form KD

UnicodeUtils.nfkc

Normalization Form KC

UnicodeUtils.each_grapheme

grapheme boundaries

UnicodeUtils.each_word

word boundaries

UnicodeUtils.char_name

character names

UnicodeUtils.grep

find code points by character name

Defined Under Namespace

Modules: Impl
Classes: Codepoint, NameAlias

Constant Summary

GENERAL_CATEGORY_PER_CP_MAP =
Impl.read_general_category_per_cp("general_category_per_cp")
GENERAL_CATEGORY_RANGES =
Impl.read_general_category_ranges("general_category_ranges")
CP_PREFERRED_ALIAS_STRING_MAP =
Hash.new.tap do |map|
  NAME_ALIASES_MAP.each { |cp, aliases|
    al =
      (aliases.find { |al| al.type == :correction } ||
       aliases.find { |al| al.type == :control } ||
       aliases.find { |al| al.type == :figment } ||
       aliases.find { |al| al.type == :alternate })
    map[cp] = al.name if al
  }
end
SPECIAL_UPCASE_MAP =

:nodoc:

Impl.read_multivalued_map("special_uc_map")
VERSION =

Corresponds to the unicode_utils gem version.

Conforms to Semantic Versioning as documented at semver.org.

Summary: MAJOR.MINOR.PATCHLEVEL

  • A backwards incompatible change causes a change in MAJOR

  • New features or non-bugfix improvements cause a change in MINOR

  • Bugfixes increase only PATCHLEVEL.

  • Pre-release versions append more info after a dash.

"1.4.0"
UNICODE_VERSION =

The version of Unicode implemented by this version of UnicodeUtils.

require "unicode_utils/version"
puts "Unicode #{UnicodeUtils::UNICODE_VERSION}"
"6.2.0"
CASEFOLD_F_MAP =

:nodoc:

Impl.read_multivalued_map("casefold_f_map")
SPECIAL_DOWNCASE_MAP =

:nodoc:

Impl.read_multivalued_map("special_lc_map")
NAME_MAP =

:nodoc:

Impl.read_names("names")
GENERAL_CATEGORY_TYPE_MAP =
Hash.new.tap { |map|
  GENERAL_CATEGORY_ALIAS_MAP.each_pair { |short, long|
    if short.length == 2
      map[short] = GENERAL_CATEGORY_ALIAS_MAP[short[0].to_sym]
    end
  }
}
WORD_BREAK_MAP =

Maps code points to integer codes. For the integer code to property mapping, see #compile_word_break_property in data/compile.rb.

Impl.read_hexdigit_map("word_break_property")
SIMPLE_TITLECASE_MAP =

:nodoc:

Impl.read_code_point_map("simple_tc_map")
SPECIAL_TITLECASE_MAP =

:nodoc:

Impl.read_multivalued_map("special_tc_map")
CDATA_DIR =

Absolute path to the directory from which UnicodeUtils loads its compiled Unicode data files at runtime.

File.absolute_path(File.join(File.dirname(__FILE__), "..", "..", "cdata"))
NAME_ALIASES_MAP =

:nodoc:

Impl.read_name_aliases("name_aliases")
GENERAL_CATEGORY_BASIC_WIDTH_MAP =
Hash.new.tap do |h|
  GENERAL_CATEGORY_IS_GRAPHIC_MAP.each_pair { |key, value|
    if value && key != :Mn && key != :Me
      h[key] = 1
    else
      h[key] = 0
    end
  }
end
GRAPHEME_CLUSTER_BREAK_MAP =

Maps code points to integer codes. For the integer code to property mapping, see #compile_grapheme_break_property in data/compile.rb.

Impl.read_hexdigit_map("grapheme_break_property")
SIMPLE_UPCASE_MAP =

:nodoc:

Impl.read_code_point_map("simple_uc_map")
GENERAL_CATEGORY_IS_GRAPHIC_MAP =
{
  Lu: true, Ll: true, Lt: true, Lm: true, Lo: true,
  Mn: true, Mc: true, Me: true,
  Nd: true, Nl: true, No: true,
  Pc: true, Pd: true, Ps: true, Pe: true, Pi: true, Pf: true, Po: true,
  Sm: true, Sc: true, Sk: true, So: true,
  Zs: true, Zl: false, Zp: false,
  Cc: false, Cf: false, Cs: false, Co: false, Cn: false
}
GENERAL_CATEGORY_CODE_POINT_TYPE =
{
  Lu: :Graphic, Ll: :Graphic, Lt: :Graphic, Lm: :Graphic, Lo: :Graphic,
  Mn: :Graphic, Mc: :Graphic, Me: :Graphic,
  Nd: :Graphic, Nl: :Graphic, No: :Graphic,
  Pc: :Graphic, Pd: :Graphic, Ps: :Graphic,
    Pe: :Graphic, Pi: :Graphic, Pf: :Graphic, Po: :Graphic,
  Sm: :Graphic, Sc: :Graphic, Sk: :Graphic, So: :Graphic,
  Zs: :Graphic, Zl: :Format, Zp: :Format,
  Cc: :Control, Cf: :Format, Cs: :Surrogate, Co: :Private_Use,
  # Cn is split into two types (Reserved and Noncharacter)!
  Cn: false
}
CN_CODE_POINT_TYPE =

:nodoc:

Hash.new.tap { |h|
  h.default = :Reserved
  # Sixty-six code points are noncharacters
  ary = (0xFDD0..0xFDEF).to_a
  0.upto(16) { |d|
    ary << "#{d.to_s(16)}FFFE".to_i(16)
    ary << "#{d.to_s(16)}FFFF".to_i(16)
  }
  ary.each { |cp| h[cp] = :Noncharacter }
  raise "assertion error #{h.size}" unless h.size == 66
}
COMBINING_CLASS_MAP =

:nodoc:

Impl.read_combining_class_map()
JAMO_SHORT_NAME_MAP =

:nodoc:

Impl.read_names("jamo_short_names")
CASEFOLD_C_MAP =

:nodoc:

Impl.read_code_point_map("casefold_c_map")
CASEFOLD_S_MAP =

:nodoc:

Impl.read_code_point_map("casefold_s_map")
SIMPLE_DOWNCASE_MAP =

:nodoc:

Impl.read_code_point_map("simple_lc_map")
EAST_ASIAN_WIDTH_MAP_PER_CP =
Impl.read_east_asian_width_per_cp("east_asian_width_property_per_cp")
EAST_ASIAN_WIDTH_RANGES =
Impl.read_east_asian_width_ranges("east_asian_width_property_ranges")
GENERAL_CATEGORY_ALIAS_MAP =
Impl.read_symbol_map("general_category_aliases")
PROP_LOWERCASE_SET =

:nodoc:

Impl.read_code_point_set("prop_set_lowercase")
TITLECASE_LETTER_SET =

:nodoc:

Impl.read_code_point_set("cat_set_titlecase")
PROP_UPPERCASE_SET =

:nodoc:

Impl.read_code_point_set("prop_set_uppercase")
SOFT_DOTTED_SET =

:nodoc:

Impl.read_code_point_set("soft_dotted_set")
WHITE_SPACE_SET =

:nodoc:

Impl.read_code_point_set("white_space_set")
CASE_IGNORABLE_SET =

:nodoc:

Impl.read_code_point_set("case_ignorable_set")
CANONICAL_DECOMPOSITION_MAP =
Impl.read_multivalued_map("canonical_decomposition_map")
PROP_DEFAULT_IGNORABLE_SET =
Impl.read_code_point_set("prop_set_default_ignorable")
COMPATIBILITY_DECOMPOSITION_MAP =
Impl.read_multivalued_map("compatibility_decomposition_map")

Class Method Summary

Class Method Details

.canonical_decomposition(str) ⇒ Object

Get the canonical decomposition of the given string, also called Normalization Form D, or NFD for short.

The Unicode standard defines multiple representations for some characters: one as a single code point and one or more as a combination of multiple code points. This function “decomposes” such characters in str into the latter representation.

Example:

require "unicode_utils/canonical_decomposition"
# LATIN SMALL LETTER A WITH ACUTE => LATIN SMALL LETTER A, COMBINING ACUTE ACCENT
UnicodeUtils.canonical_decomposition("\u{E1}") => "\u{61}\u{301}"

See also: UnicodeUtils.nfd



# File 'lib/unicode_utils/canonical_decomposition.rb', line 28

def canonical_decomposition(str)
  res = String.new.force_encoding(str.encoding)
  str.each_codepoint { |cp|
    if cp >= 0xAC00 && cp <= 0xD7A3 # hangul syllable
      Impl.append_hangul_syllable_decomposition(res, cp)
    else
      mapping = CANONICAL_DECOMPOSITION_MAP[cp]
      if mapping
        Impl.append_recursive_canonical_decomposition_mapping(res, mapping)
      else
        res << cp
      end
    end
  }
  Impl.put_into_canonical_order(res)
end

.canonical_equivalents?(a, b) ⇒ Boolean

The strings a and b are canonical equivalents if their canonical decompositions are equal.

Example:

require "unicode_utils/canonical_equivalents_q"
UnicodeUtils.canonical_equivalents?("Äste", "A\u{308}ste") => true
UnicodeUtils.canonical_equivalents?("Äste", "Aste") => false

Returns:

  • (Boolean)


# File 'lib/unicode_utils/canonical_equivalents_q.rb', line 15

def canonical_equivalents?(a, b)
  UnicodeUtils.canonical_decomposition(a) ==
    UnicodeUtils.canonical_decomposition(b)
end

.case_ignorable_char?(char) ⇒ Boolean

Returns true if the given character is case-ignorable as defined by Unicode 5.0, section 3.13.
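
A brief sketch of expected results (assuming the Unicode 6.2.0 data bundled with this version; combining marks are case-ignorable, ordinary letters are not):

require "unicode_utils/case_ignorable_char_q"
UnicodeUtils.case_ignorable_char?("\u{301}") # => true (COMBINING ACUTE ACCENT)
UnicodeUtils.case_ignorable_char?("a")       # => false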

Returns:

  • (Boolean)


# File 'lib/unicode_utils/case_ignorable_char_q.rb', line 11

def case_ignorable_char?(char)
  CASE_IGNORABLE_SET.include?(char.ord)
end

.cased_char?(char) ⇒ Boolean

A cased char is a character that has the Unicode property Lowercase or Uppercase or the general category Titlecase_Letter.
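
A brief sketch of expected results:

require "unicode_utils/cased_char_q"
UnicodeUtils.cased_char?("a") # => true
UnicodeUtils.cased_char?("1") # => false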

See also: lowercase_char?, uppercase_char?, titlecase_char?

Returns:

  • (Boolean)


# File 'lib/unicode_utils/cased_char_q.rb', line 13

def cased_char?(char)
  lowercase_char?(char) || uppercase_char?(char) || titlecase_char?(char)
end

.casefold(str) ⇒ Object

Perform full case folding. The returned string may be longer than str. The purpose of case folding is case insensitive string comparison.

Examples:

require "unicode_utils/casefold"
UnicodeUtils.casefold("Ümit") == UnicodeUtils.casefold("ümit") => true
UnicodeUtils.casefold("WEISS") == UnicodeUtils.casefold("weiß") => true


# File 'lib/unicode_utils/casefold.rb', line 19

def casefold(str)
  String.new.force_encoding(str.encoding).tap do |res|
    str.each_codepoint { |cp|
      if mapping = CASEFOLD_C_MAP[cp]
        res << mapping
      elsif mapping = CASEFOLD_F_MAP[cp]
        mapping.each { |m| res << m }
      else
        res << cp
      end
    }
  end
end

.char_display_width(char) ⇒ Object

Get the width of char when displayed with a fixed pitch font.

Some code points (especially from east asian scripts) take the width of two characters, while others have no width.

Examples:

require "unicode_utils/char_display_width"
UnicodeUtils.char_display_width("別")  # => 2
UnicodeUtils.char_display_width(0x308) # => 0
UnicodeUtils.char_display_width("a")   # => 1

Performs the same logic as UnicodeUtils.display_width, but for a single code point.



# File 'lib/unicode_utils/char_display_width.rb', line 21

def char_display_width(char)
  cp = char.ord
  # copied from display_width, keep in sync!
  case UnicodeUtils.east_asian_width(cp)
  when :Wide, :Fullwidth then 2
  else GENERAL_CATEGORY_BASIC_WIDTH_MAP[UnicodeUtils.gc(cp)]
  end
end

.char_name(char) ⇒ Object

Get the normative Unicode name of the given character.

Private Use code points have no name; this function returns nil for such code points.

All control characters have the special name “<control>”. All other characters have a unique name.

Example:

require "unicode_utils/char_name"
UnicodeUtils.char_name "ᾀ" => "GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI"
UnicodeUtils.char_name "\t" => "<control>"

Note that this method deviates from the Unicode Name property in two respects:

  1. It returns “<control>” for control codes, whereas the Unicode Name property for these code points is an empty string

  2. It returns nil for other non-graphic, non-format code points, whereas the Unicode Name property for these code points is also an empty string

See also: UnicodeUtils.sid



# File 'lib/unicode_utils/char_name.rb', line 34

def char_name(char)
  # TODO: improve with code point labels, see section 4.8 in Unicode 6.0.0
  if char.kind_of?(Integer)
    cp = char
    str = nil
  else
    cp = char.ord
    str = char
  end
  NAME_MAP[cp] ||
    case cp
    when 0x3400..0x4DB5, 0x4E00..0x9FCC, 0x20000..0x2A6D6, 0x2A700..0x2B734, 0x2B740..0x2B81D
      "CJK UNIFIED IDEOGRAPH-#{sprintf('%04X', cp)}"
    when 0xAC00..0xD7A3
      str ||= cp.chr(Encoding::UTF_8)
      "HANGUL SYLLABLE ".tap do |n|
        hangul_syllable_decomposition(str).each_char { |c|
          n << (jamo_short_name(c) || '')
        }
      end
    end
end

.char_type(char) ⇒ Object

Get the long major general category alias of char.

Example:

require "unicode_utils/char_type"
UnicodeUtils.char_type("1") # => :Number

Always returns a symbol when char is in the Unicode code point range.

See also: UnicodeUtils.general_category



# File 'lib/unicode_utils/char_type.rb', line 27

def char_type(char)
  GENERAL_CATEGORY_TYPE_MAP[UnicodeUtils.gc(char)]
end

.code_point_type(integer) ⇒ Object

Get the code point type of the given integer (must be an instance of Integer) as defined by the Unicode standard.

If integer is a code point (anything in UnicodeUtils::Codepoint::RANGE), returns one of the following symbols:

:Graphic
:Format
:Control
:Private_Use
:Surrogate
:Noncharacter
:Reserved

For an exact meaning of these values, read the sections “Conformance/Characters and Encoding” and “General Structure/Types of Codepoints” in the Unicode standard.

Following is a paraphrased excerpt:

Surrogate, Noncharacter and Reserved code points are not assigned to an abstract character. All other code points are assigned to an abstract character.

Reserved code points are also called Undesignated code points, all others are Designated code points.

Returns nil if integer is not a code point.
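
A brief sketch of expected results (0x110000 lies outside the codespace):

require "unicode_utils/code_point_type"
UnicodeUtils.code_point_type(0x41)     # => :Graphic
UnicodeUtils.code_point_type(0xD800)   # => :Surrogate
UnicodeUtils.code_point_type(0x110000) # => nil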



# File 'lib/unicode_utils/code_point_type.rb', line 61

def code_point_type(integer)
  cpt = GENERAL_CATEGORY_CODE_POINT_TYPE[UnicodeUtils.gc(integer)]
  if false == cpt
    cpt = CN_CODE_POINT_TYPE[integer]
  end
  cpt
end

.combining_class(char) ⇒ Object

Get the combining class of the given character as an integer in the range 0..255.
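
A brief sketch of expected results (230 is the class of combining marks placed above their base character; most code points have class 0):

require "unicode_utils/combining_class"
UnicodeUtils.combining_class("\u{301}") # => 230 (COMBINING ACUTE ACCENT)
UnicodeUtils.combining_class("a")       # => 0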



# File 'lib/unicode_utils/combining_class.rb', line 12

def combining_class(char)
  COMBINING_CLASS_MAP[char.ord]
end

.compatibility_decomposition(str) ⇒ Object

Get the compatibility decomposition of the given string, also called Normalization Form KD, or NFKD for short.

Compatibility decomposition decomposes more code points than canonical decomposition, and, contrary to Normalization Forms D and C, this normalization can alter how a string is displayed.

Example:

require "unicode_utils/compatibility_decomposition"
# LATIN SMALL LIGATURE FI => LATIN SMALL LETTER F, LATIN SMALL LETTER I
UnicodeUtils.compatibility_decomposition("ﬁ") => "fi"

See also: UnicodeUtils.nfkd



# File 'lib/unicode_utils/compatibility_decomposition.rb', line 26

def compatibility_decomposition(str)
  res = String.new.force_encoding(str.encoding)
  str.each_codepoint { |cp|
    if cp >= 0xAC00 && cp <= 0xD7A3 # hangul syllable
      Impl.append_hangul_syllable_decomposition(res, cp)
    else
      Impl.append_recursive_compatibility_decomposition_mapping(res, cp)
    end
  }
  Impl.put_into_canonical_order(res)
end

.debug(str, opts = {}) ⇒ Object

Print a table with detailed information about each code point in str. opts can have the following keys:

:io

An IO compatible object. Receives the output. Defaults to $stdout.

str may also be an Integer, in which case it is interpreted as a single code point that must be in UnicodeUtils::Codepoint::RANGE.

Examples:

$ ruby -r unicode_utils/u -e 'U.debug "良い一日"'
 Char | Ordinal | Sid                        | General Category | UTF-8
------+---------+----------------------------+------------------+----------
 "良" |    826F | CJK UNIFIED IDEOGRAPH-826F | Other_Letter     | E8 89 AF
 "い" |    3044 | HIRAGANA LETTER I          | Other_Letter     | E3 81 84
 "一" |    4E00 | CJK UNIFIED IDEOGRAPH-4E00 | Other_Letter     | E4 B8 80
 "日" |    65E5 | CJK UNIFIED IDEOGRAPH-65E5 | Other_Letter     | E6 97 A5

$ ruby -r unicode_utils/u -e 'U.debug 0xd800'
 Char | Ordinal | Sid              | General Category | UTF-8
------+---------+------------------+------------------+-------
 N/A  |    D800 | <surrogate-D800> | Surrogate        | N/A

The output is purely informal and may change even in minor releases.



# File 'lib/unicode_utils/debug.rb', line 37

def debug(str, opts = {})
  io = opts[:io] || $stdout
  table = [Impl::DEBUG_COLUMNS.keys]
  if str.kind_of?(Integer)
    table << Impl::DEBUG_COLUMNS.values.map { |f| f.call(str) }
  else
    str.each_codepoint { |cp|
      table << Impl::DEBUG_COLUMNS.values.map { |f| f.call(cp) }
    }
  end
  Impl.print_table(table, io)
  nil
end

.default_ignorable_char?(char) ⇒ Boolean

True if the given character has the Unicode property Default_Ignorable_Code_Point (see section 5.3 in Unicode 6.0.0).

When a system (e.g. a font) can’t display a default-ignorable code point, it is allowed to simply ignore it, i.e. skip it (as opposed to other characters, which must at least be displayed with a replacement character).
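
A brief sketch of expected results:

require "unicode_utils/default_ignorable_char_q"
UnicodeUtils.default_ignorable_char?("\u{200B}") # => true (ZERO WIDTH SPACE)
UnicodeUtils.default_ignorable_char?("a")        # => false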

Returns:

  • (Boolean)


# File 'lib/unicode_utils/default_ignorable_char_q.rb', line 17

def default_ignorable_char?(char)
  PROP_DEFAULT_IGNORABLE_SET.include?(char.ord)
end

.display_width(str) ⇒ Object

Get the width of str when displayed with a fixed pitch font.

Counts code points, where code points with an East Asian width of Wide or Fullwidth count for two, non-graphic code points (e.g. control characters, including newline!) and non-spacing marks count for zero, and all others count for one.

Examples:

require "unicode_utils/display_width"
"別れ".length => 2
UnicodeUtils.display_width("別れ") => 4
"12".length => 2
UnicodeUtils.display_width("12") => 2
"a\u{308}".length => 2
UnicodeUtils.display_width("a\u{308}") => 1

Unicode assigns some reserved code points an East Asian width of Wide. Some systems correctly display a double-width replacement character, others do not.

See also: UnicodeUtils.graphic_char?, UnicodeUtils.east_asian_width



# File 'lib/unicode_utils/display_width.rb', line 41

def display_width(str)
  str.each_codepoint.reduce(0) { |sum, cp|
    sum +
      case UnicodeUtils.east_asian_width(cp)
      when :Wide, :Fullwidth then 2
      else GENERAL_CATEGORY_BASIC_WIDTH_MAP[UnicodeUtils.gc(cp)]
      end
  }
end

.downcase(str, language_id = nil) ⇒ Object

Perform a full case-conversion of str to lowercase according to the Unicode standard.

Some conversion rules are language dependent; these are in effect when a non-nil language_id is given. If non-nil, the language_id must be a two-letter language code as defined in BCP 47 (tools.ietf.org/rfc/bcp/bcp47.txt), given as a symbol. If a language doesn’t have a two-letter code, the three-letter code is to be used. If locale-independent behaviour is required, nil should be passed explicitly, because a later version of UnicodeUtils may default to something else.

Examples:

require "unicode_utils/downcase"
UnicodeUtils.downcase("") => ""
UnicodeUtils.downcase("aBI\u{307}", :tr) => "abi"


# File 'lib/unicode_utils/downcase.rb', line 28

def downcase(str, language_id = nil)
  String.new.force_encoding(str.encoding).tap { |res|
    if Impl::LANGS_WITH_RULES.include?(language_id)
      # ensure O(1) lookup by index
      str = str.encode(Encoding::UTF_32LE)
    end
    pos = 0
    str.each_codepoint { |cp|
      special_mapping =
        Impl.conditional_downcase_mapping(cp, str, pos, language_id) ||
        SPECIAL_DOWNCASE_MAP[cp]
      if special_mapping
        special_mapping.each { |m| res << m }
      else
        res << (SIMPLE_DOWNCASE_MAP[cp] || cp)
      end
      pos += 1
    }
  }
end

.each_grapheme(str) {|grapheme| ... } ⇒ Object

Iterate over the grapheme clusters that make up str. A grapheme cluster is a user-perceived character (the basic unit of a writing system for a language) and consists of one or more code points.

This method uses the default Unicode algorithm for extended grapheme clusters.

Returns an enumerator if no block is given.

Examples:

require "unicode_utils/each_grapheme"
UnicodeUtils.each_grapheme("a\r\nb") { |g| p g }

prints:

"a"
"\r\n"
"b"

and

UnicodeUtils.each_grapheme("a\r\nb").count => 3

Yields:

  • (grapheme)


# File 'lib/unicode_utils/each_grapheme.rb', line 35

def each_grapheme(str)
  return enum_for(__method__, str) unless block_given?
  c0 = nil
  c0_prop = nil
  grapheme = String.new.force_encoding(str.encoding)
  str.each_codepoint { |c|
    gbreak = false
    c_prop = GRAPHEME_CLUSTER_BREAK_MAP[c]
    
    ### rules ###
    if c0_prop == 0x0 && c_prop == 0x1
      # don't break CR LF
    elsif c0_prop == 0x0 || c0_prop == 0x1 || c0_prop == 0x2
      # break after controls
      gbreak = true
    elsif c_prop == 0x0 || c_prop == 0x1 || c_prop == 0x2
      # break before controls
      gbreak = true
    elsif c0_prop == 0x6 && (c_prop == 0x6 || c_prop == 0x7 ||
                             c_prop == 0x9 || c_prop == 0xA)
      # don't break hangul syllable
    elsif (c0_prop == 0x9 || c0_prop == 0x7) &&
          (c_prop == 0x7 || c_prop == 0x8)
      # don't break hangul syllable
    elsif (c0_prop == 0xA || c0_prop == 0x8) && c_prop == 0x8
      # don't break hangul syllable
    elsif c0_prop == 0xB && c_prop == 0xB
      # don't break between regional indicator symbols
    elsif c_prop == 0x3
      # don't break before extending characters
    elsif c_prop == 0x5
      # don't break before SpacingMarks
    elsif c0_prop == 0x4
      # don't break after Prepend characters
    else
      # break everywhere
      gbreak = true
    end
    #############

    if gbreak && !grapheme.empty?
      yield grapheme
      grapheme = String.new.force_encoding(str.encoding)
    end
    grapheme << c
    c0 = c
    c0_prop = c_prop
  }
  yield grapheme unless grapheme.empty?
end

.each_word(str) {|word| ... } ⇒ Object

Split str along word boundaries according to Unicode’s Default Word Boundary Specification, calling the given block with each word. Returns str, or an enumerator if no block is given.

Example:

require "unicode_utils/each_word"
UnicodeUtils.each_word("Hello, world!").to_a => ["Hello", ",", " ", "world", "!"]

Yields:

  • (word)


# File 'lib/unicode_utils/each_word.rb', line 20

def each_word(str)
  return enum_for(__method__, str) unless block_given?
  cs = str.each_codepoint.map { |c| WORD_BREAK_MAP[c] }
  cs << nil << nil # for negative indices
  word = String.new.force_encoding(str.encoding)
  i = 0
  str.each_codepoint { |c|
    word << c
    if Impl.word_break?(cs, i) && !word.empty?
      yield word
      word = String.new.force_encoding(str.encoding)
    end
    i += 1
  }
  yield word unless word.empty?
  str
end

.east_asian_width(char) ⇒ Object

Returns the default width of the given code point as described in “UAX #11: East Asian Width” (unicode.org/reports/tr11/).

Each code point is mapped to one of the following six symbols: :Neutral, :Ambiguous, :Halfwidth, :Wide, :Fullwidth, :Narrow.
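
A brief sketch of expected results (CJK ideographs are Wide, ASCII letters are Narrow):

require "unicode_utils/east_asian_width"
UnicodeUtils.east_asian_width("別") # => :Wide
UnicodeUtils.east_asian_width("a")  # => :Narrow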



# File 'lib/unicode_utils/east_asian_width.rb', line 18

def east_asian_width(char)
  cp = char.ord
  EAST_ASIAN_WIDTH_RANGES.each { |pair|
    return pair[1] if pair[0].cover?(cp)
  }
  EAST_ASIAN_WIDTH_MAP_PER_CP[cp]
end

.gc(char) ⇒ Object

Get the two letter general category alias of the given char. The first letter denotes a major class, the second letter a subclass of the major class.

See section 4.5 in Unicode 6.0.0.

Example:

require "unicode_utils/gc"
UnicodeUtils.gc("A") # => :Lu (Letter, uppercase)

Returns nil for ordinals outside the Unicode code point range, a two letter symbol otherwise.

See also: UnicodeUtils.general_category, UnicodeUtils.char_type



# File 'lib/unicode_utils/gc.rb', line 28

def gc(char)
  cp = char.ord
  cat = GENERAL_CATEGORY_PER_CP_MAP[cp] and return cat
  GENERAL_CATEGORY_RANGES.each { |pair|
    return pair[1] if pair[0].cover?(cp)
  }
  if cp >= 0x0 && cp <= 0x10FFFF
    :Cn # Other, not assigned
  else
    nil
  end
end

.general_category(char) ⇒ Object

Get the long general category alias of char.

Example:

require "unicode_utils/general_category"
UnicodeUtils.general_category("A") # => :Uppercase_Letter

Returns a symbol if char is in the Unicode code point range, nil otherwise.

See also: UnicodeUtils.gc, UnicodeUtils.char_type



# File 'lib/unicode_utils/general_category.rb', line 22

def general_category(char)
  GENERAL_CATEGORY_ALIAS_MAP[UnicodeUtils.gc(char)]
end

.graphic_char?(char) ⇒ Boolean

Returns true if the given char is a graphic char, false otherwise. See table 2-3 in section 2.4 of Unicode 6.0.0.

Examples:

require "unicode_utils/graphic_char_q"
UnicodeUtils.graphic_char?("a")  # => true
UnicodeUtils.graphic_char?("\n") # => false
UnicodeUtils.graphic_char?(0x0)  # => false

Returns:

  • (Boolean)


# File 'lib/unicode_utils/graphic_char_q.rb', line 26

def graphic_char?(char)
  GENERAL_CATEGORY_IS_GRAPHIC_MAP[UnicodeUtils.gc(char)]
end

.grep(regexp) ⇒ Object

Get an array of all Codepoint instances in Codepoint::RANGE whose name matches regexp. Matching is case insensitive.

require "unicode_utils/grep"
UnicodeUtils.grep(/angstrom/) => [#<U+212B "Å" ANGSTROM SIGN utf8:e2,84,ab>]


# File 'lib/unicode_utils/grep.rb', line 12

def grep(regexp)
  # TODO: enhance behaviour by searching aliases in NameAliases.txt
  unless regexp.casefold?
    regexp = Regexp.new(regexp.source, Regexp::IGNORECASE)
  end
  Codepoint::RANGE.select { |cp|
    regexp =~ UnicodeUtils.char_name(cp)
  }.map { |cp| Codepoint.new(cp) }
end

.hangul_syllable_decomposition(char) ⇒ Object

Derives the canonical decomposition of the given Hangul syllable.

Example:

require "unicode_utils/hangul_syllable_decomposition"
UnicodeUtils.hangul_syllable_decomposition("\u{d4db}") => "\u{1111}\u{1171}\u{11b6}"


# File 'lib/unicode_utils/hangul_syllable_decomposition.rb', line 11

def hangul_syllable_decomposition(char)
  String.new.force_encoding(char.encoding).tap do |str|
    Impl.append_hangul_syllable_decomposition(str , char.ord)
  end
end

.jamo_short_name(char) ⇒ Object

The Jamo Short Name property of the given character (defaults to nil).

Example:

require "unicode_utils/jamo_short_name"
UnicodeUtils.jamo_short_name("\u{1101}") => "GG"


# File 'lib/unicode_utils/jamo_short_name.rb', line 16

def jamo_short_name(char)
  JAMO_SHORT_NAME_MAP[char.ord]
end

.lowercase_char?(char) ⇒ Boolean

True if the given character has the Unicode property Lowercase.
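
A brief sketch of expected results:

require "unicode_utils/lowercase_char_q"
UnicodeUtils.lowercase_char?("a") # => true
UnicodeUtils.lowercase_char?("A") # => false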

Returns:

  • (Boolean)


# File 'lib/unicode_utils/lowercase_char_q.rb', line 10

def lowercase_char?(char)
  PROP_LOWERCASE_SET.include?(char.ord)
end

.name_aliases(char) ⇒ Object

Get an Enumerable of formal name aliases of the given character. Returns an empty Enumerable if the character doesn’t have an alias.

The aliases are instances of UnicodeUtils::NameAlias, the order of the aliases in the returned Enumerable is preserved from NameAliases.txt in the Unicode Character Database.

Example:

require "unicode_utils/name_aliases"
UnicodeUtils.name_aliases("\n").map(&:name) # => ["LINE FEED", "NEW LINE", "END OF LINE", "LF", "NL", "EOL"]

See also: UnicodeUtils.char_name



# File 'lib/unicode_utils/name_aliases.rb', line 24

def name_aliases(char)
  NAME_ALIASES_MAP[char.ord]
end

.nfc(str) ⇒ Object

Get str in Normalization Form C.

The Unicode standard defines multiple representations for some characters: one as a single code point and one or more as a combination of multiple code points. This function “composes” these characters into the former representation.

Example:

require "unicode_utils/nfc"
UnicodeUtils.nfc("La\u{308}mpchen") => "Lämpchen"


# File 'lib/unicode_utils/nfc.rb', line 136

def nfc(str)
  str = UnicodeUtils.canonical_decomposition(str)
  Impl.composition(str)
end

.nfd(str) ⇒ Object

Get str in Normalization Form D.

Alias for UnicodeUtils.canonical_decomposition.



# File 'lib/unicode_utils/nfd.rb', line 10

def nfd(str)
  UnicodeUtils.canonical_decomposition(str)
end

.nfkc(str) ⇒ Object

Get str in Normalization Form KC.

Normalization Form KC is compatibility decomposition (NFKD) followed by composition. Like NFKD, this normalization can alter how a string is displayed.

Example:

require "unicode_utils/nfkc"
# LATIN SMALL LIGATURE FI => LATIN SMALL LETTER F, LATIN SMALL LETTER I
UnicodeUtils.nfkc("ﬁ") => "fi"

See also: UnicodeUtils.compatibility_decomposition



# File 'lib/unicode_utils/nfkc.rb', line 21

def nfkc(str)
  str = UnicodeUtils.compatibility_decomposition(str)
  Impl.composition(str)
end

.nfkd(str) ⇒ Object

Get str in Normalization Form KD.

Alias for UnicodeUtils.compatibility_decomposition.



# File 'lib/unicode_utils/nfkd.rb', line 10

def nfkd(str)
  UnicodeUtils.compatibility_decomposition(str)
end

.sid(code_point) ⇒ Object

Returns a unique string identifier for every code point. Returns nil if code_point is not in the Unicode codespace. code_point must be an Integer.

The returned string identifier is either the non-empty Name property value of code_point, a non-empty Name_Alias string property value of code_point, or the code point label as described by section “Code Point Labels” in chapter 4.8 “Name” of the Unicode standard.

If the returned identifier starts with “<”, it is a code point label and it ends with “>”. Otherwise it is the normative name or a formal alias string.

The exact name/alias/label selection algorithm may change even in minor UnicodeUtils releases, but overall behaviour will stay the same in spirit.

The selection process in this version of UnicodeUtils is:

  1. Use an alias of type :correction, :control, :figment or :alternate (with the listed precedence) if available

  2. Use the Unicode Name property value if it is not empty

  3. Construct a code point label in angle brackets.

Examples:

require "unicode_utils/sid"

U.sid 0xa     # => "LINE FEED"
U.sid 0x0     # => "NULL"
U.sid 0xfeff  # => "BYTE ORDER MARK"
U.sid 0xe000  # => "<private-use-E000>"
U.sid 0x61    # => "LATIN SMALL LETTER A"
U.sid -1      # => nil


# File 'lib/unicode_utils/sid.rb', line 53

def sid(code_point)
  s = CP_PREFERRED_ALIAS_STRING_MAP[code_point] and return s
  cn = UnicodeUtils.char_name(code_point)
  return cn if cn && cn !~ /\A(\<|\z)/
  ct = UnicodeUtils.code_point_type(code_point) or return nil
  ts = ct.to_s.downcase.gsub('_', '-')
  "<#{ts}-#{code_point.to_s(16).upcase.rjust(4, '0')}>"
end

.simple_casefold(str) ⇒ Object

Perform simple case folding. Contrary to full case folding, this uses only one-to-one mappings, so that the length of the returned string is equal to the length of str.

The purpose of case folding is case insensitive string comparison.

Examples:

require "unicode_utils/simple_casefold"
UnicodeUtils.simple_casefold("Ümit") == UnicodeUtils.simple_casefold("ümit") => true
UnicodeUtils.simple_casefold("WEISS") == UnicodeUtils.simple_casefold("weiß") => false

See also: UnicodeUtils.casefold



# File 'lib/unicode_utils/simple_casefold.rb', line 24

def simple_casefold(str)
  String.new.force_encoding(str.encoding).tap do |res|
    str.each_codepoint { |cp|
      res << (CASEFOLD_C_MAP[cp] || CASEFOLD_S_MAP[cp] || cp)
    }
  end
end

.simple_downcase(str) ⇒ Object

Map each code point in str that has a single code point lowercase-mapping to that lowercase mapping. The returned string has the same length as the original string.

This function is locale independent.

Examples:

require "unicode_utils/simple_downcase"
UnicodeUtils.simple_downcase("ÜMIT: 123") => "ümit: 123"
UnicodeUtils.simple_downcase("STRASSE") => "strasse"


# File 'lib/unicode_utils/simple_downcase.rb', line 20

def simple_downcase(str)
  String.new.force_encoding(str.encoding).tap { |res|
    str.each_codepoint { |cp|
      res << (SIMPLE_DOWNCASE_MAP[cp] || cp)
    }
  }
end

.simple_upcase(str) ⇒ Object

Map each code point in str that has a single code point uppercase-mapping to that uppercase mapping. The returned string has the same length as the original string.

This function is locale independent.

Examples:

require "unicode_utils/simple_upcase"
UnicodeUtils.simple_upcase("ümit: 123") => "ÜMIT: 123"
UnicodeUtils.simple_upcase("weiß") => "WEIß"


# File 'lib/unicode_utils/simple_upcase.rb', line 20

def simple_upcase(str)
  String.new.force_encoding(str.encoding).tap { |res|
    str.each_codepoint { |cp|
      res << (SIMPLE_UPCASE_MAP[cp] || cp)
    }
  }
end

.soft_dotted_char?(char) ⇒ Boolean

Returns true if the given character has the Unicode property Soft_Dotted.
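
A brief sketch of expected results (LATIN SMALL LETTER I carries a soft dot, LATIN SMALL LETTER O does not):

require "unicode_utils/soft_dotted_char_q"
UnicodeUtils.soft_dotted_char?("i") # => true
UnicodeUtils.soft_dotted_char?("o") # => false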

Returns:

  • (Boolean)


# File 'lib/unicode_utils/soft_dotted_char_q.rb', line 11

def soft_dotted_char?(char)
  SOFT_DOTTED_SET.include?(char.ord)
end

.titlecase(str, language_id = nil) ⇒ Object

Convert the first cased character after each word boundary to titlecase and all other cased characters to lowercase. For many, but not all characters, the titlecase mapping is the same as the uppercase mapping.

Some conversion rules are language dependent; these are in effect when a non-nil language_id is given. If non-nil, the language_id must be a two-letter language code as defined in BCP 47 (tools.ietf.org/rfc/bcp/bcp47.txt), given as a symbol. If a language doesn’t have a two-letter code, the three-letter code is to be used. If locale-independent behaviour is required, nil should be passed explicitly, because a later version of UnicodeUtils may default to something else.

Example:

require "unicode_utils/titlecase"
UnicodeUtils.titlecase("hello, world!") => "Hello, World!"


# File 'lib/unicode_utils/titlecase.rb', line 32

def titlecase(str, language_id = nil)
  String.new.force_encoding(str.encoding).tap do |res|
    # ensure O(1) lookup by index
    str = str.encode(Encoding::UTF_32LE)
    i = 0
    each_word(str) { |word|
      cased_char_found = false
      word.each_codepoint { |cp|
        cased = cased_char?(cp)
        if !cased_char_found && cased
          cased_char_found = true
          special_mapping =
            Impl.conditional_titlecase_mapping(cp, str, i, language_id) ||
            SPECIAL_TITLECASE_MAP[cp]
          if special_mapping
            special_mapping.each { |m| res << m }
          else
            res << (SIMPLE_TITLECASE_MAP[cp] || cp)
          end
        elsif cased
          special_mapping =
            Impl.conditional_downcase_mapping(cp, str, i, language_id) ||
            SPECIAL_DOWNCASE_MAP[cp]
          if special_mapping
            special_mapping.each { |m| res << m }
          else
            res << (SIMPLE_DOWNCASE_MAP[cp] || cp)
          end
        else
          res << cp
        end
        i += 1
      }
    }
  end
end

.titlecase_char?(char) ⇒ Boolean

True if the given character has the General_Category Titlecase_Letter (Lt).
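
A brief sketch of expected results (U+01C5 is a titlecase digraph):

require "unicode_utils/titlecase_char_q"
UnicodeUtils.titlecase_char?("\u{1C5}") # => true (LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON)
UnicodeUtils.titlecase_char?("D")       # => false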

Returns:

  • (Boolean)


# File 'lib/unicode_utils/titlecase_char_q.rb', line 11

def titlecase_char?(char)
  TITLECASE_LETTER_SET.include?(char.ord)
end

.upcase(str, language_id = nil) ⇒ Object

Perform a full case-conversion of str to uppercase according to the Unicode standard.

Some conversion rules are language dependent; these are in effect when a non-nil language_id is given. If non-nil, the language_id must be a two-letter language code as defined in BCP 47 (tools.ietf.org/rfc/bcp/bcp47.txt), given as a symbol. If a language doesn’t have a two-letter code, the three-letter code is to be used. If locale-independent behaviour is required, nil should be passed explicitly, because a later version of UnicodeUtils may default to something else.

Examples:

require "unicode_utils/upcase"
UnicodeUtils.upcase("weiß") => "WEISS"
UnicodeUtils.upcase("i", :en) => "I"
UnicodeUtils.upcase("i", :tr) => "İ"


# File 'lib/unicode_utils/upcase.rb', line 29

def upcase(str, language_id = nil)
  String.new.force_encoding(str.encoding).tap { |res|
    if Impl::LANGS_WITH_RULES.include?(language_id)
      # ensure O(1) lookup by index
      str = str.encode(Encoding::UTF_32LE)
    end
    pos = 0
    str.each_codepoint { |cp|
      special_mapping =
        Impl.conditional_upcase_mapping(cp, str, pos, language_id) ||
        SPECIAL_UPCASE_MAP[cp]
      if special_mapping
        special_mapping.each { |m| res << m }
      else
        res << (SIMPLE_UPCASE_MAP[cp] || cp)
      end
      pos += 1
    }
  }
end

.uppercase_char?(char) ⇒ Boolean

True if the given character has the Unicode property Uppercase.
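
A brief sketch of expected results:

require "unicode_utils/uppercase_char_q"
UnicodeUtils.uppercase_char?("A") # => true
UnicodeUtils.uppercase_char?("a") # => false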

Returns:

  • (Boolean)


# File 'lib/unicode_utils/uppercase_char_q.rb', line 10

def uppercase_char?(char)
  PROP_UPPERCASE_SET.include?(char.ord)
end

.white_space_char?(char) ⇒ Boolean

True if the given character has the Unicode property White_Space.

Example:

require "unicode_utils/general_category"
require "unicode_utils/white_space_char_q"

UnicodeUtils.general_category("\n")   => :Control
UnicodeUtils.white_space_char?("\n")  => true

Returns:

  • (Boolean)


# File 'lib/unicode_utils/white_space_char_q.rb', line 18

def white_space_char?(char)
  WHITE_SPACE_SET.include?(char.ord)
end