Module: UnicodeUtils

Defined in: lib/unicode_utils.rb,
lib/unicode_utils/nfc.rb,
lib/unicode_utils/nfd.rb,
lib/unicode_utils/grep.rb,
lib/unicode_utils/nfkc.rb,
lib/unicode_utils/nfkd.rb,
lib/unicode_utils/upcase.rb,
lib/unicode_utils/version.rb,
lib/unicode_utils/casefold.rb,
lib/unicode_utils/downcase.rb,
lib/unicode_utils/char_name.rb,
lib/unicode_utils/codepoint.rb,
lib/unicode_utils/each_word.rb,
lib/unicode_utils/titlecase.rb,
lib/unicode_utils/read_cdata.rb,
lib/unicode_utils/cased_char_q.rb,
lib/unicode_utils/each_grapheme.rb,
lib/unicode_utils/simple_upcase.rb,
lib/unicode_utils/combining_class.rb,
lib/unicode_utils/jamo_short_name.rb,
lib/unicode_utils/simple_casefold.rb,
lib/unicode_utils/simple_downcase.rb,
lib/unicode_utils/lowercase_char_q.rb,
lib/unicode_utils/titlecase_char_q.rb,
lib/unicode_utils/uppercase_char_q.rb,
lib/unicode_utils/conditional_casing.rb,
lib/unicode_utils/soft_dotted_char_q.rb,
lib/unicode_utils/case_ignorable_char_q.rb,
lib/unicode_utils/canonical_decomposition.rb,
lib/unicode_utils/canonical_equivalents_q.rb,
lib/unicode_utils/compatibility_decomposition.rb,
lib/unicode_utils/hangul_syllable_decomposition.rb

Overview

This version of UnicodeUtils implements algorithms as defined by version 6.0.0 of the Unicode standard. Each public method is declared as a module_function of the UnicodeUtils module and defined in a separate file under the unicode_utils directory.

As a convenience, the top-level unicode_utils file loads all methods (this needs a lot of memory). Also, as a convenience for irb usage, the file unicode_utils/u assigns the UnicodeUtils module to the top-level U constant and loads all methods:

$ irb -r unicode_utils/u
irb(main):001:0> U.grep /angstrom/
=> [#<U+212B "Å" ANGSTROM SIGN utf8:e2,84,ab>]

If a method takes a character as argument (usually named char), that argument can be an integer or a string (in which case the first codepoint counts) or any other object that responds to ord by returning an integer.
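
The source listings further down call ord on such arguments. A minimal sketch of why that works for both integers and strings, using only core Ruby; the helper name codepoint_of is hypothetical and not part of UnicodeUtils:

```ruby
# Hypothetical helper (not part of UnicodeUtils): reduce a "char" argument
# to an integer codepoint.
# Integer#ord returns the integer itself; String#ord returns the codepoint
# of the first character of the string.
def codepoint_of(char)
  char.ord
end

from_integer = codepoint_of(0x212B)      # integer is passed through unchanged
from_string  = codepoint_of("\u{212B}")  # ANGSTROM SIGN as a one-character string
first_only   = codepoint_of("abc")       # only the first codepoint counts
```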

All methods are non-destructive. String return values are in the same encoding as the strings passed as arguments, which must be in one of the Unicode encodings.
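
A minimal sketch of the pattern the source listings below use to keep the result in the argument's encoding: start from an empty string forced to that encoding and append integer codepoints. The method name copy_by_codepoint is hypothetical.

```ruby
# Hypothetical illustration (not part of UnicodeUtils): rebuild a string
# codepoint by codepoint without modifying the argument, keeping its encoding.
def copy_by_codepoint(str)
  res = String.new.force_encoding(str.encoding)
  # String#<< with an Integer appends that codepoint in the receiver's encoding.
  str.each_codepoint { |cp| res << cp }
  res
end

utf16 = "\u{DC}mit".encode(Encoding::UTF_16LE)  # "Ümit" in UTF-16LE
copy  = copy_by_codepoint(utf16)

copy.encoding       # same encoding as the argument
copy.equal?(utf16)  # false: a new string object is returned
```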

High-level methods are:

UnicodeUtils.upcase: full conversion to uppercase
UnicodeUtils.downcase: full conversion to lowercase
UnicodeUtils.titlecase: full conversion to titlecase
UnicodeUtils.casefold: case folding (case insensitive string comparison)
UnicodeUtils.nfd: Normalization Form D
UnicodeUtils.nfc: Normalization Form C
UnicodeUtils.nfkd: Normalization Form KD
UnicodeUtils.nfkc: Normalization Form KC
UnicodeUtils.each_grapheme: grapheme boundaries
UnicodeUtils.each_word: word boundaries
UnicodeUtils.char_name: character names
UnicodeUtils.grep: find codepoints by character name

Defined Under Namespace

Modules: Impl

Classes: Codepoint

Constant Summary

SPECIAL_UPCASE_MAP = :nodoc:

Impl.read_multivalued_map("special_uc_map")

VERSION = Corresponds to the unicode_utils gem version, in the form MAJOR.MINOR.PATCHLEVEL. A backwards-incompatible change causes a change in MAJOR. New features or non-bugfix improvements cause a change in MINOR. Bugfixes increase only PATCHLEVEL. A release always has an even PATCHLEVEL; PATCHLEVEL is odd during development.

"1.1.2"

CASEFOLD_F_MAP = :nodoc:

Impl.read_multivalued_map("casefold_f_map")

SPECIAL_DOWNCASE_MAP = :nodoc:

Impl.read_multivalued_map("special_lc_map")

NAME_MAP = :nodoc:

Impl.read_names("names")

WORD_BREAK_MAP = Maps codepoints to integer codes. For the integer code to property mapping, see #compile_word_break_property in data/compile.rb.

Impl.read_hexdigit_map("word_break_property")

SIMPLE_TITLECASE_MAP = :nodoc:

Impl.read_codepoint_map("simple_tc_map")

SPECIAL_TITLECASE_MAP = :nodoc:

Impl.read_multivalued_map("special_tc_map")

CDATA_DIR = Absolute path to the directory from which UnicodeUtils loads its compiled Unicode data files at runtime.

File.absolute_path(File.join(File.dirname(__FILE__), "..", "..", "cdata"))

GRAPHEME_CLUSTER_BREAK_MAP = Maps codepoints to integer codes. For the integer code to property mapping, see #compile_grapheme_break_property in data/compile.rb.

Impl.read_hexdigit_map("grapheme_break_property")

SIMPLE_UPCASE_MAP = :nodoc:

Impl.read_codepoint_map("simple_uc_map")

COMBINING_CLASS_MAP = :nodoc:

Impl.read_combining_class_map()

JAMO_SHORT_NAME_MAP = :nodoc:

Impl.read_names("jamo_short_names")

CASEFOLD_C_MAP = :nodoc:

Impl.read_codepoint_map("casefold_c_map")

CASEFOLD_S_MAP = :nodoc:

Impl.read_codepoint_map("casefold_s_map")

SIMPLE_DOWNCASE_MAP = :nodoc:

Impl.read_codepoint_map("simple_lc_map")

PROP_LOWERCASE_SET = :nodoc:

Impl.read_codepoint_set("prop_set_lowercase")

TITLECASE_LETTER_SET = :nodoc:

Impl.read_codepoint_set("cat_set_titlecase")

PROP_UPPERCASE_SET = :nodoc:

Impl.read_codepoint_set("prop_set_uppercase")

SOFT_DOTTED_SET = :nodoc:

Impl.read_codepoint_set("soft_dotted_set")

CASE_IGNORABLE_SET = :nodoc:

Impl.read_codepoint_set("case_ignorable_set")

CANONICAL_DECOMPOSITION_MAP =

Impl.read_multivalued_map("canonical_decomposition_map")

COMPATIBILITY_DECOMPOSITION_MAP =

Impl.read_multivalued_map("compatibility_decomposition_map")

Class Method Summary

.canonical_decomposition(str) ⇒ Object

Get the canonical decomposition of the given string, also called Normalization Form D, or NFD for short.
.canonical_equivalents?(a, b) ⇒ Boolean

The strings a and b are canonical equivalents if their canonical decompositions are equal.
.case_ignorable_char?(char) ⇒ Boolean

Returns true if the given character is case-ignorable as defined by Unicode 5.0, section 3.13.
.cased_char?(char) ⇒ Boolean

A cased char is a character that has the Unicode property Lowercase or Uppercase or the general category Titlecase_Letter.
.casefold(str) ⇒ Object

Perform full case folding.
.char_name(char) ⇒ Object

Get the normative Unicode name of the given character.
.combining_class(char) ⇒ Object

Get the combining class of the given character as an integer in the range 0..255.
.compatibility_decomposition(str) ⇒ Object

Get the compatibility decomposition of the given string, also called Normalization Form KD, or NFKD for short.
.downcase(str, language_id = nil) ⇒ Object

Perform a full case-conversion of str to lowercase according to the Unicode standard.
.each_grapheme(str) {|grapheme| ... } ⇒ Object

Iterate over the grapheme clusters that make up str.
.each_word(str) {|word| ... } ⇒ Object

Split str along word boundaries according to Unicode’s Default Word Boundary Specification, calling the given block with each word.
.grep(regexp) ⇒ Object

Get an array of all Codepoint instances in Codepoint::RANGE whose name matches regexp.
.hangul_syllable_decomposition(char) ⇒ Object

Derives the canonical decomposition of the given Hangul syllable.
.jamo_short_name(char) ⇒ Object

The Jamo Short Name property of the given character (defaults to nil).
.lowercase_char?(char) ⇒ Boolean

True if the given character has the Unicode property Lowercase.
.nfc(str) ⇒ Object

Get str in Normalization Form C.
.nfd(str) ⇒ Object

Get str in Normalization Form D.
.nfkc(str) ⇒ Object

Get str in Normalization Form KC.
.nfkd(str) ⇒ Object

Get str in Normalization Form KD.
.simple_casefold(str) ⇒ Object

Perform simple case folding.
.simple_downcase(str) ⇒ Object

Map each codepoint in str that has a single codepoint lowercase-mapping to that lowercase mapping.
.simple_upcase(str) ⇒ Object

Map each codepoint in str that has a single codepoint uppercase-mapping to that uppercase mapping.
.soft_dotted_char?(char) ⇒ Boolean

Returns true if the given character has the Unicode property Soft_Dotted.
.titlecase(str, language_id = nil) ⇒ Object

Convert the first cased character after each word boundary to titlecase and all other cased characters to lowercase.
.titlecase_char?(char) ⇒ Boolean

True if the given character has the General_Category Titlecase_Letter (Lt).
.upcase(str, language_id = nil) ⇒ Object

Perform a full case-conversion of str to uppercase according to the Unicode standard.
.uppercase_char?(char) ⇒ Boolean

True if the given character has the Unicode property Uppercase.

Class Method Details

.canonical_decomposition(str) ⇒ `Object`

Get the canonical decomposition of the given string, also called Normalization Form D, or NFD for short.

The Unicode standard has multiple representations for some characters: one as a single codepoint and one or more as a combination of multiple codepoints. This function “decomposes” these characters in str into the latter representation.

Example:

require "unicode_utils/canonical_decomposition"
# LATIN SMALL LETTER A WITH ACUTE => LATIN SMALL LETTER A, COMBINING ACUTE ACCENT
UnicodeUtils.canonical_decomposition("\u{E1}") => "\u{61}\u{301}"

.canonical_equivalents?(a, b) ⇒ `Boolean`

The strings a and b are canonical equivalents if their canonical decompositions are equal.

Example:

require "unicode_utils/canonical_equivalents_q"
UnicodeUtils.canonical_equivalents?("Äste", "A\u{308}ste") => true
UnicodeUtils.canonical_equivalents?("Äste", "Aste") => false

Returns:

(Boolean)

# File 'lib/unicode_utils/canonical_equivalents_q.rb', line 15

def canonical_equivalents?(a, b)
  UnicodeUtils.canonical_decomposition(a) ==
    UnicodeUtils.canonical_decomposition(b)
end
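
For comparison, the same check can be sketched with Ruby's built-in String#unicode_normalize (core Ruby since 2.2, not part of UnicodeUtils). The method name equivalent_by_nfd? is hypothetical, and Ruby's Unicode data version may differ from the 6.0.0 data this gem ships.

```ruby
# Hypothetical stand-in: uses Ruby's own NFD implementation instead of
# UnicodeUtils.canonical_decomposition.
def equivalent_by_nfd?(a, b)
  a.unicode_normalize(:nfd) == b.unicode_normalize(:nfd)
end

precomposed = "\u{C4}ste"    # starts with LATIN CAPITAL LETTER A WITH DIAERESIS
decomposed  = "A\u{308}ste"  # LATIN CAPITAL LETTER A + COMBINING DIAERESIS

same      = equivalent_by_nfd?(precomposed, decomposed)  # two spellings of one text
different = equivalent_by_nfd?(precomposed, "Aste")      # a genuinely different string
```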

.case_ignorable_char?(char) ⇒ `Boolean`

Returns true if the given character is case-ignorable as defined by Unicode 5.0, section 3.13.

Returns:

(Boolean)

# File 'lib/unicode_utils/case_ignorable_char_q.rb', line 11

def case_ignorable_char?(char)
  CASE_IGNORABLE_SET.include?(char.ord)
end

.cased_char?(char) ⇒ `Boolean`

A cased char is a character that has the Unicode property Lowercase or Uppercase or the general category Titlecase_Letter.

See also: lowercase_char?, uppercase_char?, titlecase_char?

Returns:

(Boolean)

# File 'lib/unicode_utils/cased_char_q.rb', line 13

def cased_char?(char)
  lowercase_char?(char) || uppercase_char?(char) || titlecase_char?(char)
end
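
Ruby's regular expressions expose comparable character properties, so the three-way check can be approximated without the gem's data tables. This is only a sketch: cased_by_regexp? is a hypothetical name, and the property data comes from the Ruby build's Unicode version rather than from UnicodeUtils.

```ruby
# Hypothetical approximation (not part of UnicodeUtils): a character counts
# as cased if it is lowercase, uppercase, or a titlecase letter (Lt).
def cased_by_regexp?(char)
  # Accept an integer codepoint or a string (first character counts).
  s = char.is_a?(Integer) ? [char].pack("U") : char[0]
  !!(s =~ /\p{Lower}|\p{Upper}|\p{Lt}/)
end

lower_is_cased = cased_by_regexp?("a")
upper_is_cased = cased_by_regexp?("A")
title_is_cased = cased_by_regexp?(0x01C5)  # LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON (Lt)
digit_is_cased = cased_by_regexp?("1")
```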

.casefold(str) ⇒ `Object`

Perform full case folding. The returned string may be longer than str. The purpose of case folding is case insensitive string comparison.

Examples:

require "unicode_utils/casefold"
UnicodeUtils.casefold("Ümit") == UnicodeUtils.casefold("ümit") => true
UnicodeUtils.casefold("WEISS") == UnicodeUtils.casefold("weiß") => true

# File 'lib/unicode_utils/casefold.rb', line 19

def casefold(str)
  String.new.force_encoding(str.encoding).tap do |res|
    str.each_codepoint { |cp|
      if mapping = CASEFOLD_C_MAP[cp]
        res << mapping
      elsif mapping = CASEFOLD_F_MAP[cp]
        mapping.each { |m| res << m }
      else
        res << cp
      end
    }
  end
end
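
The two-table lookup in the listing above can be tried out with tiny stand-in tables. The constants C_MAP_EXCERPT and F_MAP_EXCERPT and the method casefold_sketch are hypothetical; they cover only a handful of codepoints, whereas the gem reads its full tables from data files.

```ruby
# Illustrative excerpts only (assumptions for this sketch):
# common (one-to-one) foldings for A-Z and U+00DC, and one full
# (one-to-many) folding for U+00DF LATIN SMALL LETTER SHARP S.
C_MAP_EXCERPT = Hash[(0x41..0x5A).map { |cp| [cp, cp + 0x20] }].merge(0xDC => 0xFC)
F_MAP_EXCERPT = { 0xDF => [0x73, 0x73] }

def casefold_sketch(str)
  res = String.new.force_encoding(str.encoding)
  str.each_codepoint do |cp|
    if (mapping = C_MAP_EXCERPT[cp])
      res << mapping                  # single codepoint replacement
    elsif (mapping = F_MAP_EXCERPT[cp])
      mapping.each { |m| res << m }   # may lengthen the string
    else
      res << cp                       # no folding defined in the excerpt
    end
  end
  res
end

folded_upper = casefold_sketch("WEISS")
folded_sharp = casefold_sketch("wei\u{DF}")
```

Because the sharp s folds to two codepoints, the folded string is one character longer than its input, which is why the returned string may be longer than str.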

.char_name(char) ⇒ `Object`

Get the normative Unicode name of the given character.

Private Use codepoints have no name; this function returns nil for them.

All control characters have the special name “<control>”. All other characters have a unique name.

Example:

require "unicode_utils/char_name"
UnicodeUtils.char_name "ᾀ" => "GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI"
UnicodeUtils.char_name "\t" => "<control>"

# File 'lib/unicode_utils/char_name.rb', line 24

def char_name(char)
  # TODO: improve with code point labels, see section 4.8 in Unicode 6.0.0
  if char.kind_of?(Integer)
    cp = char
    str = nil
  else
    cp = char.ord
    str = char
  end
  NAME_MAP[cp] ||
    case cp
    when 0x3400..0x4DB5, 0x4E00..0x9FC3, 0x20000..0x2A6D6, 0x2A700..0x2B734, 0x2B740..0x2B81D
      "CJK UNIFIED IDEOGRAPH-#{sprintf('%04X', cp)}"
    when 0xAC00..0xD7A3
      str ||= cp.chr(Encoding::UTF_8)
      "HANGUL SYLLABLE ".tap do |n|
        hangul_syllable_decomposition(str).each_char { |c|
          n << (jamo_short_name(c) || '')
        }
      end
    end
end

.combining_class(char) ⇒ `Object`

Get the combining class of the given character as an integer in the range 0..255.




# File 'lib/unicode_utils/combining_class.rb', line 12

def combining_class(char)
  COMBINING_CLASS_MAP[char.ord]
end

.compatibility_decomposition(str) ⇒ `Object`

Get the compatibility decomposition of the given string, also called Normalization Form KD, or NFKD for short.

Compatibility decomposition decomposes more codepoints than canonical decomposition and, contrary to Normalization Forms D and C, it can alter how a string is displayed.

Example:

require "unicode_utils/compatibility_decomposition"
# LATIN SMALL LIGATURE FI => LATIN SMALL LETTER F, LATIN SMALL LETTER I
UnicodeUtils.compatibility_decomposition("ﬁ") => "fi"

.downcase(str, language_id = nil) ⇒ `Object`

Perform a full case-conversion of str to lowercase according to the Unicode standard.

Some conversion rules are language dependent; these are in effect when a non-nil language_id is given. If non-nil, the language_id must be a two-letter language code as defined in BCP 47 (tools.ietf.org/rfc/bcp/bcp47.txt), given as a symbol. If a language doesn’t have a two-letter code, the three-letter code is to be used. If locale-independent behaviour is required, nil should be passed explicitly, because a later version of UnicodeUtils may default to something else.

Examples:

require "unicode_utils/downcase"
UnicodeUtils.downcase("ᾈ") => "ᾀ"
UnicodeUtils.downcase("aBI\u{307}", :tr) => "abi"

# File 'lib/unicode_utils/downcase.rb', line 28

def downcase(str, language_id = nil)
  String.new.force_encoding(str.encoding).tap { |res|
    if Impl::LANGS_WITH_RULES.include?(language_id)
      # ensure O(1) lookup by index
      str = str.encode(Encoding::UTF_32LE)
    end
    pos = 0
    str.each_codepoint { |cp|
      special_mapping =
        Impl.conditional_downcase_mapping(cp, str, pos, language_id) ||
        SPECIAL_DOWNCASE_MAP[cp]
      if special_mapping
        special_mapping.each { |m| res << m }
      else
        res << (SIMPLE_DOWNCASE_MAP[cp] || cp)
      end
      pos += 1
    }
  }
end
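
Ruby's built-in String#downcase accepts a :turkic option that applies a comparable language-specific rule. The sketch below uses it only to illustrate why the language matters; it is core Ruby (2.4 or later), not the UnicodeUtils implementation, and edge cases may differ.

```ruby
# Core Ruby, shown for comparison only.
default_lower = "I".downcase           # language-independent mapping
turkic_lower  = "I".downcase(:turkic)  # Turkic mapping: dotless i (U+0131)
```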

.each_grapheme(str) {|grapheme| ... } ⇒ `Object`

Iterate over the grapheme clusters that make up str. A grapheme cluster is a user perceived character (the basic unit of a writing system for a language) and consists of one or more codepoints.

This method uses the default Unicode algorithm for extended grapheme clusters.

Returns an enumerator if no block is given.

Examples:

require "unicode_utils/each_grapheme"
UnicodeUtils.each_grapheme("a\r\nb") { |g| p g }

prints:

"a"
"\r\n"
"b"

and

UnicodeUtils.each_grapheme("a\r\nb").count => 3

Yields:

(grapheme)

# File 'lib/unicode_utils/each_grapheme.rb', line 35

def each_grapheme(str)
  return enum_for(__method__, str) unless block_given?
  c0 = nil
  c0_prop = nil
  grapheme = String.new.force_encoding(str.encoding)
  str.each_codepoint { |c|
    gbreak = false
    c_prop = GRAPHEME_CLUSTER_BREAK_MAP[c]
    
    ### rules ###
    if c0_prop == 0x0 && c_prop == 0x1
      # don't break CR LF
    elsif c0_prop == 0x0 || c0_prop == 0x1 || c0_prop == 0x2
      # break after controls
      gbreak = true
    elsif c_prop == 0x0 || c_prop == 0x1 || c_prop == 0x2
      # break before controls
      gbreak = true
    elsif c0_prop == 0x6 && (c_prop == 0x6 || c_prop == 0x7 ||
                             c_prop == 0x9 || c_prop == 0xA)
      # don't break hangul syllable
    elsif (c0_prop == 0x9 || c0_prop == 0x7) &&
          (c_prop == 0x7 || c_prop == 0x8)
      # don't break hangul syllable
    elsif (c0_prop == 0xA || c0_prop == 0x8) && c_prop == 0x8
      # don't break hangul syllable
    elsif c_prop == 0x3
      # don't break before extending characters
    elsif c_prop == 0x5
      # don't break before SpacingMarks
    elsif c0_prop == 0x4
      # don't break after Prepend characters
    else
      # break everywhere
      gbreak = true
    end
    #############

    if gbreak && !grapheme.empty?
      yield grapheme
      grapheme = String.new.force_encoding(str.encoding)
    end
    grapheme << c
    c0 = c
    c0_prop = c_prop
  }
  yield grapheme unless grapheme.empty?
end
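
Ruby 2.5 and later ship a comparable iterator, String#each_grapheme_cluster. The sketch below uses it (core Ruby, not UnicodeUtils) to show the two behaviours called out in the rules above: CR LF is kept together, and a combining mark stays attached to its base character.

```ruby
# Core Ruby, shown for comparison only.
crlf_clusters = "a\r\nb".each_grapheme_cluster.to_a

# LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT:
# two codepoints, one user-perceived character.
accent_clusters = "e\u{301}".each_grapheme_cluster.to_a
```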

.each_word(str) {|word| ... } ⇒ `Object`

Split str along word boundaries according to Unicode’s Default Word Boundary Specification, calling the given block with each word. Returns str, or an enumerator if no block is given.

Example:

require "unicode_utils/each_word"
UnicodeUtils.each_word("Hello, world!").to_a => ["Hello", ",", " ", "world", "!"]

Yields:

(word)

# File 'lib/unicode_utils/each_word.rb', line 20

def each_word(str)
  return enum_for(__method__, str) unless block_given?
  cs = str.each_codepoint.map { |c| WORD_BREAK_MAP[c] }
  cs << nil << nil # for negative indices
  word = String.new.force_encoding(str.encoding)
  i = 0
  str.each_codepoint { |c|
    word << c
    if Impl.word_break?(cs, i) && !word.empty?
      yield word
      word = String.new.force_encoding(str.encoding)
    end
    i += 1
  }
  yield word unless word.empty?
  str
end
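
The return contract described above (str when a block is given, an enumerator otherwise) comes from the enum_for guard on the first line of the method. A minimal sketch of that pattern follows; ChunkDemo and each_chunk are hypothetical names, and the regular expression is a crude stand-in that does not implement Unicode word boundaries.

```ruby
module ChunkDemo
  module_function

  # Hypothetical segmenter: yields runs of word characters, runs of
  # whitespace, and single other characters. NOT Unicode word breaking.
  def each_chunk(str)
    # Without a block, hand back an enumerator that re-invokes this method.
    return enum_for(__method__, str) unless block_given?
    str.scan(/\w+|\s+|[^\w\s]/) { |chunk| yield chunk }
    str
  end
end

input     = "Hello, world!"
as_array  = ChunkDemo.each_chunk(input).to_a    # enumerator form
collected = []
returned  = ChunkDemo.each_chunk(input) { |c| collected << c }  # block form
```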

.grep(regexp) ⇒ `Object`

Get an array of all Codepoint instances in Codepoint::RANGE whose name matches regexp. Matching is case insensitive.

Example:

require "unicode_utils/grep"
UnicodeUtils.grep(/angstrom/) => [#<U+212B "Å" ANGSTROM SIGN utf8:e2,84,ab>]

# File 'lib/unicode_utils/grep.rb', line 12

def grep(regexp)
  # TODO: enhance behaviour by searching aliases in NameAliases.txt
  unless regexp.casefold?
    regexp = Regexp.new(regexp.source, Regexp::IGNORECASE)
  end
  Codepoint::RANGE.select { |cp|
    regexp =~ UnicodeUtils.char_name(cp)
  }.map { |cp| Codepoint.new(cp) }
end
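
A minimal sketch of the same matching logic over a tiny name table. NAMES_EXCERPT and grep_names are hypothetical; the sketch returns plain integer codepoints rather than Codepoint instances and searches three names instead of the whole range.

```ruby
# Illustrative excerpt only (assumption for this sketch).
NAMES_EXCERPT = {
  0x0041 => "LATIN CAPITAL LETTER A",
  0x00C5 => "LATIN CAPITAL LETTER A WITH RING ABOVE",
  0x212B => "ANGSTROM SIGN"
}

def grep_names(regexp, names = NAMES_EXCERPT)
  # Rebuild the pattern with IGNORECASE unless it is already case insensitive.
  regexp = Regexp.new(regexp.source, Regexp::IGNORECASE) unless regexp.casefold?
  names.select { |_cp, name| regexp =~ name }.keys
end

lowercase_query = grep_names(/angstrom/)
broader_query   = grep_names(/letter a/)
```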

.hangul_syllable_decomposition(char) ⇒ `Object`

Derives the canonical decomposition of the given Hangul syllable.

Example:

require "unicode_utils/hangul_syllable_decomposition"
UnicodeUtils.hangul_syllable_decomposition("\u{d4db}") => "\u{1111}\u{1171}\u{11b6}"

# File 'lib/unicode_utils/hangul_syllable_decomposition.rb', line 11

def hangul_syllable_decomposition(char)
  String.new.force_encoding(char.encoding).tap do |str|
    Impl.append_hangul_syllable_decomposition(str , char.ord)
  end
end
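
The arithmetic behind this decomposition is defined in the Unicode standard (section 3.12, Conjoining Jamo Behavior). Below is a minimal sketch of that arithmetic on integer codepoints; hangul_decompose and the constant names are hypothetical, and this is not the gem's Impl code.

```ruby
# Constants from the Unicode standard's Hangul syllable decomposition.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28
N_COUNT = V_COUNT * T_COUNT  # 588 syllables per leading consonant
S_COUNT = 19 * N_COUNT       # 11172 precomposed syllables in total

# Returns the jamo codepoints [L, V] or [L, V, T] for a syllable codepoint.
def hangul_decompose(cp)
  s_index = cp - S_BASE
  raise ArgumentError, "not a Hangul syllable" unless (0...S_COUNT).cover?(s_index)
  l = L_BASE + s_index / N_COUNT
  v = V_BASE + (s_index % N_COUNT) / T_COUNT
  t = T_BASE + s_index % T_COUNT
  # A trailing index of zero means the syllable has no trailing consonant.
  t == T_BASE ? [l, v] : [l, v, t]
end

three_jamo = hangul_decompose(0xD4DB)
two_jamo   = hangul_decompose(0xAC00)
```

For U+D4DB the syllable index is 10459, which yields the same three jamo as the example above.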

.jamo_short_name(char) ⇒ `Object`

The Jamo Short Name property of the given character (defaults to nil).

Example:

require "unicode_utils/jamo_short_name"
UnicodeUtils.jamo_short_name("\u{1101}") => "GG"




# File 'lib/unicode_utils/jamo_short_name.rb', line 16

def jamo_short_name(char)
  JAMO_SHORT_NAME_MAP[char.ord]
end

.lowercase_char?(char) ⇒ `Boolean`

True if the given character has the Unicode property Lowercase.

Returns:

(Boolean)

# File 'lib/unicode_utils/lowercase_char_q.rb', line 10

def lowercase_char?(char)
  PROP_LOWERCASE_SET.include?(char.ord)
end

.nfc(str) ⇒ `Object`

Get str in Normalization Form C.

The Unicode standard has multiple representations for some characters: one as a single codepoint and one or more as a combination of multiple codepoints. This function “composes” these characters into the former representation.

Example:

require "unicode_utils/nfc"
UnicodeUtils.nfc("La\u{308}mpchen") => "Lämpchen"

# File 'lib/unicode_utils/nfc.rb', line 136

def nfc(str)
  str = UnicodeUtils.canonical_decomposition(str)
  Impl.composition(str)
end
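
As a cross-check, Ruby's built-in String#unicode_normalize (core Ruby, not UnicodeUtils) produces the same composed form for the example string. The sketch also shows that composing and then decomposing again round-trips to the decomposed input.

```ruby
# Core Ruby, shown for comparison only.
decomposed = "La\u{308}mpchen"                     # "a" + COMBINING DIAERESIS
composed   = decomposed.unicode_normalize(:nfc)    # single codepoint U+00E4
round_trip = composed.unicode_normalize(:nfd)      # back to two codepoints
```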

.nfd(str) ⇒ `Object`

Get str in Normalization Form D.

Alias for UnicodeUtils.canonical_decomposition.




# File 'lib/unicode_utils/nfd.rb', line 10

def nfd(str)
  UnicodeUtils.canonical_decomposition(str)
end

.nfkc(str) ⇒ `Object`

Get str in Normalization Form KC.

Normalization Form KC is compatibility decomposition (NFKD) followed by composition. Like NFKD, this normalization can alter how a string is displayed.

Example:

require "unicode_utils/nfkc"
# LATIN SMALL LIGATURE FI => LATIN SMALL LETTER F, LATIN SMALL LETTER I
UnicodeUtils.nfkc("ﬁ") => "fi"

See also: UnicodeUtils.compatibility_decomposition

# File 'lib/unicode_utils/nfkc.rb', line 21

def nfkc(str)
  str = UnicodeUtils.compatibility_decomposition(str)
  Impl.composition(str)
end
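
The difference between the canonical and compatibility forms can be seen with the ligature from the example. The sketch uses Ruby's built-in String#unicode_normalize (core Ruby, not UnicodeUtils): NFC leaves the ligature alone, NFKC replaces it with two letters.

```ruby
# Core Ruby, shown for comparison only.
ligature = "\u{FB01}"                         # LATIN SMALL LIGATURE FI
canonical     = ligature.unicode_normalize(:nfc)   # no canonical decomposition exists
compatibility = ligature.unicode_normalize(:nfkc)  # replaced by "f" and "i"
```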

.nfkd(str) ⇒ `Object`

Get str in Normalization Form KD.

Alias for UnicodeUtils.compatibility_decomposition.




# File 'lib/unicode_utils/nfkd.rb', line 10

def nfkd(str)
  UnicodeUtils.compatibility_decomposition(str)
end

.simple_casefold(str) ⇒ `Object`

Perform simple case folding. Contrary to full case folding, this uses only one-to-one mappings, so that the length of the returned string is equal to the length of str.

The purpose of case folding is case insensitive string comparison.

Examples:

require "unicode_utils/simple_casefold"
UnicodeUtils.simple_casefold("Ümit") == UnicodeUtils.simple_casefold("ümit") => true
UnicodeUtils.simple_casefold("WEISS") == UnicodeUtils.simple_casefold("weiß") => false

.simple_downcase(str) ⇒ `Object`

Map each codepoint in str that has a single codepoint lowercase-mapping to that lowercase mapping. The returned string has the same length as the original string.

This function is locale independent.

Examples:

require "unicode_utils/simple_downcase"
UnicodeUtils.simple_downcase("ÜMIT: 123") => "ümit: 123"
UnicodeUtils.simple_downcase("STRASSE") => "strasse"

# File 'lib/unicode_utils/simple_downcase.rb', line 20

def simple_downcase(str)
  String.new.force_encoding(str.encoding).tap { |res|
    str.each_codepoint { |cp|
      res << (SIMPLE_DOWNCASE_MAP[cp] || cp)
    }
  }
end

.simple_upcase(str) ⇒ `Object`

Map each codepoint in str that has a single codepoint uppercase-mapping to that uppercase mapping. The returned string has the same length as the original string.

This function is locale independent.

Examples:

require "unicode_utils/simple_upcase"
UnicodeUtils.simple_upcase("ümit: 123") => "ÜMIT: 123"
UnicodeUtils.simple_upcase("weiß") => "WEIß"

# File 'lib/unicode_utils/simple_upcase.rb', line 20

def simple_upcase(str)
  String.new.force_encoding(str.encoding).tap { |res|
    str.each_codepoint { |cp|
      res << (SIMPLE_UPCASE_MAP[cp] || cp)
    }
  }
end
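
A minimal sketch of why the simple mapping preserves length, using a tiny stand-in table. SIMPLE_UP_EXCERPT and simple_upcase_sketch are hypothetical; the last line uses core Ruby's String#upcase (2.4 or later), which applies full mappings, for contrast.

```ruby
# Illustrative excerpt only (assumption for this sketch): a-z and U+00FC.
# U+00DF (sharp s) is absent because it has no single-codepoint uppercase.
SIMPLE_UP_EXCERPT = Hash[(0x61..0x7A).map { |cp| [cp, cp - 0x20] }].merge(0xFC => 0xDC)

def simple_upcase_sketch(str)
  res = String.new.force_encoding(str.encoding)
  # Exactly one codepoint is appended per input codepoint.
  str.each_codepoint { |cp| res << (SIMPLE_UP_EXCERPT[cp] || cp) }
  res
end

simple = simple_upcase_sketch("wei\u{DF}")  # sharp s is left untouched
full   = "wei\u{DF}".upcase                 # core Ruby full mapping, for contrast
```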

.soft_dotted_char?(char) ⇒ `Boolean`

Returns true if the given character has the Unicode property Soft_Dotted.

Returns:

(Boolean)

# File 'lib/unicode_utils/soft_dotted_char_q.rb', line 11

def soft_dotted_char?(char)
  SOFT_DOTTED_SET.include?(char.ord)
end

.titlecase(str, language_id = nil) ⇒ `Object`

Convert the first cased character after each word boundary to titlecase and all other cased characters to lowercase. For many, but not all characters, the titlecase mapping is the same as the uppercase mapping.

Example:

require "unicode_utils/titlecase"
UnicodeUtils.titlecase("hello, world!") => "Hello, World!"

# File 'lib/unicode_utils/titlecase.rb', line 32

def titlecase(str, language_id = nil)
  String.new.force_encoding(str.encoding).tap do |res|
    # ensure O(1) lookup by index
    str = str.encode(Encoding::UTF_32LE)
    i = 0
    each_word(str) { |word|
      cased_char_found = false
      word.each_codepoint { |cp|
        cased = cased_char?(cp)
        if !cased_char_found && cased
          cased_char_found = true
          special_mapping =
            Impl.conditional_titlecase_mapping(cp, str, i, language_id) ||
            SPECIAL_TITLECASE_MAP[cp]
          if special_mapping
            special_mapping.each { |m| res << m }
          else
            res << (SIMPLE_TITLECASE_MAP[cp] || cp)
          end
        elsif cased
          special_mapping =
            Impl.conditional_downcase_mapping(cp, str, i, language_id) ||
            SPECIAL_DOWNCASE_MAP[cp]
          if special_mapping
            special_mapping.each { |m| res << m }
          else
            res << (SIMPLE_DOWNCASE_MAP[cp] || cp)
          end
        else
          res << cp
        end
        i += 1
      }
    }
  end
end
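
The overall shape of the loop above (first cased character of each word converted, the rest lowercased, everything else copied) can be approximated in a few lines of core Ruby. This is a rough sketch only: titlecase_approx is a hypothetical name, a run of letters stands in for a Unicode word, and language-specific rules are ignored.

```ruby
# Hypothetical approximation (not part of UnicodeUtils).
def titlecase_approx(str)
  # String#capitalize converts the first character and lowercases the rest;
  # characters outside letter runs are left as they are.
  str.gsub(/\p{L}+/) { |word| word.capitalize }
end

from_lower = titlecase_approx("hello, world!")
from_mixed = titlecase_approx("hELLO, WORLD!")
```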

.titlecase_char?(char) ⇒ `Boolean`

True if the given character has the General_Category Titlecase_Letter (Lt).

Returns:

(Boolean)

# File 'lib/unicode_utils/titlecase_char_q.rb', line 11

def titlecase_char?(char)
  TITLECASE_LETTER_SET.include?(char.ord)
end

.upcase(str, language_id = nil) ⇒ `Object`

Perform a full case-conversion of str to uppercase according to the Unicode standard.

Examples:

require "unicode_utils/upcase"
UnicodeUtils.upcase("weiß") => "WEISS"
UnicodeUtils.upcase("i", :en) => "I"
UnicodeUtils.upcase("i", :tr) => "İ"

# File 'lib/unicode_utils/upcase.rb', line 29

def upcase(str, language_id = nil)
  String.new.force_encoding(str.encoding).tap { |res|
    if Impl::LANGS_WITH_RULES.include?(language_id)
      # ensure O(1) lookup by index
      str = str.encode(Encoding::UTF_32LE)
    end
    pos = 0
    str.each_codepoint { |cp|
      special_mapping =
        Impl.conditional_upcase_mapping(cp, str, pos, language_id) ||
        SPECIAL_UPCASE_MAP[cp]
      if special_mapping
        special_mapping.each { |m| res << m }
      else
        res << (SIMPLE_UPCASE_MAP[cp] || cp)
      end
      pos += 1
    }
  }
end

.uppercase_char?(char) ⇒ `Boolean`

True if the given character has the Unicode property Uppercase.

Returns:

(Boolean)

# File 'lib/unicode_utils/uppercase_char_q.rb', line 10

def uppercase_char?(char)
  PROP_UPPERCASE_SET.include?(char.ord)
end
