Unicode::Categories

Returns the list of General Categories that the characters of a Unicode string belong to.

Unicode version: 15.0.0 (September 2022)

Supported Rubies: 3.1, 3.0, 2.7

Old Rubies that might still work: 2.6, 2.5, 2.4, 2.3, 2.2, 2.1, 2.0


Add to your Gemfile:

```ruby
gem "unicode-categories"
```


```ruby
require "unicode/categories"

# All general categories of a string
Unicode::Categories.categories("A 2") # => ["Lu", "Nd", "Zs"]
Unicode::Categories.categories("A 2", format: :long) # => ["Decimal_Number", "Space_Separator", "Uppercase_Letter"]

# Also aliased as .of
Unicode::Categories.of("\u{10c50}") # => ["Cn"]

# Single character
Unicode::Categories.category("☼", format: :long) # => "Other_Symbol"
```

The list of categories is always sorted alphabetically.


Regex Matching

If you have a string and want to match a substring or character that belongs to a specific General Category, you do not need this gem. Instead, you can use Ruby's Regexp Unicode property syntax \p{}:

```ruby
"Find decimal numbers (like 2 or 3) within a string".scan(/\p{Nd}+/) # => ["2", "3"]
```

See Idiosyncratic Ruby: Proper Unicoding for more info.
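
The same syntax covers a few other common tasks. The following is a minimal sketch that uses only Ruby's built-in Regexp engine and a made-up sample string; the variable names are illustrative. Inside \p{}, Ruby accepts both the short names (such as Lu) and the long names (such as Uppercase_Letter) of the General Categories.

```ruby
# Hypothetical sample input; "\u263C" is "☼", an Other_Symbol (So)
text = "Hello, World! 42 \u263C"

uppercase = text.scan(/\p{Lu}/)                # => ["H", "W"]
long_name = text.scan(/\p{Uppercase_Letter}/)  # long name, same result as above
numbers   = text.scan(/\p{Nd}+/)               # => ["42"]
stripped  = text.gsub(/\p{Po}/, "")            # => "Hello World 42 ☼"
has_sym   = text.match?(/\p{So}/)              # => true
```

Use the gem when you need the reverse direction, that is, when you have a string and want to know which categories occur in it.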

List of General Categories

You can retrieve a list of all General Categories like this:

```ruby
require "unicode/categories"

# Print all General Categories as a two-column table
puts \
  "Short | Long\n" +
  "------|-----\n" +
  Unicode::Categories.names(format: :table).to_a.map{ |r| "%-5s | %s" % r }.join("\n")
```

Short | Long
------|-----
Cc    | Control
Cf    | Format
Cn    | Unassigned
Co    | Private_Use
Cs    | Surrogate
LC    | Cased_Letter
Ll    | Lowercase_Letter
Lm    | Modifier_Letter
Lo    | Other_Letter
Lt    | Titlecase_Letter
Lu    | Uppercase_Letter
Mc    | Spacing_Mark
Me    | Enclosing_Mark
Mn    | Nonspacing_Mark
Nd    | Decimal_Number
Nl    | Letter_Number
No    | Other_Number
Pc    | Connector_Punctuation
Pd    | Dash_Punctuation
Pe    | Close_Punctuation
Pf    | Final_Punctuation
Pi    | Initial_Punctuation
Po    | Other_Punctuation
Ps    | Open_Punctuation
Sc    | Currency_Symbol
Sk    | Modifier_Symbol
Sm    | Math_Symbol
So    | Other_Symbol
Zl    | Line_Separator
Zp    | Paragraph_Separator
Zs    | Space_Separator
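
As a rough illustration of how the rows of the table relate to the \p{} syntax, the sketch below looks up the category of a single character by trying a few rows as regexp properties. `SAMPLE_CATEGORIES` and `rough_category` are hypothetical names, only a handful of rows are included, and this is not how the gem itself is implemented. For complete lookups, use Unicode::Categories.category.

```ruby
# A hand-picked excerpt of the table above (short name => long name).
SAMPLE_CATEGORIES = {
  "Lu" => "Uppercase_Letter",
  "Ll" => "Lowercase_Letter",
  "Nd" => "Decimal_Number",
  "Po" => "Other_Punctuation",
  "So" => "Other_Symbol",
  "Zs" => "Space_Separator",
}.freeze

# Returns the category of the first character of `char`, or nil if that
# character falls outside the excerpt. The categories listed here are
# mutually exclusive, so at most one entry can match.
def rough_category(char, format: :short)
  short, long = SAMPLE_CATEGORIES.find { |name, _| char.match?(/\A\p{#{name}}/) }
  format == :long ? long : short
end

rough_category("A")                      # => "Lu"
rough_category("\u263C", format: :long)  # => "Other_Symbol"
rough_category("\n")                     # => nil (Cc is not in the excerpt)
```

Interpolating the short name into the regexp keeps the excerpt as the single source of truth, at the cost of building a regexp on every call.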

See unicode-x for more Unicode-related micro libraries.

MIT License

  • Copyright (C) 2016-2022 Jan Lelis https://janlelis.com. Released under the MIT license.
  • Unicode data: https://www.unicode.org/copyright.html#Exhibit1