CharacterSet

Gem Version Build Status

A gem to build, read, write and compare sets of Unicode codepoints.

Many parts can be used independently, e.g.:

  • CharacterSet::Character
  • CharacterSet::Parser
  • CharacterSet::Writer
  • RangeCompressor

Usage

Parse/Initialize

These all produce a CharacterSet containing a, b and c:

CharacterSet['a', 'b', 'c']
CharacterSet[97, 98, 99]
CharacterSet.new('a'..'c')
CharacterSet.new(0x61..0x63)
CharacterSet.of('abacababa')
CharacterSet.parse('[a-c]')
CharacterSet.parse('\U00000061-\U00000063')

If the gems regexp_parser and regexp_property_values are installed, ::of_regexp and ::of_property can also be used. ::of_regexp can handle intersections, negations, and set nesting:

# are there any non-digit ascii chars classified as emoji?
set = CharacterSet.of_regexp(/[\D&&[:ascii:]&&\p{emoji}]/)

# ... of course there are!
set.to_a(stringify: true) # => ["#", "*"]

# with the core extension:
require 'character_set/core_ext/regexp_ext'
/[a-e&&[^c]]/.character_set # => CharacterSet['a', 'b', 'd', 'e']

Common utility sets

CharacterSet.ascii
CharacterSet.bmp
CharacterSet.crypt
CharacterSet.emoji
CharacterSet.newline
CharacterSet.unicode
CharacterSet.url_fragment
CharacterSet.url_host
CharacterSet.url_path
CharacterSet.url_query
CharacterSet.whitespace

# e.g.
CharacterSet.url_query.cover?('?a=(b$c;)') # => true
CharacterSet.emoji.sample(5) # => ["⛷", "👈", "🌞", "♑", "⛈"]

# all can be prefixed with `non_`, e.g.
CharacterSet.non_ascii.delete_in(string)

Interact with Strings

CharacterSet can replace some Regexp actions on Strings, at better speed (see benchmarks).

#used_by? and #cover? can replace some Regexp#match? calls:

CharacterSet.ascii.used_by?('Tüür') # => true
CharacterSet.ascii.cover?('Tüür') # => false
CharacterSet.ascii.cover?('Tr') # => true

#delete_in(!) and #keep_in(!) can replace String#gsub(!) and the like:

string = 'Tüür'

CharacterSet.ascii.delete_in(string) # => 'üü'
CharacterSet.ascii.keep_in(string) # => 'Tr'
string # => 'Tüür'

CharacterSet.ascii.delete_in!(string) # => 'üü'
string # => 'üü'
CharacterSet.ascii.keep_in!(string) # => ''
string # => ''

There is also a core extension for String interaction.

require 'character_set/core_ext/string_ext'

"a\rb".character_set & CharacterSet.newline # => CharacterSet["\r"]
"a\rb".uses_character_set?(CharacterSet.emoji) # => false
"a\rb".covered_by_character_set?(CharacterSet.newline) # => false
"a\rb".delete_character_set(CharacterSet.newline) # => 'ab'
# etc.

Manipulate

Use any Ruby Set method, e.g. #+, #-, #&, #^, #intersect?, #<, #> etc. to interact with other sets. Use #add, #delete, #include? etc. to change or check for members.

Where appropriate, methods take both chars and codepoints, e.g.:

CharacterSet['a'].add('b') # => CharacterSet['a', 'b']
CharacterSet['a'].add(98) # => CharacterSet['a', 'b']
CharacterSet['a'].include?('a') # => true
CharacterSet['a'].include?(0x61) # => true

#inversion can be used to create a CharacterSet with all valid Unicode codepoints that are not in the current set:

non_a = CharacterSet['a'].inversion
# => #<CharacterSet (size: 1112063)>

non_a.include?('a') # => false
non_a.include?('ü') # => true

# surrogate pair halves are not included by default
CharacterSet['a'].inversion(include_surrogates: true)
# => #<CharacterSet (size: 1114111)>

#case_insensitive can be used to create a CharacterSet where upper/lower case codepoints are supplemented:

CharacterSet['1', 'a'].case_insensitive # => CharacterSet['1', 'A', 'a']

Write

set = CharacterSet['a', 'b', 'c', 'j', '-']

# safely printable ASCII chars are not escaped by default
set.to_s # => 'a-cj\x2D'
set.to_s(escape_all: true) # => '\x61-\x63\x6A\x2D'

# brackets may be added
set.to_s(in_brackets: true) # => '[a-cj\x2D]'

# the default escape format is Ruby/ES6 compatible, others are available
set = CharacterSet['a', 'b', 'c', 'ɘ', '🤩']
set.to_s # => 'a-c\u0258\u{1F929}'
set.to_s(format: 'U+') # => 'a-cU+0258U+1F929'
set.to_s(format: 'Python') # => "a-c\u0258\U0001F929"
set.to_s(format: 'raw') # => 'a-cɘ🤩'

# or pass a block
set.to_s { |char| "[#{char.codepoint}]" } # => "a-c[600][129321]"
set.to_s(escape_all: true) { |c| "<#{c.hex}>" } # => "<61>-<63><258><1F929>"

# disable abbreviation (grouping of codepoints in ranges)
set.to_s(abbreviate: false) # => "abc\u0258\u{1F929}"

# for full js regex compatibility in case of astral members:
set.to_s_with_surrogate_alternation # => '(?:[\u0258]|\ud83e\udd29)'

Unicode plane methods

There are some methods to check for planes and to handle BMP and astral parts:

CharacterSet['a', 'ü', '🤩'].bmp_part # => CharacterSet['a', 'ü']
CharacterSet['a', 'ü', '🤩'].astral_part # => CharacterSet['🤩']
CharacterSet['a', 'ü', '🤩'].bmp_ratio # => 0.6666666
CharacterSet['a', 'ü', '🤩'].planes # => [0, 1]
CharacterSet['a', 'ü', '🤩'].member_in_plane?(7) # => false
CharacterSet::Character.new('a').plane # => 0

Contributions

Feel free to send suggestions, point out issues, or submit pull requests.