Command to print out Unicode characters that satisfy expressions for Ruby Regexp

Summary

This package contains a standalone Ruby executable source file, bin/ruby_unicode_prop, a command to be run from a terminal. It prints to STDOUT the Unicode characters and/or their hexadecimal codepoints that satisfy one or more given expressions as defined for Regexp in Ruby. Specifically, these are \p{XXX}-type expressions (e.g., \p{Katakana}) for Unicode properties, as well as [[:blank:]]-type expressions for POSIX bracket expressions.
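For readers unfamiliar with the two kinds of expressions, the following minimal sketch shows how each one is used in a plain Ruby Regexp. The helper names katakana? and blank? are hypothetical, chosen for illustration only, and are not part of this package.

```ruby
# Hypothetical helpers for illustration; not part of ruby_unicode_prop.

# True if the string contains a character with the Unicode property Katakana.
def katakana?(str)
  !!(str =~ /\p{Katakana}/)
end

# True if the string contains a character in the POSIX bracket class blank.
def blank?(str)
  !!(str =~ /[[:blank:]]/)
end

puts katakana?('カ')   # a Katakana letter
puts katakana?('a')    # a Latin letter
puts blank?("\t")      # a tab
puts blank?('a')
```

The command applies the same kind of test to every character in its search range, rather than to a single string.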

Some supplementary files are found in the top and test directories, none of which is essential to run the command.

Description

The help message, which covers all the basics, can be viewed with the -h (or --help) option:

% /YOUR/INSTALLED/PATH/ruby_unicode_prop -h
USAGE: ruby_unicode_prop [options] Property1 [Property2, ...]
  Print all the characters and/or their hex-codepoints that have
  the given "Unicode property" used in Ruby Regexp like \p{Currency_Symbol}
  (or POSIX expression like [[:blank:]] if -p option is given).

Options:
    -c, --[no-]without-codepoint     Print characters only? (Def: false)
    -n, --[no-]without-char          Print codepoints only? (Def: false)
    -d, --delimiter=CHAR             Delimeter in output.
    -l, --[no-]lowercase             Lower cases alphabets are used for Hex in codepoints (Def: false)
    -p, --[no-]posix                 Use POSIX expression instead of Unicode (Def: false)
        --[no-]list-property         Print all the Ruby Unicode properties and exit.

Note1: Delimeter means one
  (1) between multiple characters and codepoints if either of -n or -c is specified
      (Default: Null for -c (characters only) and a new line for -n.
  (2) between the number and character of each pair if both are specified
      (Def: a whitespace), whereas the delimeter between pairs is always a newline.
  To specify a newline as a delimiter, give 'NL'
Note2: Properties differ for '-p', 'ascii' in POSIX and 'ASCII' in Unicode.
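The delimiter rules in Note1 can be sketched in Ruby as follows. This is an illustration written from the help text above, not the command's actual code; the method name format_pairs and its keyword arguments are hypothetical.

```ruby
# Hypothetical sketch of the output rules described in Note1.
# pairs is an Array of [codepoint, character] pairs.
def format_pairs(pairs, chars_only: false, codepoints_only: false,
                 delimiter: nil, lowercase: false)
  delimiter = "\n" if delimiter == 'NL'   # 'NL' stands for a newline
  hex = lowercase ? '%04x' : '%04X'       # corresponds to the -l option

  if chars_only                           # -c: default delimiter is null
    pairs.map { |_cp, ch| ch }.join(delimiter || '')
  elsif codepoints_only                   # -n: default delimiter is a newline
    pairs.map { |cp, _ch| format(hex, cp) }.join(delimiter || "\n")
  else                                    # pairs are always separated by newlines
    sep = delimiter || ' '
    pairs.map { |cp, ch| format(hex, cp) + sep + ch }.join("\n")
  end
end

pairs = [[0x391, 'Α'], [0x392, 'Β']]
puts format_pairs(pairs)
puts format_pairs(pairs, chars_only: true)
puts format_pairs(pairs, codepoints_only: true, lowercase: true)
```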

The reference file (used by the --list-property option) is dynamically retrieved from github.com/k-takata/Onigmo/blob/master/doc/UnicodeProps.txt. The definition file in the Ruby source tree is at /enc/unicode/name2ctype.h

Limitations

The output of this command is generated by the Ruby that runs it, and hence is fully consistent with the Regexp matching results for the same property names in any application run with the same Ruby. That also means the output can depend on the version of Ruby you run, because the Unicode table has expanded over the years (for example, with emojis) and will keep doing so.

Currently, the searches by this command are limited to codepoints up to the end of the second Supplementary Plane (Supplementary Ideographic Plane), which should be enough in most practical cases now in 2019 and probably for some time to come.

In fact, in many practical cases, searching over only the Basic Multilingual Plane (up to 0xFFFF) is probably sufficient, though the second Supplementary Plane does include groups of CJK characters, some of which are still occasionally in use today. The maximum codepoint to search for is defined in the constant MAX_UNICODE_HEX near the beginning of the source code. Setting it to a lower value can speed up the processing considerably.
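The effect of the upper limit can be illustrated with a minimal sketch of such a search. This is not the command's actual implementation: the method name chars_matching is hypothetical, and the default limit of 0x2FFFF is an assumed value for the end of the second Supplementary Plane, not a value taken from the source code.

```ruby
# Hypothetical sketch; not the actual code of ruby_unicode_prop.
# Returns [codepoint, character] pairs, in codepoint order, for every
# character up to max that has the given Unicode property.
def chars_matching(prop, max: 0x2FFFF)   # 0x2FFFF is an assumed default
  re = Regexp.new("\\p{#{prop}}")
  result = []
  (0..max).each do |cp|
    # Surrogate codepoints cannot be encoded as UTF-8 characters.
    next if (0xD800..0xDFFF).cover?(cp)
    ch = cp.chr(Encoding::UTF_8)
    result << [cp, ch] if ch =~ re
  end
  result
end

# A low limit keeps the loop short; the cost grows with max.
chars_matching('Greek', max: 0x3FF).first(3).each do |cp, ch|
  puts format('%04X %s', cp, ch)
end
```

Because every codepoint up to the limit is tested one by one, the running time is roughly proportional to the limit.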

Examples

A typical example is as follows:

% bin/ruby_unicode_prop Greek
0370 Ͱ
0371 ͱ
0372 Ͳ
……(snipped)……
0391 Α
0392 Β
0393 Γ
0394 Δ
……(snipped)

POSIX (bracket) expressions are also supported, with the -p option:

% bin/ruby_unicode_prop -p -d '___ ' punct
0021___ !
0022___ "
0023___ #
0024___ $
0025___ %
0026___ &
0027___ '
0028___ (
0029___ )
002A___ *
002B___ +
……(snipped)

Note that the corresponding property name for the Unicode \p{} expression (a backslash followed by p and curly brackets) is Punct; it is capitalized, unlike the POSIX expression name.
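In plain Ruby, the two forms look as follows. This is a minimal sketch with hypothetical helper names; the two expressions are not guaranteed to match exactly the same set of characters in every Ruby version, so only a character that is punctuation under both is shown.

```ruby
# Hypothetical helpers for illustration; not part of ruby_unicode_prop.

# POSIX bracket expression: the class name is lowercase.
def posix_punct?(str)
  !!(str =~ /[[:punct:]]/)
end

# Unicode property expression, written as in the note above.
def unicode_punct?(str)
  !!(str =~ /\p{Punct}/)
end

puts posix_punct?('!')
puts unicode_punct?('!')
puts posix_punct?('a')
puts unicode_punct?('a')
```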

You can also specify multiple properties. The order of the arguments does not matter, and the result is always in the order of the codepoints. No duplicates are produced, even if some of the specified properties cover overlapping ranges of characters. For example:

% bin/ruby_unicode_prop -c Number Terminal_Punctuation
!,.0123456789:;?²³¹¼½¾;……(snipped)
% bin/ruby_unicode_prop -c Number Terminal_Punctuation Close_Punctuation
!),.0123456789:;?]}²³¹¼½¾;……(snipped)
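One way to obtain this behaviour (codepoint order, no duplicates, independent of the argument order) is to test each character once against a single character class that combines all the properties. The sketch below makes that assumption; the method name chars_with_any is hypothetical, and the actual command may be implemented differently. The search is limited to ASCII here to keep the example short.

```ruby
# Hypothetical sketch; not the actual code of ruby_unicode_prop.
# Returns a String of all characters up to max that have at least one
# of the given Unicode properties, in codepoint order.
def chars_with_any(props, max: 0x7F)
  # Builds e.g. /[\p{Number}\p{Terminal_Punctuation}]/
  re = Regexp.new('[' + props.map { |prop| "\\p{#{prop}}" }.join + ']')
  (0..max)
    .reject { |cp| (0xD800..0xDFFF).cover?(cp) }   # skip surrogates
    .map    { |cp| cp.chr(Encoding::UTF_8) }
    .select { |ch| ch =~ re }
    .join
end

puts chars_with_any(%w[Number Terminal_Punctuation])
puts chars_with_any(%w[Terminal_Punctuation Number])
puts chars_with_any(%w[Number Terminal_Punctuation Close_Punctuation])
```

Since each character is visited exactly once in ascending order, neither sorting nor removal of duplicates is needed afterwards.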

Install

This script requires Ruby Version 2.0 or above.

If you install it as the standard Ruby Gem package, the executable bin/ruby_unicode_prop should be located automatically in your command-line search path.

If not, place (copy) it in any directory in your command-line search path. It is a self-contained single file and does not need any external library other than the standard library that comes by default with Ruby 2.0.

You may need to modify the first line (Shebang line) of the script to suit your environment (it should be unnecessary for Linux and macOS), or run it explicitly with your Ruby command as

Prompt% /YOUR/ENV/ruby /YOUR/INSTALLED/ruby_unicode_prop

Developer’s note

The master of this README file as well as the entire package is found in RubyGems/ruby_unicode_prop

The source code is also maintained on GitHub

Tests

The Ruby files under the directory test/ are the test scripts. You can run them from the top directory with ruby test/test_****.rb, or simply run make test.

Known bugs and Todo items

None.

Copyright

Author: Masa Sakano < info a_t wisebabel dot com >
Versions: The versions of this package follow Semantic Versioning (2.0.0) semver.org/
License: MIT