script_detector

This is a simple utility library for detecting which CJK script a string is written in. It provides three core methods that extend String:

chinese?

Returns true if the string contains Chinese characters and no Japanese or Korean characters

japanese?

Returns true if the string contains specifically Japanese (hiragana or katakana) characters

korean?

Returns true if the string contains specifically Korean (hangul) characters

Once a string has been identified as Chinese, two further methods are provided for determining the script subtype:

traditional_chinese?

Returns true if the string contains traditional Chinese characters (繁體字)

simplified_chinese?

Returns true if the string contains simplified Chinese characters (简体字)

There is also a helper method that combines these to produce human-readable output:

identify_script

Try to detect script and return one of “Japanese”, “Korean”, “Traditional Chinese”, “Simplified Chinese”, “Ambiguous Chinese” or “Unknown”
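One way the predicates could be combined into a label is sketched below. This is only an illustration, not the library's code: label_for is a hypothetical helper, the order in which the checks run is an assumption, and so is the rule that "Ambiguous Chinese" means the traditional and simplified checks either both pass or both fail.

```ruby
# Hypothetical helper: turns predicate results into one of the labels above.
# The precedence (Japanese, then Korean, then Chinese) is an assumption.
def label_for(japanese:, korean:, chinese:, traditional:, simplified:)
  return "Japanese" if japanese
  return "Korean" if korean
  return "Unknown" unless chinese
  return "Traditional Chinese" if traditional && !simplified
  return "Simplified Chinese" if simplified && !traditional
  # Chinese, but the subtype checks agree with each other (assumed rule).
  "Ambiguous Chinese"
end

label_for(japanese: false, korean: false, chinese: true,
          traditional: true, simplified: false)
# => "Traditional Chinese"
```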

It is important to understand that detection requires long sections of text to work reliably, since a single character or even several characters may be valid Japanese, traditional Chinese and simplified Chinese simultaneously. For example, the string 東京 (Tokyo) returns false for japanese? and true for traditional_chinese?, since those two kanji are also valid traditional Chinese.
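The ambiguity can be reproduced with a minimal sketch that classifies text purely by Unicode script properties. It is not the library's actual code: the ScriptSketch module and its methods are hypothetical, and it assumes that Ruby's \p{Hiragana}, \p{Katakana}, \p{Hangul} and \p{Han} regex properties are a reasonable stand-in for the real character lists.

```ruby
# Hypothetical stand-in for the String extensions; not the gem's source.
module ScriptSketch
  KANA   = /[\p{Hiragana}\p{Katakana}]/
  HANGUL = /\p{Hangul}/
  HAN    = /\p{Han}/

  # True only when kana is present; kanji alone cannot prove Japanese.
  def self.japanese?(text)
    !(text =~ KANA).nil?
  end

  # True when at least one hangul character is present.
  def self.korean?(text)
    !(text =~ HANGUL).nil?
  end

  # Han characters with no kana or hangul anywhere in the string.
  def self.chinese?(text)
    !(text =~ HAN).nil? && !japanese?(text) && !korean?(text)
  end
end

ScriptSketch.japanese?("東京")        # false: no kana, so nothing marks it as Japanese
ScriptSketch.chinese?("東京")         # true: both characters are Han
ScriptSketch.japanese?("東京タワー")  # true: the katakana settles the question
```

Longer passages of real Japanese almost always contain kana, which is why detection becomes more reliable as the input grows.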

Details: unicode.org/faq/han_cjk.html#4

Example

> p string
=> "我的氣墊船充滿了鱔魚."
> string.chinese?
=> true
> string.traditional_chinese?
=> true
> string.simplified_chinese?
=> false
> string.japanese?
=> false
> string.korean?
=> false
> string.identify_script
=> "Traditional Chinese"

Implementation

Ruby 1.9 Oniguruma regular expressions are used to determine which script is in use. The lists of simplified and traditional Chinese characters have been drawn from the Unihan database’s Unihan_Variants.txt data set, using the assumption that any character with a kTraditionalVariant is simplified and vice versa.
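The variant-based assumption can be sketched roughly as follows. This is not the library's build script: variant_sets, SAMPLE_LINES and the two regex constants are hypothetical names, and the sample lines are illustrative data written in the tab-separated layout of Unihan_Variants.txt (code point, field name, value) rather than an excerpt from a particular Unihan release.

```ruby
require "set"

# Illustrative lines in the layout of Unihan_Variants.txt (assumed sample data).
SAMPLE_LINES = [
  "# comment lines start with a hash",
  "U+4E07\tkTraditionalVariant\tU+842C",
  "U+842C\tkSimplifiedVariant\tU+4E07",
  "U+4F53\tkTraditionalVariant\tU+9AD4",
  "U+9AD4\tkSimplifiedVariant\tU+4F53"
]

# Returns [simplified, traditional] sets of single-character strings.
def variant_sets(lines)
  simplified = Set.new
  traditional = Set.new
  lines.each do |line|
    next if line.start_with?("#") || line.strip.empty?
    code_point, field, _value = line.split("\t")
    char = [code_point.sub("U+", "").hex].pack("U")
    case field
    when "kTraditionalVariant" then simplified << char   # has a traditional form
    when "kSimplifiedVariant"  then traditional << char  # has a simplified form
    end
  end
  [simplified, traditional]
end

simplified, traditional = variant_sets(SAMPLE_LINES)
SIMPLIFIED_RE  = Regexp.union(simplified.to_a)
TRADITIONAL_RE = Regexp.union(traditional.to_a)

"简体字" =~ SIMPLIFIED_RE    # matches 体, which is in the simplified set
"繁體字" =~ TRADITIONAL_RE   # matches 體, which is in the traditional set
```

Characters that appear in neither set, such as 字 here, carry no information about the subtype, which is one reason short strings can come out as ambiguous.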

Contributing to script_detector

  • Check out the latest master to make sure the feature hasn’t been implemented or the bug hasn’t been fixed yet.

  • Check out the issue tracker to make sure no one has already requested it and/or contributed it.

  • Fork the project.

  • Start a feature/bugfix branch.

  • Commit and push until you are happy with your contribution.

  • Make sure to add tests for it. This is important so I don’t break it in a future version unintentionally.

  • Please try not to mess with the Rakefile, version, or history. If you want to have your own version, or a change there is otherwise necessary, that is fine, but please isolate it in its own commit so I can cherry-pick around it.

Copyright © 2012 Jani Patokallio. See LICENSE.txt for further details.