Yosina Ruby

A Ruby port of the Yosina Japanese text transliteration library.

Overview

Yosina is a library for Japanese text transliteration that provides various text normalization and conversion features commonly needed when processing Japanese text.

Usage

require 'yosina'

# Create a transliterator using a recipe
recipe = Yosina::TransliterationRecipe.new(
  replace_spaces: true,
  kanji_old_new: true,
  replace_circled_or_squared_characters: true,
  replace_combined_characters: true,
  hira_kata: "hira-to-kata",  # Convert hiragana to katakana
  replace_japanese_iteration_marks: true,  # Expand iteration marks
  to_fullwidth: true
)

transliterator = Yosina.make_transliterator(recipe)

# Use it to transliterate text with various special characters
input = "①②③　ⒶⒷⒸ　㍿㍑㌠㋿" # circled numbers, letters, ideographic space, combined characters
result = transliterator.call(input)
puts result # "(1)(2)(3) (A)(B)(C) 株式会社リットルサンチーム令和"

# Convert old kanji to new
old_kanji = "舊字體"
result = transliterator.call(old_kanji)
puts result # "旧字体"

# Convert half-width katakana to full-width
half_width = "ﾃｽﾄﾓｼﾞﾚﾂ"
result = transliterator.call(half_width)
puts result # "テストモジレツ"

# Demonstrate hiragana to katakana conversion with iteration marks
mixed_text = "学問のすゝめ"
result = transliterator.call(mixed_text)
puts result # "学問ノススメ"

Advanced Usage with Configs

require 'yosina'

# Create transliterator with specific configurations
configs = [
  Yosina::TransliteratorConfig.new('spaces'),
  Yosina::TransliteratorConfig.new('kanji-old-new'),
  Yosina::TransliteratorConfig.new('radicals'),
  Yosina::TransliteratorConfig.new('circled-or-squared'),
  Yosina::TransliteratorConfig.new('combined'),
  Yosina::TransliteratorConfig.new('hira-kata', { mode: 'kata-to-hira' }),  # Convert katakana to hiragana
  Yosina::TransliteratorConfig.new('japanese-iteration-marks')  # Expand iteration marks like 々, ゝゞ, ヽヾ
]

transliterator = Yosina.make_transliterator(configs)

# Example combining katakana-to-hiragana conversion with iteration mark expansion
input_text = "カタカナでの時々の佐々木さん"
result = transliterator.call(input_text)
puts result # "かたかなでの時時の佐佐木さん"

Using String Names

require 'yosina'

# Simplified configuration with string names
configs = ['spaces', 'kanji-old-new', 'radicals']

transliterator = Yosina.make_transliterator(configs)
result = transliterator.call("some japanese text")
puts result

Using Individual Transliterators

require 'yosina'

# Create a circled-or-squared transliterator
circled_factory = Yosina::Transliterators::CircledOrSquared
circled_transliterator = circled_factory.call

chars = Yosina::Chars.build_char_array("①②③ⒶⒷⒸ")
result_chars = circled_transliterator.call(chars)
output = Yosina::Chars.from_chars(result_chars)
puts output # "(1)(2)(3)(A)(B)(C)"

# Create a combined transliterator
combined_factory = Yosina::Transliterators::Combined
combined_transliterator = combined_factory.call

chars2 = Yosina::Chars.build_char_array("㍿㍑㌠㋿") # combined characters
result_chars2 = combined_transliterator.call(chars2)
output2 = Yosina::Chars.from_chars(result_chars2)
puts output2 # "株式会社リットルサンチーム令和"

Requirements

  • Ruby 3.0 or later

Installation

Add this line to your application's Gemfile:

gem 'yosina'

And then execute:

$ bundle install

Or install it yourself as:

$ gem install yosina

Available Transliterators

1. Circled or Squared (circled-or-squared)

Converts circled or squared characters to their plain equivalents.

  • Options: templates (custom rendering), includeEmojis (include emoji characters)
  • Example: ①②③ → (1)(2)(3), ㊙㊗ → (秘)(祝)

2. Combined (combined)

Expands combined characters into their individual character sequences.

  • Example: ㍻ (Heisei era) → 平成, ㈱ → (株)

3. Hiragana-Katakana Composition (hira-kata-composition)

Combines decomposed hiragana and katakana (a base character followed by a voiced or semi-voiced sound mark) into their composed equivalents.

  • Options: composeNonCombiningMarks (compose non-combining marks)
  • Example: か + ゙ → が, ヘ + ゜ → ペ
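
The composition step can be approximated with Ruby's built-in Unicode normalization. This is a minimal sketch, not Yosina's implementation: compose_kana is a hypothetical helper, it relies on NFC rather than the library's own tables, and it only handles the combining marks (U+3099 and U+309A), not the non-combining ones covered by composeNonCombiningMarks.

```ruby
# Hypothetical helper for illustration; not part of the Yosina API.
# NFC composes a base kana followed by a combining (han)dakuten
# into the single precomposed character.
def compose_kana(text)
  text.unicode_normalize(:nfc)
end

puts compose_kana("か\u3099") # か followed by the combining dakuten
puts compose_kana("ヘ\u309A") # ヘ followed by the combining handakuten
```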

4. Hiragana-Katakana (hira-kata)

Converts between hiragana and katakana scripts bidirectionally.

  • Options: mode ("hira-to-kata" or "kata-to-hira")
  • Example: ひらがな → ヒラガナ (hira-to-kata)
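
The two modes can be pictured as a shift between the parallel hiragana and katakana blocks. The sketch below uses String#tr from the Ruby standard library; hira_to_kata and kata_to_hira are hypothetical names, and the real transliterator may treat edge cases (such as katakana with no hiragana counterpart) differently.

```ruby
# Hypothetical helpers for illustration; not part of the Yosina API.
# ぁ..ゖ (U+3041..U+3096) and ァ..ヶ (U+30A1..U+30F6) line up one-to-one.
def hira_to_kata(text)
  text.tr("ぁ-ゖ", "ァ-ヶ")
end

def kata_to_hira(text)
  text.tr("ァ-ヶ", "ぁ-ゖ")
end

puts hira_to_kata("ひらがな") # "ヒラガナ"
puts kata_to_hira("カタカナ") # "かたかな"
```

Characters outside the two ranges, such as kanji or the prolonged sound mark, pass through unchanged.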

5. Hyphens (hyphens)

Replaces various dash/hyphen symbols with common ones used in Japanese.

  • Options: precedence (mapping priority order)
  • Available mappings: "ascii", "jisx0201", "jisx0208_90", "jisx0208_90_windows", "jisx0208_verbatim"
  • Example: 2019—2020 (em dash) → 2019-2020

6. Ideographic Annotations (ideographic-annotations)

Replaces ideographic annotations used in traditional Chinese-to-Japanese translation.

  • Example: ㆖㆘ → 上下

7. IVS-SVS Base (ivs-svs-base)

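Adds or removes Ideographic Variation Selectors (IVS) and Standardized Variation Selectors (SVS), converting between base kanji and their variation sequences.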

  • Options: charset, mode ("ivs-or-svs" or "base"), preferSVS, dropSelectorsAltogether
  • Example: 葛󠄀 (葛 + IVS) → 葛

8. Japanese Iteration Marks (japanese-iteration-marks)

Expands iteration marks by repeating the preceding character.

  • Example: 時々 → 時時, いすゞ → いすず
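
The expansion rule can be sketched as follows. This is an illustration under simplifying assumptions, not the library's algorithm: expand_iteration_marks, repeat_for, and kana_repeat are hypothetical names, voicing is derived through Unicode normalization, and a mark is left untouched when the preceding character is of the wrong kind.

```ruby
# Hypothetical helpers for illustration; not part of the Yosina API.
VOICED_MARK = "\u3099" # combining dakuten

def expand_iteration_marks(text)
  out = []
  text.each_char do |ch|
    # Look at the already expanded output so repeated marks keep working.
    out << (repeat_for(ch, out.last) || ch)
  end
  out.join
end

# Returns the replacement for an iteration mark, or nil to keep it as-is.
def repeat_for(mark, prev)
  return nil if prev.nil?

  case mark
  when "々"
    prev if prev.match?(/\A\p{Han}\z/) && prev != "々"
  when "ゝ", "ゞ"
    kana_repeat(prev, voiced: mark == "ゞ") if prev.match?(/\A[ぁ-ゖ]\z/)
  when "ヽ", "ヾ"
    kana_repeat(prev, voiced: mark == "ヾ") if prev.match?(/\A[ァ-ヺ]\z/)
  end
end

# Repeats the unvoiced base kana, adding a dakuten for the voiced marks.
def kana_repeat(prev, voiced:)
  base = prev.unicode_normalize(:nfd)[0]
  return base unless voiced

  composed = (base + VOICED_MARK).unicode_normalize(:nfc)
  composed.length == 1 ? composed : nil
end

puts expand_iteration_marks("時々")   # "時時"
puts expand_iteration_marks("いすゞ") # "いすず"
```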

9. JIS X 0201 and Alike (jisx0201-and-alike)

Handles half-width/full-width character conversion.

  • Options: fullwidthToHalfwidth, convertGL (alphanumerics/symbols), convertGR (katakana), u005cAsYenSign
  • Example: ABC123 → ＡＢＣ１２３, ｶﾀｶﾅ → カタカナ
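
The two halves of this conversion (GL for ASCII alphanumerics and symbols, GR for katakana) can be approximated with the standard library alone. A rough sketch with hypothetical helper names; it ignores options such as u005cAsYenSign and is not how Yosina itself is implemented.

```ruby
# Hypothetical helpers for illustration; not part of the Yosina API.

# GL: printable ASCII (U+0021..U+007E) maps onto the fullwidth forms
# U+FF01..U+FF5E at a fixed offset.
def ascii_to_fullwidth(text)
  text.tr("!-~", "\uFF01-\uFF5E")
end

# GR: NFKC turns half-width katakana into full-width katakana and also
# composes a following half-width voiced mark (ｼ + ﾞ becomes ジ).
def halfwidth_kana_to_fullwidth(text)
  text.gsub(/[\uFF61-\uFF9F]+/) { |run| run.unicode_normalize(:nfkc) }
end

puts ascii_to_fullwidth("ABC123")
puts halfwidth_kana_to_fullwidth("ｶﾀｶﾅ")
```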

10. Kanji Old-New (kanji-old-new)

Converts old-style kanji (旧字体) to modern forms (新字体).

  • Example: 舊字體の變換 → 旧字体の変換

11. Mathematical Alphanumerics (mathematical-alphanumerics)

Normalizes mathematical alphanumeric symbols to plain ASCII.

  • Example: 𝐀𝐁𝐂 (mathematical bold) → ABC

12. Prolonged Sound Marks (prolonged-sound-marks)

Handles contextual conversion between hyphens and prolonged sound marks.

  • Options: skipAlreadyTransliteratedChars, allowProlongedHatsuon, allowProlongedSokuon, replaceProlongedMarksFollowingAlnums
  • Example: イ−ハト−ヴォ (with hyphen) → イーハトーヴォ (prolonged mark)
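
The contextual part of this rule is that a hyphen-like character only becomes a prolonged sound mark when it follows kana. A deliberately simplified sketch, assuming only katakana context and a small set of hyphen code points; hyphens_to_prolonged_marks is a hypothetical name and none of the options above are modeled.

```ruby
# Hypothetical helper for illustration; not part of the Yosina API.
# Matches ASCII hyphen, U+2010, U+2212, or U+FF0D directly after katakana.
HYPHEN_AFTER_KATAKANA = /(?<=[ァ-ヺ])[\-\u2010\u2212\uFF0D]/

def hyphens_to_prolonged_marks(text)
  text.gsub(HYPHEN_AFTER_KATAKANA, "ー")
end

puts hyphens_to_prolonged_marks("イ\u2212ハト\u2212ヴォ") # "イーハトーヴォ"
puts hyphens_to_prolonged_marks("A-B")                    # "A-B"
```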

13. Radicals (radicals)

Converts CJK radical characters to their corresponding ideographs.

  • Example: ⾔⾨⾷ (Kangxi radicals) → 言門食

14. Spaces (spaces)

Normalizes various Unicode space characters to standard ASCII space.

  • Example: A　B (ideographic space) → A B
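
As a rough picture of what this normalization involves, the sketch below replaces a handful of Unicode space characters with U+0020. The character list is an assumption chosen for illustration and is likely narrower than the set Yosina actually covers; normalize_spaces is a hypothetical name.

```ruby
# Hypothetical helper for illustration; not part of the Yosina API.
# No-break space, U+2000..U+200A, narrow no-break space,
# medium mathematical space, and ideographic space.
UNICODE_SPACES = /[\u00A0\u2000-\u200A\u202F\u205F\u3000]/

def normalize_spaces(text)
  text.gsub(UNICODE_SPACES, " ")
end

puts normalize_spaces("A\u3000B") # "A B"
```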

15. Roman Numerals (roman-numerals)

Converts Unicode Roman numeral characters to their ASCII letter equivalents.

  • Example: Ⅰ Ⅱ Ⅲ → I II III, ⅰ ⅱ ⅲ → i ii iii
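
Unicode already defines compatibility mappings from the Roman numeral characters (U+2160..U+217F) to ASCII letters, so the effect can be reproduced with NFKC restricted to that range. This is a sketch of the idea, not the library's own mapping table, and roman_numerals_to_ascii is a hypothetical name.

```ruby
# Hypothetical helper for illustration; not part of the Yosina API.
# Only the Roman numeral block is normalized, so other characters
# that NFKC would change are left alone.
def roman_numerals_to_ascii(text)
  text.gsub(/[\u2160-\u217F]/) { |ch| ch.unicode_normalize(:nfkc) }
end

puts roman_numerals_to_ascii("\u2160 \u2161 \u2162") # "I II III"
puts roman_numerals_to_ascii("\u2170 \u2171 \u2172") # "i ii iii"
```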

Development

After checking out the repo, run bundle install to install dependencies.

Code Generation

Some transliterators are generated from data files:

rake codegen

This generates transliterators from the JSON data files in the ../data directory.

Testing

Run the test suite with:

rake test

Or run specific tests:

ruby test/test_basic.rb

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/yosina-lib/yosina.

License

The gem is available as open source under the terms of the MIT License.