Module: ZhongwenTools::Romanization::Pinyin
- Defined in:
- lib/zhongwen_tools/romanization/pinyin.rb
Overview
Public: methods to convert, detect, and split pinyin or
pyn (pinyin with numbers, e.g. hao3).
Class Method Summary
- .add_hyphens_to_pyn(str) ⇒ Object
- .are_all_pyn_syllables_complete?(pyn_arr) ⇒ Boolean
- .capitalized?(str) ⇒ Boolean
- .convert_pinyin_to_pyn(pinyin) ⇒ Object
  Internal: converts real pinyin to pinyin number string.
- .convert_pyn_to_pinyin(str) ⇒ Object
  Internal: Replaces numbered pinyin with actual pinyin.
- .current_pyn(pyn, pinyin_arr) ⇒ Object
- .find_py(str) ⇒ Object
- .normalize_n(pinyin) ⇒ Object
- .normalize_n_g(pinyin) ⇒ Object
  NOTE: Special Case split_py(“fǎnguāng”) # => [“fǎn” + “guāng”]. In pinyin, sāngēng == sān gēng and sāng’ēng == sāng ēng.
- .normalize_pinyin(pinyin) ⇒ Object
- .not_hyphen_regex ⇒ Object
- .pinyin_replacement(py) ⇒ Object
- .py?(str) ⇒ Boolean
  Public: checks if a string is pinyin.
- .py_type(romanization) ⇒ Object
- .pyn?(str) ⇒ Boolean
  Public: checks if a string is numbered pinyin (pyn).
- .pyn_matches_properly?(pyn_arr, normalized_str) ⇒ Boolean
- .recapitalize(obj, capitalized) ⇒ Object
- .select_pinyin_match(matches) ⇒ Object
- .simple_tone_numbers ⇒ Object
- .split_py(str) ⇒ Object
- .split_pyn(str) ⇒ Object
Class Method Details
.add_hyphens_to_pyn(str) ⇒ Object
# File 'lib/zhongwen_tools/romanization/pinyin.rb', line 108

def self.add_hyphens_to_pyn(str)
  results = str.split(' ').map do |s|
    split_pyn(s).join('-')
  end

  results.join(' ')
end
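The hyphenation step can be tried in isolation with a rough sketch. The splitter below is a hypothetical stand-in for split_pyn (it simply treats a run of letters followed by a tone digit 1-5 as one syllable), so it illustrates the shape of the method rather than the library's actual splitting rules.

```ruby
# Hypothetical stand-in for split_pyn: one run of letters plus a tone digit.
def split_numbered_sketch(word)
  word.scan(/[a-zü]+[1-5]/i)
end

# Same structure as add_hyphens_to_pyn: split on spaces, hyphenate each word.
def add_hyphens_sketch(str)
  str.split(' ').map { |word| split_numbered_sketch(word).join('-') }.join(' ')
end

puts add_hyphens_sketch('ni3hao3 peng2you3') # prints "ni3-hao3 peng2-you3"
```

In this sketch, anything that does not look like a numbered syllable (punctuation, syllables without a tone digit) is silently dropped by scan.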
.are_all_pyn_syllables_complete?(pyn_arr) ⇒ Boolean
# File 'lib/zhongwen_tools/romanization/pinyin.rb', line 128

def self.are_all_pyn_syllables_complete?(pyn_arr)
  pyns = ROMANIZATIONS_TABLE.map { |r| r[:pyn] } + PYN_SYLLABIC_NASALS

  pyn_syllables = pyn_arr.select do |p|
    pyns.include?(p.gsub(/[1-5]/, ''))
  end

  pyn_arr.size == pyn_syllables.size
end
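A minimal sketch of the same completeness check, assuming a tiny hand-written syllable list in place of ROMANIZATIONS_TABLE and PYN_SYLLABIC_NASALS (KNOWN_SYLLABLES is hypothetical and far from complete):

```ruby
# Hypothetical sample of valid toneless syllables, including syllabic nasals.
KNOWN_SYLLABLES = %w[ni hao ma peng you m n ng].freeze

# True when every entry, with its tone digit removed, is a known syllable.
def all_syllables_complete_sketch?(pyn_arr)
  pyn_arr.all? { |syllable| KNOWN_SYLLABLES.include?(syllable.gsub(/[1-5]/, '')) }
end

all_syllables_complete_sketch?(%w[ni3 hao3]) # => true
all_syllables_complete_sketch?(%w[ni3 ha3])  # => false, "ha" is not in the sample list
```

Using all? gives the same answer as comparing the size of the selected syllables with the size of the input.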
.capitalized?(str) ⇒ Boolean
# File 'lib/zhongwen_tools/romanization/pinyin.rb', line 203

def self.capitalized?(str)
  first_letter = str[ZhongwenTools::Regex.][0]

  first_letter != Caps.downcase(first_letter)
end
.convert_pinyin_to_pyn(pinyin) ⇒ Object
Internal: converts real pinyin to pinyin number string.
pinyin - A String for the pinyin.
Examples
convert_pinyin_to_pyn('Nǐ hǎo ma') #=> 'Ni3 hao3 ma5?'
Returns a String in pinyin number format.
# File 'lib/zhongwen_tools/romanization/pinyin.rb', line 188

def self.convert_pinyin_to_pyn(pinyin)
  words = pinyin.split(' ')

  pyn = words.map do |word|
    # NOTE: if a word is upcase, then it will be converted the same
    #       as a word that is only capitalized.
    word, is_capitalized = normalize_pinyin(word)
    pys = split_py(word)

    recapitalize(current_pyn(word, pys), is_capitalized)
  end

  pyn.join(' ')
end
.convert_pyn_to_pinyin(str) ⇒ Object
Internal: Replaces numbered pinyin with actual pinyin. Pinyin separated with hyphens are combined as one word.
str - A String to replace with actual pinyin.
Examples
convert_pyn_to_pinyin('Ni3 hao3 ma5?') # => "Nǐ hǎo ma?"
Returns a String with actual pinyin.
# File 'lib/zhongwen_tools/romanization/pinyin.rb', line 245

def self.convert_pyn_to_pinyin(str)
  regex = Regex.
  # NOTE: Using gsub is ~8x faster than using scan and each.
  # NOTE: if it's pinyin without vowels, e.g. m, ng, then convert,
  #       otherwise, check if it needs an apostrophe (http://www.pinyin.info/romanization/hanyu/apostrophes.html).
  #       If it does, add it and then convert. Otherwise, just convert it.
  #       Oh, and if it has double hyphens, replace with one hyphen.
  #       And finally, correct those apostrophes at the very end.
  #       It's like magic.
  str.gsub(regex) do
    ($3.nil? ? "#{ PYN_PY[$1] }" : ($2 == '' && %w(a e o).include?($3[0, 1])) ? "'#{ PYN_PY["#{ $3 }#{ $6 }"]}#{ $4 }#{ $5 }" : "#{ $2 }#{ PYN_PY["#{ $3 }#{ $6 }"] }#{ $4 }#{ $5 }") + (($7.to_s.length > 1) ? '-' : '')
  end.gsub("-'", '-').sub(/^'/, '').gsub(" '", ' ')
end
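For readers who want to see the numbered-to-marked conversion without the PYN_PY lookup table, here is a rough sketch that uses the usual tone-placement rule instead (mark "a" or "e" if present, the "o" of "ou", otherwise the last vowel). This is a swapped-in technique, not the method above: TONE_MARKS and all three helper names are hypothetical, and apostrophes, hyphens, and capitalized vowels are not handled.

```ruby
# Tone-marked forms for tones 1-4, indexed by tone - 1.
TONE_MARKS = {
  'a' => %w[ā á ǎ à], 'e' => %w[ē é ě è], 'i' => %w[ī í ǐ ì],
  'o' => %w[ō ó ǒ ò], 'u' => %w[ū ú ǔ ù], 'ü' => %w[ǖ ǘ ǚ ǜ]
}.freeze

# Picks the vowel that carries the mark: "a" or "e" if present,
# the "o" of "ou", otherwise the last vowel of the syllable.
def marked_vowel(syllable)
  return syllable[/[ae]/] if syllable =~ /[ae]/
  return 'o' if syllable.include?('ou')

  syllable.scan(/[iouü]/).last
end

def mark_syllable(syllable, tone)
  vowel = marked_vowel(syllable)
  # Tone 5 is neutral; syllables with no lowercase vowel are left alone.
  return syllable if tone == 5 || vowel.nil?

  syllable.sub(vowel, TONE_MARKS[vowel][tone - 1])
end

def numbered_to_marked_sketch(str)
  str.gsub(/([a-zü]+)([1-5])/i) do
    mark_syllable(Regexp.last_match(1), Regexp.last_match(2).to_i)
  end
end

puts numbered_to_marked_sketch('Ni3 hao3 ma5?') # prints "Nǐ hǎo ma?"
```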
.current_pyn(pyn, pinyin_arr) ⇒ Object
# File 'lib/zhongwen_tools/romanization/pinyin.rb', line 209

def self.current_pyn(pyn, pinyin_arr)
  replace = {}
  pinyin_arr.map { |pinyin| replace[pinyin] = pinyin_replacement(pinyin) }

  pyn.gsub(/#{pinyin_arr.join('|')}/, replace).gsub("''", '')
end
.find_py(str) ⇒ Object
# File 'lib/zhongwen_tools/romanization/pinyin.rb', line 164

def self.find_py(str)
  regex = ZhongwenTools::Regex.find_py_regex
  str.scan(regex).map { |x| x.compact[0] }
end
.normalize_n(pinyin) ⇒ Object
# File 'lib/zhongwen_tools/romanization/pinyin.rb', line 153

def self.normalize_n(pinyin)
  # Special Case split_py("yìnián") # => ["yì" + "nián"]
  #              split_py("Xīní") # => ["Xī", "ní"]
  regex = /#{Regex.only_tones}(n(#{Regex.py_tones['v']}|#{Regex.py_tones['i']}|[iu]|#{Regex.py_tones['e']}|[#{Regex.py_tones['a']}]))/
  pinyin.gsub(regex) { "#{$1}-#{$2}" }
end
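The effect of this normalization on the two special cases can be reproduced with a small sketch. The character sets below are assumptions standing in for Regex.only_tones and Regex.py_tones, whose exact contents are not shown here.

```ruby
# Assumed stand-in for Regex.only_tones: any tone-marked vowel.
TONED_VOWELS = 'āáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜ'
# Assumed vowels that may follow "n" at the start of the next syllable.
AFTER_N = 'iuīíǐìēéěèāáǎàǖǘǚǜ'

# Inserts a hyphen between a tone-marked vowel and a following "n" + vowel,
# so that a later split on hyphens keeps the "n" with the second syllable.
def normalize_n_sketch(pinyin)
  pinyin.gsub(/([#{TONED_VOWELS}])(n[#{AFTER_N}])/) do
    "#{Regexp.last_match(1)}-#{Regexp.last_match(2)}"
  end
end

normalize_n_sketch('yìnián') # => "yì-nián"
normalize_n_sketch('Xīní')   # => "Xī-ní"
```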
.normalize_n_g(pinyin) ⇒ Object
NOTE: Special Case split_py(“fǎnguāng”) # => [“fǎn” + “guāng”]
In pinyin, sāngēng == sān gēng and sāng’ēng == sāng ēng.
# File 'lib/zhongwen_tools/romanization/pinyin.rb', line 146

def self.normalize_n_g(pinyin)
  regex = /(?<n_part>n)(?<g_part>g(#{Regex.py_tones['o']}|#{Regex.py_tones['u']}|#{Regex.py_tones['a']}|#{Regex.py_tones['e']}))/
  pinyin.gsub(regex) do
    "#{Regexp.last_match[:n_part]}-#{Regexp.last_match[:g_part]}"
  end
end
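The sāngēng rule from the note can be sketched as follows. The character class is an assumption (tone-marked o, u, a, and e only), since the real Regex.py_tones entries are not shown here.

```ruby
# Splits "n" + "g" + tone-marked vowel as "n-g...", i.e. the "g" is read as
# the start of the next syllable unless an apostrophe says otherwise.
def normalize_n_g_sketch(pinyin)
  pinyin.gsub(/(?<n_part>n)(?<g_part>g[ōóǒòūúǔùāáǎàēéěè])/) do
    match = Regexp.last_match
    "#{match[:n_part]}-#{match[:g_part]}"
  end
end

normalize_n_g_sketch('sāngēng')  # => "sān-gēng"
normalize_n_g_sketch("sāng'ēng") # => "sāng'ēng" (unchanged, the apostrophe already marks the break)
```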
.normalize_pinyin(pinyin) ⇒ Object
# File 'lib/zhongwen_tools/romanization/pinyin.rb', line 160

def self.normalize_pinyin(pinyin)
  [Caps.downcase(pinyin), capitalized?(pinyin)]
end
.not_hyphen_regex ⇒ Object
# File 'lib/zhongwen_tools/romanization/pinyin.rb', line 120

def self.not_hyphen_regex
  @not_hyphen_regex ||= /[^\-]*/
end
.pinyin_replacement(py) ⇒ Object
# File 'lib/zhongwen_tools/romanization/pinyin.rb', line 216

def self.pinyin_replacement(py)
  matches = PYN_PY.values.select do |x|
    py.include? x
  end
  match = select_pinyin_match(matches)
  replace = PYN_PY.find { |k, v| k if v == match }[0]

  py.gsub(match, replace).gsub(/([^\d ]*)(\d)([^\d ]*)/) { $1 + $3 + $2 }
end
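The final gsub is the step that moves the tone digit from the middle of the syllable to its end. A minimal sketch, assuming a two-entry stand-in for PYN_PY (SAMPLE_PYN_PY is hypothetical; the real table maps many more numbered forms to marked forms):

```ruby
# Hypothetical sample of PYN_PY: numbered form => tone-marked form.
SAMPLE_PYN_PY = { 'a3' => 'ǎ', 'i3' => 'ǐ' }.freeze

def replacement_sketch(py)
  marked = SAMPLE_PYN_PY.values.find { |v| py.include?(v) }
  return py if marked.nil?

  numbered = SAMPLE_PYN_PY.key(marked)
  # 'hǎo' -> 'ha3o', then the digit is moved behind the trailing letters.
  py.gsub(marked, numbered).gsub(/([^\d ]*)(\d)([^\d ]*)/) { $1 + $3 + $2 }
end

replacement_sketch('hǎo') # => "hao3"
replacement_sketch('nǐ')  # => "ni3"
```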
.py?(str) ⇒ Boolean
Public: checks if a string is pinyin.
http://en.wikipedia.org/wiki/
Examples
py?('nǐ hǎo')
# => true
Returns Boolean.
# File 'lib/zhongwen_tools/romanization/pinyin.rb', line 77

def self.py?(str)
  if str[Regex.only_tones].nil? && str[/[1-5]/].nil?
    pyn?(str)
  else
    # TODO: py regex does not include capitals with tones.
    # NOTE: Special Case "fǎnguāng" should be "fǎn" + "guāng"
    regex = /(#{ Regex.punc }|#{ Regex.py }|#{ Regex.py_syllabic_nasals }|[\s\-])/
    str = str.gsub('ngu', 'n-gu')
    Caps.downcase(str).gsub(regex, '').strip == ''
  end
end
.py_type(romanization) ⇒ Object
# File 'lib/zhongwen_tools/romanization/pinyin.rb', line 138

def self.py_type(romanization)
  romanization = romanization.to_s.downcase.to_sym

  { pyn: :pyn, py: :py, pinyin: :py }[romanization]
end
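Because py_type depends only on core Ruby, its behavior can be checked with a standalone copy. The name py_type_sketch is hypothetical; the body mirrors the method above.

```ruby
# Normalizes a romanization name (String or Symbol, any case) to :py or :pyn.
def py_type_sketch(romanization)
  { pyn: :pyn, py: :py, pinyin: :py }[romanization.to_s.downcase.to_sym]
end

py_type_sketch('Pinyin')     # => :py
py_type_sketch(:PYN)         # => :pyn
py_type_sketch('wade-giles') # => nil, unknown names are not mapped
```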
.pyn?(str) ⇒ Boolean
Public: checks if a string is numbered pinyin (pyn).
Examples
pyn?('pin1-yin1')
# => true
Returns Boolean.
# File 'lib/zhongwen_tools/romanization/pinyin.rb', line 97

def self.pyn?(str)
  return false if str =~ /a{2,}|e{2,}|i{2,}|o{2,}|u{2,}/
  # FIXME: use strip_punctuation method, e.g. gsub(/\p{Punct}/, '')
  normalized_str = Caps.downcase(str.gsub(Regex.punc, '').gsub(/[\s\-]/, ''))
  pyn_arr = split_pyn(normalized_str).map { |p| p }
  pyn_arr << normalized_str if pyn_arr.size == 0 && PYN_SYLLABIC_NASALS.include?(normalized_str.gsub(/[1-5]/, ''))

  pyn_matches_properly?(pyn_arr, normalized_str) && are_all_pyn_syllables_complete?(pyn_arr)
end
.pyn_matches_properly?(pyn_arr, normalized_str) ⇒ Boolean
# File 'lib/zhongwen_tools/romanization/pinyin.rb', line 124

def self.pyn_matches_properly?(pyn_arr, normalized_str)
  pyn_arr.join('') == normalized_str
end
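This comparison is a round-trip check: if the split syllables, joined back together, do not reproduce the normalized input, the splitter skipped some characters. A rough sketch, assuming a simplified scan in place of split_pyn:

```ruby
# Hypothetical splitter: a run of letters followed by a tone digit 1-5.
def round_trips_sketch?(str)
  syllables = str.scan(/[a-zü]+[1-5]/i)
  syllables.join == str
end

round_trips_sketch?('ni3hao3') # => true
round_trips_sketch?('ni3hao')  # => false, "hao" has no tone digit so scan skips it
```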
.recapitalize(obj, capitalized) ⇒ Object
# File 'lib/zhongwen_tools/romanization/pinyin.rb', line 169

def self.recapitalize(obj, capitalized)
  return obj unless capitalized

  if obj.is_a? String
    Caps.capitalize(obj)
  elsif obj.is_a? Array
    [Caps.capitalize(obj[0]), obj[1..-1]].flatten
  end
end
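A minimal sketch of the same behavior, using core String#capitalize as a stand-in for Caps.capitalize (an assumption; the two may differ for accented letters):

```ruby
# Restores the leading capital on a String, or on the first element of an Array.
def recapitalize_sketch(obj, capitalized)
  return obj unless capitalized

  case obj
  when String then obj.capitalize
  when Array  then [obj[0].capitalize, obj[1..-1]].flatten
  end
end

recapitalize_sketch(%w[ni3 hao3], true) # => ["Ni3", "hao3"]
recapitalize_sketch('ni3', true)        # => "Ni3"
recapitalize_sketch('ni3', false)       # => "ni3"
```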
.select_pinyin_match(matches) ⇒ Object
# File 'lib/zhongwen_tools/romanization/pinyin.rb', line 227

def self.select_pinyin_match(matches)
  # take the longest pinyin match. Use bytes because 'è' is preferred over 'n' or 'r' or 'm'
  match = matches.sort { |x, y| x.bytes.to_a.length <=> y.bytes.to_a.length }[-1]

  # Edge case.. en/eng pyn -> py conversion is one way only.
  match[/^(ē|é|ě|è|e)n?g?/].nil? ? match : match.chars[0]
end
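The byte-length comparison matters because a tone-marked vowel such as 'è' is one character but two bytes in UTF-8, so it outranks a one-byte consonant like 'n'. A rough sketch of the selection, with max_by swapped in for sort (ties may therefore resolve differently than in the method above):

```ruby
# Picks the candidate with the most bytes; candidates starting with an
# "e" vowel are reduced to that vowel (the en/eng edge case).
def select_match_sketch(matches)
  longest = matches.max_by(&:bytesize)
  longest =~ /\A[ēéěèe]n?g?/ ? longest[0] : longest
end

select_match_sketch(%w[n è])   # => "è"
select_match_sketch(%w[è èng]) # => "è"
select_match_sketch(%w[a āng]) # => "āng"
```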
.simple_tone_numbers ⇒ Object
# File 'lib/zhongwen_tools/romanization/pinyin.rb', line 116

def self.simple_tone_numbers
  @simple_tone_numbers ||= /[1-5]/
end
.split_py(str) ⇒ Object
# File 'lib/zhongwen_tools/romanization/pinyin.rb', line 51

def self.split_py(str)
  words = str.split(' ')

  words.flat_map do |word|
    word, is_capitalized = normalize_pinyin(word)
    word = normalize_n_g(word)
    word = normalize_n(word)
    result = word.split(/['\-]/).flatten.map do |x|
      find_py(x)
    end

    # NOTE: Special Case split_py('wányìr') # => ['wán', 'yì', 'r']
    result << 'r' unless word[/(.*[^#{ Regex.py_tones['e'] }.])(r)$/].nil?

    recapitalize(result.flatten, is_capitalized)
  end
end
.split_pyn(str) ⇒ Object
# File 'lib/zhongwen_tools/romanization/pinyin.rb', line 39

def self.split_pyn(str)
  # NOTE: This method is called quite frequently. Unfortunately, it was
  #       slower than it needed to be. After looking into several
  #       optimizations, I ended up settling on one that cached the Regexp
  #       creation.
  # FIXME: ignore punctuation
  regex = str[simple_tone_numbers].nil? ? Regex. : Regex.pyn_and_pynt
  # NOTE: Fast Ruby: p[/[^\-]*/].to_s is 25% faster than gsub('-', '')
  strip_regex = not_hyphen_regex
  str.scan(regex).flat_map { |arr| arr[0].strip[strip_regex].to_s }
end