Module: WenlinDbScanner::Chars
Defined in: lib/wenlin_db_scanner/chars.rb
Overview
Parses the data in the character (hanzi) databases.
Class Method Summary
- ._hanzi(db_file) ⇒ Enumerator<Hash>
  Decoder for a CDL database.
- ._hz_en(db_file) ⇒ Enumerator<DictEntry>
  Decoder for a database of hanzi -> English meaning entries.
- .cdl_attribute_value(key, raw_value) ⇒ Integer, ...
  Decodes known attributes for CDL XML elements.
- .hanzi(db_root) ⇒ Enumerator<Hash>
  The entries in the database that breaks down hanzi into components.
- .hz_en(db_root) ⇒ Enumerator<CharMeaning>
  The entries in the hanzi -> English meaning dictionary.
- .pinyin_to_latin(pinyin) ⇒ String
  Removes the accents from a pinyin string.
Class Method Details
._hanzi(db_file) ⇒ Enumerator<Hash>
Decoder for a CDL database.
    # File 'lib/wenlin_db_scanner/chars.rb', line 21

    def self._hanzi(db_file)
      Enumerator.new do |yielder|
        db = Db.new db_file
        db.records.each do |record|
          next if record.binary?

          xml = REXML::Document.new record.text
          entry = {}
          xml.root.attributes.each do |name, raw_value|
            key = name.to_sym
            entry[key] = cdl_attribute_value key, raw_value
          end
          entry[:parts] = xml.root.elements.map do |element|
            part = { part: element.name.to_sym }
            element.attributes.each do |name, raw_value|
              key = name.to_sym
              part[key] = cdl_attribute_value key, raw_value
            end
            part
          end
          yielder << entry
        end
      end
    end
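The hash-building step in the listing above can be tried on its own. The following is a minimal sketch, not the gem's code: `cdl_record_to_hash` and `SAMPLE_CDL` are hypothetical names, the sample XML is invented for illustration (real records in cdl.db may look different), and attribute values are kept as raw strings instead of being passed through cdl_attribute_value.

```ruby
require 'rexml/document'

# Hypothetical CDL record text, made up for this example.
SAMPLE_CDL = '<cdl char="好" uni="597D">' \
             '<comp char="女" uni="5973" points="0,0 50,128"/>' \
             '<comp char="子" uni="5B50" points="46,0 128,128"/>' \
             '</cdl>'

# Follows the hash-building steps of _hanzi, but keeps attribute values raw.
def cdl_record_to_hash(text)
  xml = REXML::Document.new text
  entry = {}
  # Root attributes become top-level keys.
  xml.root.attributes.each do |name, raw_value|
    entry[name.to_sym] = raw_value
  end
  # Each child element becomes one hash in :parts, tagged with its element name.
  entry[:parts] = xml.root.elements.map do |element|
    part = { part: element.name.to_sym }
    element.attributes.each do |name, raw_value|
      part[name.to_sym] = raw_value
    end
    part
  end
  entry
end

entry = cdl_record_to_hash(SAMPLE_CDL)
puts entry[:char]
puts entry[:parts].length
```

Each yielded entry is therefore a flat hash of the root element's attributes plus a :parts array with one hash per child element.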
._hz_en(db_file) ⇒ Enumerator<DictEntry>
Decoder for a database of hanzi -> English meaning entries.
    # File 'lib/wenlin_db_scanner/chars.rb', line 82

    def self._hz_en(db_file)
      Enumerator.new do |yielder|
        db = Db.new db_file
        db.records.each do |record|
          next if record.binary?
          lines = record.text.split("\n").map(&:strip).reject(&:empty?)

          header = lines[0]
          entry = CharMeaning.new
          entry.char = header[0, 1]
          header = header[1..-1]
          entry. = header.scan(/\[([^\]]*)\]/).
              map { |match| match.first.strip }
          entry. = entry..map { || }
          header.gsub!(/\[[^\]]*\]/, '')
          header.strip!
          header.scan(/\([^\)]+\)/).each do |aside|
            aside_text = aside[1...-1]
            case aside_text[0]
            when '='
              entry.variants = aside_text[1..-1].chars.
                  reject { |c| c.codepoints.first < 128 }
              header.gsub! aside, ''
            when '!', '?'
              entry. ||= []
              entry. += aside_text[1..-1].chars.
                  reject { |c| c.codepoints.first < 128 }
              header.gsub! aside, ''
            when 'F'
              entry.complex_forms = aside_text[1..-1].chars.
                  reject { |c| c.codepoints.first < 128 }
              header.gsub! aside, ''
            when 'S'
              entry.simplified_forms = aside_text[1..-1].chars.
                  reject { |c| c.codepoints.first < 128 }
              header.gsub! aside, ''
            when 'u', 'U'
              if /^Unihan/i =~ aside_text
                header.gsub! aside, ''
              end
            end
          end
          header.strip!

          # Many definitions start with a (note).
          if note_match = /^\(([^\)]*)\)/.match(header)
            entry.note = note_match[1]
            header = header[note_match[0].length..-1].strip
          end
          entry.meaning = header.gsub(/\s*<hr\s*\/?>\s*/, "\n")

          lines[1..-1].each do |line|
            unless line[0] == ?#
              if entry.note
                entry.note << "/ #{line}"
              else
                entry.note = line
              end
              next
            end

            tag, data = line[1], line[2..-1].strip
            case tag
            when 'c'
              entry.components = data.chars.
                  reject { |c| c.codepoints.first < 128 }
            when 'r'
              # NOTE: skipping remarks
            when 'y'
              entry.cantonese = data
            end
          end
          yielder << entry
        end
      end
    end
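To make the header handling easier to follow, here is a reduced sketch of its first steps: take the leading character, collect the bracketed readings, pull out a `(=...)` variants aside, and treat what remains as the meaning. `parse_header` is a hypothetical helper, the sample header is invented (the real zidian.db format may differ), and only the '=' aside is handled.

```ruby
# Splits a header line into its parts, following the first steps of _hz_en.
# Only the '=' (variants) aside is handled in this sketch.
def parse_header(header)
  char = header[0, 1]
  rest = header[1..-1]

  # Readings are the bracketed groups, e.g. "[hǎo]".
  readings = rest.scan(/\[([^\]]*)\]/).map { |match| match.first.strip }
  rest = rest.gsub(/\[[^\]]*\]/, '').strip

  variants = []
  rest.scan(/\([^\)]+\)/).each do |aside|
    aside_text = aside[1...-1]
    next unless aside_text[0] == '='
    # Keep only non-ASCII characters, as the listing above does.
    variants = aside_text[1..-1].chars.reject { |c| c.codepoints.first < 128 }
    rest = rest.gsub(aside, '')
  end

  { char: char, readings: readings, variants: variants, meaning: rest.strip }
end

# Hypothetical header line, made up for this example.
p parse_header('好 [hǎo] [hào] (=恏) good; fine')
```

The `codepoints.first < 128` filter is what discards separators and other ASCII characters inside an aside, so only the hanzi themselves are kept.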
.cdl_attribute_value(key, raw_value) ⇒ Integer, ...
Decodes known attributes for CDL XML elements.
    # File 'lib/wenlin_db_scanner/chars.rb', line 53

    def self.cdl_attribute_value(key, raw_value)
      case key
      when :points  # coordinates
        raw_value.split(' ').map do |pair|
          pair.split(',').map { |coord| coord.strip.to_i }
        end
      when :radical  # dictionary radicals?
        raw_value.strip.split(' ').map(&:strip)
      when :type  # stroke type
        raw_value.strip.to_sym
      when :uni  # unicode value
        raw_value.strip.to_i(16)
      else
        raw_value.strip
      end
    end
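As an illustration of what each branch returns, the sketch below copies the decoding logic into a standalone method so it runs without the gem. `decode_cdl_attribute` is a hypothetical name, and the sample values are invented rather than taken from a real database.

```ruby
# Standalone copy of the decoding rules shown above, for experimentation only.
def decode_cdl_attribute(key, raw_value)
  case key
  when :points   # "x,y x,y ..." becomes an array of [x, y] integer pairs
    raw_value.split(' ').map do |pair|
      pair.split(',').map { |coord| coord.strip.to_i }
    end
  when :radical  # space-separated list of strings
    raw_value.strip.split(' ').map(&:strip)
  when :type     # stroke type becomes a symbol
    raw_value.strip.to_sym
  when :uni      # hexadecimal code point becomes an Integer
    raw_value.strip.to_i(16)
  else           # anything else is only stripped
    raw_value.strip
  end
end

p decode_cdl_attribute(:points, '0,0 50,128')  # => [[0, 0], [50, 128]]
p decode_cdl_attribute(:uni, '597D')           # => 22909
p decode_cdl_attribute(:type, ' h ')           # => :h
```

Unknown keys fall through to the else branch, so an attribute the decoder does not know about still comes back as a trimmed string.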
.hanzi(db_root) ⇒ Enumerator<Hash>
The entries in the database that breaks down hanzi into components.
    # File 'lib/wenlin_db_scanner/chars.rb', line 13

    def self.hanzi(db_root)
      _hanzi File.join(db_root, 'cdl.db')
    end
.hz_en(db_root) ⇒ Enumerator<CharMeaning>
The entries in the hanzi -> English meaning dictionary.
    # File 'lib/wenlin_db_scanner/chars.rb', line 74

    def self.hz_en(db_root)
      _hz_en File.join(db_root, 'zidian.db')
    end
.pinyin_to_latin(pinyin) ⇒ String
Removes the accents from a pinyin string.
Computes the closest Latin-alphabet string for the given pinyin string by dropping tone marks and mapping ü to v. This is what users will most likely type to refer to the character, word, or phrase spelled by the pinyin string.
    # File 'lib/wenlin_db_scanner/chars.rb', line 172

    def self.pinyin_to_latin(pinyin)
      pinyin.tr 'āēīōūǖĀĒĪŌŪǕáéíóúǘÁÉÍÓÚǗǎěǐǒǔǚǍĚǏǑǓǙàèìòùǜÀÈÌÒÙǛüÜ',
                'aeiouvAEIOUVaeiouvAEIOUVaeiouvAEIOUVaeiouvAEIOUVvV'
    end
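A short sketch of the conversion in use, written as a standalone method so it runs without the gem. `to_latin` and the two constants are hypothetical names; the translation tables are the ones from the listing above, and the sample strings are illustrative.

```ruby
# Same character tables as the listing above, one target letter per source letter.
ACCENTED = 'āēīōūǖĀĒĪŌŪǕáéíóúǘÁÉÍÓÚǗǎěǐǒǔǚǍĚǏǑǓǙàèìòùǜÀÈÌÒÙǛüÜ'
PLAIN    = 'aeiouvAEIOUVaeiouvAEIOUVaeiouvAEIOUVaeiouvAEIOUVvV'

# Standalone stand-in for pinyin_to_latin, built on String#tr.
def to_latin(pinyin)
  pinyin.tr ACCENTED, PLAIN
end

puts to_latin('nǐ hǎo')  # prints "ni hao"
puts to_latin('lǜ')      # prints "lv"
```

String#tr maps characters by position, so the two tables must stay the same length. Characters that are not in the first table, such as spaces and unaccented letters, pass through unchanged.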