Module: WenlinDbScanner::Chars

Defined in:
lib/wenlin_db_scanner/chars.rb

Overview

Parses the data in the character (hanzi) databases.

Class Method Details

._hanzi(db_file) ⇒ Enumerator<Hash>

Decoder for a CDL database.

Parameters:

  • db_file (String)

    path to the .db file containing CDL data

Returns:

  • (Enumerator<Hash>)


# File 'lib/wenlin_db_scanner/chars.rb', line 21

def self._hanzi(db_file)
  Enumerator.new do |yielder|
    db = Db.new db_file
    db.records.each do |record|
      next if record.binary?
      xml = REXML::Document.new record.text

      entry = {}
      xml.root.attributes.each do |name, raw_value|
        key = name.to_sym
        entry[key] = cdl_attribute_value key, raw_value
      end

      entry[:parts] = xml.root.elements.map do |element|
        part = { part: element.name.to_sym }
        element.attributes.each do |name, raw_value|
          key = name.to_sym
          part[key] = cdl_attribute_value key, raw_value
        end
        part
      end

      yielder << entry
    end
  end
end
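The decoder above wraps its scan in an Enumerator, so entries are produced one at a time as the caller asks for them, and binary records are skipped. A minimal sketch of that pattern follows; `FakeRecord` and `scan_records` are hypothetical stand-ins invented for this illustration and are not part of the library.

```ruby
# Hypothetical stand-in for a database record. The real record class is
# not shown on this page; only #text and #binary? are assumed here.
FakeRecord = Struct.new(:text, :binary) do
  def binary?
    binary
  end
end

# Mirrors the shape of _hanzi: iterate the records inside Enumerator.new
# and push each decoded entry to the yielder, skipping binary records.
def scan_records(records)
  Enumerator.new do |yielder|
    records.each do |record|
      next if record.binary?
      yielder << { text: record.text }
    end
  end
end

records = [
  FakeRecord.new('first', false),
  FakeRecord.new('raw bytes', true),
  FakeRecord.new('second', false)
]

entries = scan_records(records)
puts entries.first.inspect              # stops after the first text record
puts entries.map { |e| e[:text] }.inspect
```

Because the result is an Enumerator, callers can chain `first`, `take`, or `lazy` without decoding the whole database up front.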

._hz_en(db_file) ⇒ Enumerator<CharMeaning>

Decoder for a database of hanzi -> English meaning entries.

Parameters:

  • db_file (String)

    path to the .db file containing dictionary data

Returns:

  • (Enumerator<CharMeaning>)

# File 'lib/wenlin_db_scanner/chars.rb', line 82

def self._hz_en(db_file)
  Enumerator.new do |yielder|
    db = Db.new db_file
    db.records.each do |record|
      next if record.binary?
      lines = record.text.split("\n").map(&:strip).reject(&:empty?)

      header = lines[0]

      entry = CharMeaning.new
      entry.char = header[0, 1]
      header = header[1..-1]

      entry.pinyin = header.scan(/\[([^\]]*)\]/).
                            map { |match| match.first.strip }
      entry.latin_pinyin =
          entry.pinyin.map { |pinyin| pinyin_to_latin pinyin }
      header.gsub!(/\[[^\]]*\]/, '')
      header.strip!

      header.scan(/\([^\)]+\)/).each do |aside|
        aside_text = aside[1...-1]
        case aside_text[0]
        when '='
          entry.variants = aside_text[1..-1].chars.
              reject { |c| c.codepoints.first < 128 }
          header.gsub! aside, ''
        when '!', '?'
          entry.related ||= []
          entry.related += aside_text[1..-1].chars.
              reject { |c| c.codepoints.first < 128 }
          header.gsub! aside, ''
        when 'F'
          entry.complex_forms = aside_text[1..-1].chars.
              reject { |c| c.codepoints.first < 128 }
          header.gsub! aside, ''
        when 'S'
          entry.simplified_forms = aside_text[1..-1].chars.
              reject { |c| c.codepoints.first < 128 }
          header.gsub! aside, ''
        when 'u', 'U'
          if /^Unihan/i =~ aside_text
            header.gsub! aside, ''
          end
        end
      end
      header.strip!
      # Many definitions start with a (note).
      if note_match = /^\(([^\)]*)\)/.match(header)
        entry.note = note_match[1]
        header = header[note_match[0].length..-1].strip
      end
      entry.meaning = header.gsub(/\s*<hr\s*\/?>\s*/, "\n")

      lines[1..-1].each do |line|
        unless line[0] == ?#
          if entry.note
            entry.note << "/ #{line}"
          else
            entry.note = line
          end
          next
        end

        tag, data = line[1], line[2..-1].strip
        case tag
        when 'c'
          entry.components = data.chars.
              reject { |c| c.codepoints.first < 128 }
        when 'r'
          # NOTE: skipping remarks
        when 'y'
          entry.cantonese = data
        end
      end

      yielder << entry
    end
  end
end
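The header handling in this method can be hard to follow inline, so here is a reduced sketch of the same steps: take the first character, collect the bracketed pinyin readings, then peel off a parenthesized aside. `parse_header` is a hypothetical helper that only handles the `F` (complex forms) aside, and the sample header is made up for illustration rather than taken from a real database.

```ruby
# Simplified, hypothetical version of the header parsing in _hz_en.
# Only the 'F' aside is handled; the real method covers more cases.
def parse_header(header)
  entry = { char: header[0, 1] }
  rest = header[1..-1]

  # Pinyin readings appear in square brackets.
  entry[:pinyin] = rest.scan(/\[([^\]]*)\]/).map { |match| match.first.strip }
  rest = rest.gsub(/\[[^\]]*\]/, '').strip

  # Asides appear in parentheses; the first character selects their kind.
  rest.scan(/\([^\)]+\)/).each do |aside|
    aside_text = aside[1...-1]
    next unless aside_text[0] == 'F'
    # Keep only non-ASCII characters, i.e. the hanzi themselves.
    entry[:complex_forms] =
      aside_text[1..-1].chars.reject { |c| c.codepoints.first < 128 }
    rest = rest.gsub(aside, '')
  end

  entry[:meaning] = rest.strip
  entry
end

puts parse_header('体 [tǐ] (F體) body').inspect
```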

.cdl_attribute_value(key, raw_value) ⇒ Integer, ...

Decodes known attributes for CDL XML elements.

Parameters:

  • key (Symbol)

    the attribute’s name, symbolized

  • raw_value (String)

    the attribute’s value

Returns:

  • (Integer, Array, String)

    a more programmer-friendly value



# File 'lib/wenlin_db_scanner/chars.rb', line 53

def self.cdl_attribute_value(key, raw_value)
  case key
  when :points  # coordinates
    raw_value.split(' ').map do |pair|
      pair.split(',').map { |coord| coord.strip.to_i }
    end
  when :radical  # dictionary radicals?
    raw_value.strip.split(' ').map(&:strip)
  when :type  # stroke type
    raw_value.strip.to_sym
  when :uni  # unicode value
    raw_value.strip.to_i(16)
  else
    raw_value.strip
  end
end
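To make the mapping concrete, the sketch below copies the same `case` logic into a hypothetical `CdlDemo` module so it can be run on its own. The sample attribute values are invented for illustration and do not come from a real CDL database.

```ruby
# Hypothetical demo module; decode mirrors cdl_attribute_value above.
module CdlDemo
  def self.decode(key, raw_value)
    case key
    when :points   # "x,y x,y" -> array of [x, y] integer pairs
      raw_value.split(' ').map do |pair|
        pair.split(',').map { |coord| coord.strip.to_i }
      end
    when :radical  # space-separated list -> array of strings
      raw_value.strip.split(' ').map(&:strip)
    when :type     # stroke type -> symbol
      raw_value.strip.to_sym
    when :uni      # hexadecimal code point -> integer
      raw_value.strip.to_i(16)
    else           # anything else -> trimmed string
      raw_value.strip
    end
  end
end

puts CdlDemo.decode(:points, '0,0 128,128').inspect
puts CdlDemo.decode(:uni, '4E00').inspect
puts CdlDemo.decode(:type, ' h ').inspect
puts CdlDemo.decode(:char, ' x ').inspect
```

Unknown keys fall through to the `else` branch, so new attributes are passed along as trimmed strings rather than raising an error.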

.hanzi(db_root) ⇒ Enumerator<Hash>

The entries in the database that breaks down hanzi into components.

Parameters:

  • db_root (String)

    the directory containing the .db files

Returns:

  • (Enumerator<Hash>)


# File 'lib/wenlin_db_scanner/chars.rb', line 13

def self.hanzi(db_root)
  _hanzi File.join(db_root, 'cdl.db')
end

.hz_en(db_root) ⇒ Enumerator<CharMeaning>

The entries in the hanzi -> English meaning dictionary.

Parameters:

  • db_root (String)

    the directory containing the .db files

Returns:

  • (Enumerator<CharMeaning>)

# File 'lib/wenlin_db_scanner/chars.rb', line 74

def self.hz_en(db_root)
  _hz_en File.join(db_root, 'zidian.db')
end

.pinyin_to_latin(pinyin) ⇒ String

Removes the accents from a pinyin string.

This computes the closest Latin-alphabet string to the given pinyin string: tone marks are dropped and ü becomes v. The result is what users will most likely type when referring to the character, word, or phrase that the pinyin spells.

Parameters:

  • pinyin (String)

    a string that uses pinyin spelling

Returns:

  • (String)

    the closest approximation to the given string that only uses Latin characters



# File 'lib/wenlin_db_scanner/chars.rb', line 172

def self.pinyin_to_latin(pinyin)
  pinyin.tr 'āēīōūǖĀĒĪŌŪǕáéíóúǘÁÉÍÓÚǗǎěǐǒǔǚǍĚǏǑǓǙàèìòùǜÀÈÌÒÙǛüÜ',
            'aeiouvAEIOUVaeiouvAEIOUVaeiouvAEIOUVaeiouvAEIOUVvV'
end
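A standalone sketch of the same transliteration is shown below. `strip_tones` and the two constants are hypothetical names for this illustration. One deliberate difference from the method above: the sketch first normalizes the input to NFC, on the assumption that some input may carry tone marks as combining characters, which `String#tr` would otherwise leave untouched.

```ruby
# Accented pinyin vowels and their plain replacements, in matching order.
PINYIN_ACCENTED =
  'āēīōūǖĀĒĪŌŪǕáéíóúǘÁÉÍÓÚǗǎěǐǒǔǚǍĚǏǑǓǙàèìòùǜÀÈÌÒÙǛüÜ'.freeze
PINYIN_PLAIN =
  'aeiouvAEIOUVaeiouvAEIOUVaeiouvAEIOUVaeiouvAEIOUVvV'.freeze

# Hypothetical variant of pinyin_to_latin. The NFC step is an addition
# made for this sketch: it composes "a" + combining macron into "ā" so
# that the character-by-character tr mapping can match it.
def strip_tones(pinyin)
  pinyin.unicode_normalize(:nfc).tr(PINYIN_ACCENTED, PINYIN_PLAIN)
end

puts strip_tones('nǐ hǎo')
puts strip_tones('lǜ')
```

Mapping ü to v follows the common input-method convention, since v is otherwise unused in pinyin.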