Module: WenlinDbScanner::Chars

Defined in: lib/wenlin_db_scanner/chars.rb

Overview

Parses the data in the character (hanzi) databases.

Class Method Summary

._hanzi(db_file) ⇒ Enumerator<Hash>

Decoder for a CDL database.
._hz_en(db_file) ⇒ Enumerator<CharMeaning>

Decoder for a database of hanzi -> English meaning entries.
.cdl_attribute_value(key, raw_value) ⇒ Integer, ...

Decodes known attributes for CDL XML elements.
.hanzi(db_root) ⇒ Enumerator<Hash>

The entries in the database that breaks down hanzi into components.
.hz_en(db_root) ⇒ Enumerator<CharMeaning>

The entries in the hanzi -> English meaning dictionary.
.pinyin_to_latin(pinyin) ⇒ String

Removes the accents from a pinyin string.

Class Method Details

._hanzi(db_file) ⇒ `Enumerator<Hash>`

Decoder for a CDL database.

Parameters:

db_file (String) —

path to the .db file containing CDL data

Returns:

(Enumerator<Hash>)

# File 'lib/wenlin_db_scanner/chars.rb', line 21

def self._hanzi(db_file)
  Enumerator.new do |yielder|
    db = Db.new db_file
    db.records.each do |record|
      next if record.binary?
      xml = REXML::Document.new record.text

      entry = {}
      xml.root.attributes.each do |name, raw_value|
        key = name.to_sym
        entry[key] = cdl_attribute_value key, raw_value
      end

      entry[:parts] = xml.root.elements.map do |element|
        part = { part: element.name.to_sym }
        element.attributes.each do |name, raw_value|
          key = name.to_sym
          part[key] = cdl_attribute_value key, raw_value
        end
        part
      end

      yielder << entry
    end
  end
end
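
To illustrate the shape of the Hash that the decoder builds, here is a minimal sketch that applies the same REXML traversal to a single record. The sample XML is hypothetical (the `cdl` and `stroke` element names and all values are assumptions, not taken from a real database), and the helper `cdl_record_to_hash` is not part of the library. For brevity it keeps the raw attribute strings rather than passing them through `cdl_attribute_value`.

```ruby
require 'rexml/document'

# Hypothetical CDL-like record; element names and values are made up.
SAMPLE_CDL =
  '<cdl char="一" uni="4E00"><stroke type="h" points="0,64 128,64"/></cdl>'

# Hypothetical helper mirroring the traversal in _hanzi.
# Attribute values are kept as raw strings here.
def cdl_record_to_hash(text)
  xml = REXML::Document.new text

  # Root attributes become top-level keys.
  entry = {}
  xml.root.attributes.each { |name, raw_value| entry[name.to_sym] = raw_value }

  # Each child element becomes one Hash in :parts, tagged with its name.
  entry[:parts] = xml.root.elements.map do |element|
    part = { part: element.name.to_sym }
    element.attributes.each { |name, raw_value| part[name.to_sym] = raw_value }
    part
  end
  entry
end

SAMPLE_ENTRY = cdl_record_to_hash(SAMPLE_CDL)
```

In this sketch `SAMPLE_ENTRY[:parts]` holds one Hash per child element, so a consumer can tell strokes from other parts by checking the `:part` key.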

._hz_en(db_file) ⇒ `Enumerator<CharMeaning>`

Decoder for a database of hanzi -> English meaning entries.

Parameters:

db_file (String) —

path to the .db file containing dictionary data

Returns:

(Enumerator<CharMeaning>)

# File 'lib/wenlin_db_scanner/chars.rb', line 82

def self._hz_en(db_file)
  Enumerator.new do |yielder|
    db = Db.new db_file
    db.records.each do |record|
      next if record.binary?
      lines = record.text.split("\n").map(&:strip).reject(&:empty?)

      header = lines[0]

      entry = CharMeaning.new
      entry.char = header[0, 1]
      header = header[1..-1]

      entry.pinyin = header.scan(/\[([^\]]*)\]/).
                            map { |match| match.first.strip }
      entry.latin_pinyin =
          entry.pinyin.map { |pinyin| pinyin_to_latin pinyin }
      header.gsub!(/\[[^\]]*\]/, '')
      header.strip!

      header.scan(/\([^\)]+\)/).each do |aside|
        aside_text = aside[1...-1]
        case aside_text[0]
        when '='
          entry.variants = aside_text[1..-1].chars.
              reject { |c| c.codepoints.first < 128 }
          header.gsub! aside, ''
        when '!', '?'
          entry.related ||= []
          entry.related += aside_text[1..-1].chars.
              reject { |c| c.codepoints.first < 128 }
          header.gsub! aside, ''
        when 'F'
          entry.complex_forms = aside_text[1..-1].chars.
              reject { |c| c.codepoints.first < 128 }
          header.gsub! aside, ''
        when 'S'
          entry.simplified_forms = aside_text[1..-1].chars.
              reject { |c| c.codepoints.first < 128 }
          header.gsub! aside, ''
        when 'u', 'U'
          if /^Unihan/i =~ aside_text
            header.gsub! aside, ''
          end
        end
      end
      header.strip!
      # Many definitions start with a (note).
      if note_match = /^\(([^\)]*)\)/.match(header)
        entry.note = note_match[1]
        header = header[note_match[0].length..-1].strip
      end
      entry.meaning = header.gsub(/\s*<hr\s*\/?>\s*/, "\n")

      lines[1..-1].each do |line|
        unless line[0] == ?#
          if entry.note
            entry.note << "/ #{line}"
          else
            entry.note = line
          end
          next
        end

        tag, data = line[1], line[2..-1].strip
        case tag  # dispatch on the tag character, not the literal string 'tag'
        when 'c'
          entry.components = data.chars.
              reject { |c| c.codepoints.first < 128 }
        when 'r'
          # NOTE: skipping remarks
        when 'y'
          entry.cantonese = data
        end
      end

      yielder << entry
    end
  end
end
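
The header handling above can be sketched in isolation. The example below is a simplified, standalone approximation: it extracts the character, the bracketed pinyin readings, and the `(=...)` variants aside, then treats the rest as the meaning. The helper name `parse_header` and the sample header line are hypothetical, and the other aside types (`!`, `?`, `F`, `S`, Unihan) are deliberately left out.

```ruby
# Hypothetical helper: a reduced version of the header parsing in _hz_en.
def parse_header(line)
  # The first character of the header is the hanzi itself.
  char = line[0, 1]
  header = line[1..-1]

  # Pinyin readings are enclosed in square brackets.
  pinyin = header.scan(/\[([^\]]*)\]/).map { |match| match.first.strip }
  header = header.gsub(/\[[^\]]*\]/, '').strip

  # Only the (=...) aside is handled in this sketch.
  variants = nil
  header.scan(/\([^\)]+\)/).each do |aside|
    aside_text = aside[1...-1]
    next unless aside_text[0] == '='
    # Keep only non-ASCII characters, dropping spaces and punctuation.
    variants = aside_text[1..-1].chars.reject { |c| c.codepoints.first < 128 }
    header = header.gsub(aside, '')
  end

  { char: char, pinyin: pinyin, variants: variants, meaning: header.strip }
end

# Made-up header line, for illustration only.
PARSED_HEADER = parse_header('发[fā] [fà] (=發髮) send out; hair')
```

With this input, the two bracketed readings end up in `:pinyin` and the text left over after removing brackets and the aside becomes `:meaning`.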

.cdl_attribute_value(key, raw_value) ⇒ `Integer`, ...

Decodes known attributes for CDL XML elements.

Parameters:

key (Symbol) —

the attribute’s name, symbolized
raw_value (String) —

the attribute’s value

Returns:

(Integer, Array, String, Symbol) —

a more programmer-friendly value

# File 'lib/wenlin_db_scanner/chars.rb', line 53

def self.cdl_attribute_value(key, raw_value)
  case key
  when :points  # coordinates
    raw_value.split(' ').map do |pair|
      pair.split(',').map { |coord| coord.strip.to_i }
    end
  when :radical  # dictionary radicals?
    raw_value.strip.split(' ').map(&:strip)
  when :type  # stroke type
    raw_value.strip.to_sym
  when :uni  # unicode value
    raw_value.strip.to_i(16)
  else
    raw_value.strip
  end
end
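
As a rough illustration of the conversions performed for each known key, the sketch below applies the same string operations to hand-written inputs. The raw strings are hypothetical and are not taken from an actual CDL database; the `DECODED` constant exists only for this example.

```ruby
# Hypothetical raw attribute strings, converted the same way as in
# cdl_attribute_value.
DECODED = {
  # :points => array of [x, y] integer pairs
  points: '0,64 128,64'.split(' ').map do |pair|
    pair.split(',').map { |coord| coord.strip.to_i }
  end,
  # :radical => array of strings
  radical: ' 1 2 '.strip.split(' ').map(&:strip),
  # :type => symbol
  type: ' h '.strip.to_sym,
  # :uni => integer parsed from hexadecimal
  uni: '4E00'.strip.to_i(16)
}

# The decoded code point can be turned back into a character.
DECODED_CHAR = [DECODED[:uni]].pack('U')
```

Any key not listed in the `case` statement falls through to the `else` branch and is returned as a stripped String.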

.hanzi(db_root) ⇒ `Enumerator<Hash>`

The entries in the database that breaks down hanzi into components.

Parameters:

db_root (String) —

the directory containing the .db files

Returns:

(Enumerator<Hash>)



# File 'lib/wenlin_db_scanner/chars.rb', line 13

def self.hanzi(db_root)
  _hanzi File.join(db_root, 'cdl.db')
end

.hz_en(db_root) ⇒ `Enumerator<CharMeaning>`

The entries in the hanzi -> English meaning dictionary.

Parameters:

db_root (String) —

the directory containing the .db files

Returns:

(Enumerator<CharMeaning>)



# File 'lib/wenlin_db_scanner/chars.rb', line 74

def self.hz_en(db_root)
  _hz_en File.join(db_root, 'zidian.db')
end

.pinyin_to_latin(pinyin) ⇒ `String`

Removes the accents from a pinyin string.

This computes the closest Latin alphabet string matching the given pinyin string. It is what users will most likely type to refer to the character, word or phrase inside the pinyin-spelling string.

Parameters:

pinyin (String) —

a string that uses pinyin spelling

Returns:

(String) —

the closest approximation to the given string that only uses Latin characters

# File 'lib/wenlin_db_scanner/chars.rb', line 172

def self.pinyin_to_latin(pinyin)
  pinyin.tr 'āēīōūǖĀĒĪŌŪǕáéíóúǘÁÉÍÓÚǗǎěǐǒǔǚǍĚǏǑǓǙàèìòùǜÀÈÌÒÙǛüÜ',
            'aeiouvAEIOUVaeiouvAEIOUVaeiouvAEIOUVaeiouvAEIOUVvV'
end
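
A short standalone sketch of the same `String#tr` technique follows, so that the mapping can be tried without loading the library. The method name `strip_pinyin_accents` and the two constants are hypothetical; the character tables are copied from the listing above.

```ruby
# Accented pinyin vowels and their plain Latin replacements, position by
# position. Both tables must have the same number of characters.
PINYIN_ACCENTED =
  'āēīōūǖĀĒĪŌŪǕáéíóúǘÁÉÍÓÚǗǎěǐǒǔǚǍĚǏǑǓǙàèìòùǜÀÈÌÒÙǛüÜ'
PINYIN_PLAIN =
  'aeiouvAEIOUVaeiouvAEIOUVaeiouvAEIOUVaeiouvAEIOUVvV'

# Hypothetical standalone equivalent of pinyin_to_latin.
def strip_pinyin_accents(pinyin)
  pinyin.tr PINYIN_ACCENTED, PINYIN_PLAIN
end

LATIN_SAMPLES = [
  strip_pinyin_accents('Hànyǔ Pīnyīn'),
  strip_pinyin_accents('nǚ')
]
```

Note that every form of ü maps to `v`, so `nǚ` becomes `nv`. This follows the common keyboard convention of typing `v` in place of ü.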
