Module: UnicodeNamecode::DataLoader

Defined in: lib/unicode_namecode/data_loader.rb

Overview

Handles data loading, on-disk caching of the Trie, and multi-threaded parsing of UnicodeData.txt.

Constant Summary

DATA_PATH = File.expand_path('../../../data/UnicodeData.txt', __FILE__)

CACHE_PATH = File.expand_path('../../../data/unicode_trie.cache', __FILE__)
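
Both constants resolve relative to the source file, not the working directory. `File.expand_path` treats its second argument as a directory, so the first `..` steps over the file name itself. A small illustration, assuming a hypothetical install location under `/gems/unicode_namecode` (this path is not taken from the source):

```ruby
# Hypothetical absolute location of data_loader.rb, for illustration only.
source_file = '/gems/unicode_namecode/lib/unicode_namecode/data_loader.rb'

# Same relative paths as DATA_PATH and CACHE_PATH, with source_file standing in for __FILE__.
data_path  = File.expand_path('../../../data/UnicodeData.txt', source_file)
cache_path = File.expand_path('../../../data/unicode_trie.cache', source_file)

puts data_path
puts cache_path
```

On a POSIX system both paths land in a `data` directory at the root of the assumed install location, next to `lib`.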

Class Attribute Summary

.all_names ⇒ Object readonly

All complete Unicode names collected from the Trie; used to build the fuzzy matcher.

.codepoint_to_name ⇒ Object readonly

Reverse lookup map from codepoint to upcased Unicode name.

.fuzzy ⇒ Object readonly

FuzzyMatch instance built over all_names.

.trie ⇒ Object readonly

Trie of Unicode names and their codepoints, loaded from the cache or built from UnicodeData.txt.

Class Method Summary

.collect_all_names ⇒ Object

Collect all Unicode names from the Trie for fuzzy matching.

.collect_codepoint_to_name(node, current) ⇒ Object

Build the reverse lookup map: codepoint -> Unicode name.

.collect_names_recursive(node, current, names) ⇒ Object

Recursively traverse the Trie to collect all complete Unicode names.

.load_data ⇒ Object

Main data loading method: handles cache loading and fresh parsing.

Class Attribute Details

.all_names ⇒ `Object` (readonly)

Returns all complete Unicode names collected from the Trie.

# File 'lib/unicode_namecode/data_loader.rb', line 19

def all_names
  @all_names
end

.codepoint_to_name ⇒ `Object` (readonly)

Returns the reverse lookup map from codepoint to upcased Unicode name.

# File 'lib/unicode_namecode/data_loader.rb', line 19

def codepoint_to_name
  @codepoint_to_name
end

.fuzzy ⇒ `Object` (readonly)

Returns the FuzzyMatch instance built over all_names.

# File 'lib/unicode_namecode/data_loader.rb', line 19

def fuzzy
  @fuzzy
end

.trie ⇒ `Object` (readonly)

Returns the Trie, either loaded from the cache or built from UnicodeData.txt.

# File 'lib/unicode_namecode/data_loader.rb', line 19

def trie
  @trie
end

Class Method Details

.collect_all_names ⇒ `Object`

Collects all Unicode names from the Trie for fuzzy matching. Returns an Array of name strings, gathered by walking the Trie from its root node.

# File 'lib/unicode_namecode/data_loader.rb', line 72

def collect_all_names
  names = []
  collect_names_recursive(@trie.instance_variable_get(:@root), "", names)
  names
end

.collect_codepoint_to_name(node, current) ⇒ `Object`

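Builds the reverse lookup map (codepoint -> Unicode name) by walking the Trie from node. Each complete name is upcased and stored in @codepoint_to_name under its codepoint.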

# File 'lib/unicode_namecode/data_loader.rb', line 87

def collect_codepoint_to_name(node, current)
  if node.is_end && node.codepoint
    @codepoint_to_name[node.codepoint] = current.upcase
  end
  node.children.each do |char, child|
    collect_codepoint_to_name(child, current + char)
  end
end

.collect_names_recursive(node, current, names) ⇒ `Object`

Recursively traverses the Trie from node, appending every complete Unicode name to names.

# File 'lib/unicode_namecode/data_loader.rb', line 79

def collect_names_recursive(node, current, names)
  names << current if node.is_end
  node.children.each do |char, child|
    collect_names_recursive(child, current + char, names)
  end
end
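
The three collect_* methods above all perform the same depth-first walk over nodes that expose `children`, `is_end`, and `codepoint`. The sketch below reproduces that walk on a tiny stand-in structure. `SketchNode`, `sketch_insert`, and `sketch_walk` are hypothetical names introduced here for illustration; the library's actual Trie class is not shown in this page and may differ.

```ruby
# Hypothetical stand-in for the library's Trie node; only the three members
# that the collect_* methods read are modelled here.
SketchNode = Struct.new(:children, :is_end, :codepoint) do
  def self.empty
    new({}, false, nil)
  end
end

# Insert a name one character at a time, marking the final node as a complete name.
def sketch_insert(root, name, codepoint)
  node = root
  name.each_char { |ch| node = (node.children[ch] ||= SketchNode.empty) }
  node.is_end = true
  node.codepoint = codepoint
end

# Depth-first walk that yields every complete name together with its codepoint.
def sketch_walk(node, current = '', &block)
  block.call(current, node.codepoint) if node.is_end
  node.children.each { |ch, child| sketch_walk(child, current + ch, &block) }
end

root = SketchNode.empty
sketch_insert(root, 'LATIN SMALL LETTER A', 0x61)
sketch_insert(root, 'LATIN SMALL LETTER B', 0x62)

names = []
codepoint_to_name = {}
sketch_walk(root) do |name, codepoint|
  names << name                        # what collect_names_recursive gathers
  codepoint_to_name[codepoint] = name  # what collect_codepoint_to_name gathers
end

p names
p codepoint_to_name[0x62]
```

The two names share every node up to the final character, which is why the prefix string `current` is rebuilt on the way down instead of being stored in each node.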

.load_data ⇒ `Object`

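Main data loading method. When the cache file at CACHE_PATH exists, the Trie is loaded from it; otherwise UnicodeData.txt is parsed, the Trie is built, and the cache is written. In both cases all_names, fuzzy, and codepoint_to_name are populated.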
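
The cache handling in this method follows a load-or-build pattern around `Marshal`. A minimal sketch of that pattern in isolation, using a temporary directory and a plain Hash in place of the Trie; `load_or_build` is a hypothetical helper, not part of the library:

```ruby
require 'tmpdir'
require 'fileutils'

# Load a Marshal-serialised object from path, or build it with the block
# and write it to path for the next run. Hypothetical helper for illustration.
def load_or_build(path)
  return File.open(path, 'rb') { |f| Marshal.load(f) } if File.exist?(path)

  built = yield
  File.open(path, 'wb') { |f| Marshal.dump(built, f) }
  built
end

dir = Dir.mktmpdir
cache = File.join(dir, 'sketch.cache')
builds = 0

# First call: no cache file yet, so the block runs and the result is written.
first = load_or_build(cache) do
  builds += 1
  { 'LATIN SMALL LETTER A' => 0x61 }
end

# Second call: the cache file exists, so the block is never run.
second = load_or_build(cache) do
  builds += 1
  { 'UNUSED' => 0 }
end

puts builds
puts first == second

FileUtils.remove_entry(dir)
```

Two properties of this pattern are worth keeping in mind: `Marshal.load` should only be given files the application wrote itself, and a cache keyed on existence alone is not refreshed when the underlying data file changes.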

# File 'lib/unicode_namecode/data_loader.rb', line 22

def load_data
  if File.exist?(CACHE_PATH)
    File.open(CACHE_PATH, 'rb') { |f| @trie = Marshal.load(f) }
    @all_names = collect_all_names
    @fuzzy = FuzzyMatch.new(@all_names)
    @codepoint_to_name = {}
    collect_codepoint_to_name(@trie.instance_variable_get(:@root), "")
    return
  end
  
  # First run: parse UnicodeData.txt and build everything from scratch
  @trie = Trie.new
  @codepoint_to_name = {}
  
  # Use parallel parsing to speed up the initial load
  lines = File.readlines(DATA_PATH)
  n_threads = [Etc.nprocessors, 2].max
  chunk_size = (lines.size.to_f / n_threads).ceil
  chunks = lines.each_slice(chunk_size).to_a
  results = Array.new(n_threads) { [] }
  
  # Parse chunks in parallel threads; each thread writes only to its own results slot
  threads = chunks.each_with_index.map do |chunk, idx|
    Thread.new do
      chunk.each do |line|
        fields = line.chomp.split(';')
        codepoint = fields[0]
        name = fields[1]
        # Skip blank lines and placeholder names such as <control>
        next if name.nil? || name =~ /<.*>/

        results[idx] << [name.upcase, codepoint.to_i(16)]
      end
    end
  end

  threads.each(&:join)

  # Insert all parsed data into the Trie and the reverse map from a single thread,
  # so the shared Hash is never mutated concurrently
  results.flatten(1).each do |name, codepoint|
    @trie.insert(name, codepoint)
    @codepoint_to_name[codepoint] = name
  end
  
  # Cache the built Trie for future fast loads
  File.open(CACHE_PATH, 'wb') { |f| Marshal.dump(@trie, f) }
  
  # Build additional data structures
  @all_names = collect_all_names
  @fuzzy = FuzzyMatch.new(@all_names)
end
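
The parsing step can be tried in isolation on a few lines in UnicodeData.txt format. The sketch below assumes two worker threads and a four-line sample instead of the full file; it mirrors the chunking and the angle-bracket filter from load_data but leaves out the Trie and the cache.

```ruby
# Sample lines in UnicodeData.txt format. The first and last carry placeholder
# names in angle brackets, which the filter below skips.
lines = [
  "0000;<control>;Cc;0;BN;;;;;N;NULL;;;;\n",
  "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;\n",
  "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041\n",
  "4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;\n"
]

n_threads = 2 # assumed worker count for this sketch
chunk_size = (lines.size.to_f / n_threads).ceil
chunks = lines.each_slice(chunk_size).to_a
results = Array.new(chunks.size) { [] }

threads = chunks.each_with_index.map do |chunk, idx|
  Thread.new do
    chunk.each do |line|
      # Only the first two semicolon-separated fields are needed here.
      codepoint, name = line.chomp.split(';')
      next if name.nil? || name.match?(/<.*>/)

      # Each thread appends only to its own slot, so no locking is required.
      results[idx] << [name.upcase, codepoint.to_i(16)]
    end
  end
end
threads.each(&:join)

pairs = results.flatten(1)
p pairs
```

Because every thread owns one slot of `results`, the order of `pairs` follows the order of the input lines regardless of thread scheduling. On CRuby the global VM lock prevents threads from executing Ruby code simultaneously, so the speed-up for this kind of CPU-bound parsing may be modest.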
