Class: PureCDB::Writer

Inherits:
Base
  • Object
Defined in:
lib/purecdb/writer.rb

Overview

Writes 32-bit or 64-bit CDB files.

Memory considerations

While each entry is written to the target object immediately on calling #store, the hash tables cannot be written until the full dataset is ready. While building the CDB file, you must therefore be able to hold in memory the hash of each key (including duplicates) and the file position at which the full entry is stored.

It would be possible to write this to a temporary file at the cost of performance, but the current implementation does not do this.

As a compromise, the current implementation stores the hashes and positions as one BER-encoded string per hash bucket until it is ready to write them to disk.
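To illustrate the compromise described above, here is a minimal sketch of packing (hash, position) pairs into a single BER-compressed string with the "w" directive of Array#pack, the same directive #store uses. The sample values are made up for illustration and do not come from a real CDB file.

```ruby
# Hypothetical sample data: [hash, file position] pairs for one bucket.
pairs = [[177_604, 2048], [5381, 2071]]

# Append each pair as two BER-compressed integers ("w" directive).
bucket = "".b
pairs.each { |h, pos| bucket << [h, pos].pack("ww") }

# Decode the whole bucket and regroup the integers into pairs.
decoded = bucket.unpack("w*").each_slice(2).to_a

puts bucket.bytesize
p decoded
```

BER compression stores 7 bits of payload per byte, so small hashes and positions take fewer bytes than fixed-width integers would.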

Constant Summary

Constants inherited from Base

Base::CDB64_MAGIC, Base::DEFAULT_HASHPTR_SIZE, Base::DEFAULT_LENGTH_SIZE, Base::DEFAULT_NUM_HASHES

Instance Attribute Summary

Attributes inherited from Base

#hashptr_size, #length_size, #mode, #num_hashes

Class Method Summary

Instance Method Summary

Methods inherited from Base

#hash, #hash_size, #hashref_size, #set_mode, #set_stream

Constructor Details

#initialize(target, *options) ⇒ Writer

Open a CDB file for writing, or prepare an IO-like object for writing.

Call sequence:

w = PureCDB::Writer.new(target)
w = PureCDB::Writer.new(target, *options)
PureCDB::Writer.new(target)  {|w| ... }
PureCDB::Writer.new(target, *options) {|w| ... }

If :mode is passed in options, it must be the integer 32 or 64, indicating whether to write a standard (32-bit) CDB file or a 64-bit CDB-like file. The default is 32.

If target is a String it is treated as a filename of a file to be opened to write to. Otherwise target is assumed to be an IO-like object that ideally responds to #sysseek and #syswrite. If it doesn’t, it will be wrapped with an object delegating #sysseek and #syswrite to #seek and #write respectively, and these must be present.

(IO and StringIO both satisfy these requirements)
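The delegation described above can be sketched as follows. SeekWriteAdapter is a hypothetical name chosen for this illustration, not the library's own wrapper class; it only shows the idea of forwarding #sysseek and #syswrite to #seek and #write.

```ruby
require "stringio"

# Hypothetical adapter (not the library's own class): forwards
# #sysseek and #syswrite to the wrapped object's #seek and #write.
class SeekWriteAdapter
  def initialize(target)
    @target = target
  end

  def sysseek(offset, whence = IO::SEEK_SET)
    @target.seek(offset, whence)
  end

  def syswrite(data)
    @target.write(data)
  end
end

buffer = StringIO.new(+"")
io = SeekWriteAdapter.new(buffer)
io.syswrite("abc")
io.sysseek(0)
io.syswrite("X") # overwrites the first byte of the buffer
```

Seeking back to the start and overwriting matters here because the writer reserves space for the hash pointers first and fills them in on #close.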

If passed a block, the writer is yielded to the block and PureCDB::Writer#close is called afterwards.

WARNING: To complete writing the hash tables, you must ensure #close is called when you are done.



# File 'lib/purecdb/writer.rb', line 51

def initialize target, *options
  super *options

  @hash_fill_factor = 0.7

  set_mode(32) if @mode == :detect

  if target.is_a?(String)
    @io = File.new(target,"wb")
  else
    set_stream(target)
  end

  @hashes = [nil] * num_hashes

  @hashptrs = [0] * num_hashes * 2
  write_hashptrs

  @pos = hash_size

  if block_given?
    yield(self)
    close
    nil
  else
    self
  end
end

Instance Attribute Details

#hash_fill_factor ⇒ Object

How full any given hash table is allowed to get, as a float between 0 and 1.

Needs to be <= 1. The lower it is, the fewer records will collide. The closer it is to 1, the more often the reader may have to probe to find the right entry, which can be lengthy (in the worst case, scanning all the records).
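As a rough illustration of the trade-off, the sketch below computes how many slots a hash table would need for a given number of entries at a given fill factor. slots_for is a hypothetical helper; the sizing rule the library actually applies is not shown on this page, so treat the formula as an assumption.

```ruby
# Hypothetical helper: number of slots to allocate for a hash table
# holding `entries` records at a given fill factor (0 < factor <= 1).
# Assumption: slots = ceil(entries / fill_factor).
def slots_for(entries, fill_factor = 0.7)
  unless fill_factor > 0 && fill_factor <= 1
    raise ArgumentError, "fill factor must be in (0, 1]"
  end
  (entries / fill_factor).ceil
end

puts slots_for(100)      # default fill factor of 0.7
puts slots_for(100, 0.5) # half-full table, fewer collisions
puts slots_for(100, 1.0) # completely full table, most probing
```

A lower fill factor buys shorter probe sequences at the cost of a larger file.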



# File 'lib/purecdb/writer.rb', line 24

def hash_fill_factor
  @hash_fill_factor
end

Class Method Details

.open(target, *options, &block) ⇒ Object

Alternative to PureCDB::Writer.new(target, *options).



# File 'lib/purecdb/writer.rb', line 109

def self.open target, *options, &block
  Writer.new(target, *options, &block)
end

Instance Method Details

#close ⇒ Object

Write out the hashes and hash pointers, and close the target if it responds to #close.



# File 'lib/purecdb/writer.rb', line 82

def close
  write_hashes
  write_hashptrs
  @io.close if @io.respond_to?(:close)
end

#store(key, value) ⇒ Object

Store ‘value’ under ‘key’.

Multiple values can be stored for the same key by calling #store multiple times with the same key.



# File 'lib/purecdb/writer.rb', line 92

def store key,value
  # In an attempt to save memory, we pack the hash data we gather into
  # strings of BER compressed integers...
  h = hash(key)
  hi = (h % num_hashes)
  @hashes[hi] ||= ""

  header = build_header(key.length, value.length)
  @io.syswrite(header+key+value)
  size = header.size + key.size + value.size
  @hashes[hi] += [h,@pos].pack("ww") # BER compressed
  @pos += size
end
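The bookkeeping in #store can be sketched in isolation. This is a minimal sketch assuming the standard 32-bit CDB format: the classic CDB hash function, 256 hash tables, and a record header of two little-endian 32-bit lengths. cdb_hash, build_record and NUM_HASHES are hypothetical names for this illustration; the library's own #hash and build_header are not reproduced on this page and may differ, particularly in 64-bit mode.

```ruby
NUM_HASHES = 256 # assumed number of hash tables (standard CDB)

# Standard CDB hash: start at 5381, then h = (h * 33) ^ byte,
# truncated to 32 bits after every step.
def cdb_hash(key)
  key.each_byte.reduce(5381) { |h, c| (((h << 5) + h) ^ c) & 0xffffffff }
end

# Assumed 32-bit record layout: key length and value length as
# little-endian 32-bit integers, followed by the key and the value.
def build_record(key, value)
  [key.bytesize, value.bytesize].pack("VV") + key.b + value.b
end

key = "a"
value = "apple"
h = cdb_hash(key)
bucket = h % NUM_HASHES # which hash table the entry belongs to
record = build_record(key, value)

puts "hash=#{h} bucket=#{bucket} record_bytes=#{record.bytesize}"
```

The sketch measures lengths in bytes (String#bytesize), which keeps the header consistent with what is written to disk when keys or values contain multibyte characters.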