Class: Moab::FileSignature
- Inherits: Serializer::Serializable
  - Object
    - Serializer::Serializable
      - Moab::FileSignature
- Includes: HappyMapper
- Defined in: lib/moab/file_signature.rb
Overview
Copyright © 2012 by The Board of Trustees of the Leland Stanford Junior University. All rights reserved. See LICENSE for details.
The fixity properties of a file, used to determine file content equivalence regardless of filename. Placing this data in a class by itself facilitates using file size together with the MD5 and SHA1 checksums as a single key when doing comparisons against other file instances. The Moab design assumes that this file signature is sufficiently unique to act as a comparator for determining file equality and eliminating file redundancy.
The use of signatures for a compare-by-hash mechanism introduces a minuscule (but non-zero) risk that two non-identical files will have the same checksum. While this risk is only about 1 in 10^48 when using the SHA1 checksum alone, it can be reduced even further (to about 1 in 10^86) if we use the MD5 and SHA1 checksums together. We gain a bit more comfort by also comparing file sizes.
Finally, the “collision” risk is reduced by isolation of each digital object’s file pool within an object folder, instead of in a common storage area shared by the whole repository.
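The figures quoted above correspond to the sizes of the digest value spaces: SHA1 produces 160 bits and MD5 produces 128 bits, so the combined key has 288 bits. A minimal sketch of that arithmetic, using only Ruby's built-in integers (all names here are illustrative, not part of Moab):

```ruby
# Standard digest widths in bits.
SHA1_BITS = 160
MD5_BITS  = 128

# Number of distinct values each key can take.
sha1_space     = 2**SHA1_BITS
combined_space = 2**(SHA1_BITS + MD5_BITS)

# Power of ten of each space, taken from the decimal digit count.
sha1_exponent     = sha1_space.to_s.length - 1
combined_exponent = combined_space.to_s.length - 1

puts "SHA1 alone: about 1 in 10^#{sha1_exponent}"
puts "MD5 + SHA1: about 1 in 10^#{combined_exponent}"
```

These are orders of magnitude for the key space only; they say nothing about deliberately constructed collisions.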
Data Model
- FileInventory = container for recording information about a collection of related files
  - FileGroup [1..*] = subset that allows segregation of content and metadata files
    - FileManifestation [1..*] = snapshot of a file’s filesystem characteristics
      - FileSignature [1] = file fixity information
      - FileInstance [1..*] = filepath and timestamp of any physical file having that signature
- SignatureCatalog = lookup table containing a cumulative collection of all files ever ingested
  - SignatureCatalogEntry [1..*] = a row in the lookup table containing storage information about a single file
    - FileSignature [1] = file fixity information
- FileInventoryDifference = compares two FileInventory instances based on file signatures and pathnames
  - FileGroupDifference [1..*] = performs analysis and reports differences between two matching FileGroup objects
    - FileGroupDifferenceSubset [1..5] = collects a set of file-level differences of a given change type
      - FileInstanceDifference [1..*] = contains difference information at the file level
        - FileSignature [1..2] = contains the file signature(s) of two file instances being compared
Constant Summary
- KNOWN_ALGOS =
{ md5: proc { Digest::MD5.new }, sha1: proc { Digest::SHA1.new }, sha256: proc { Digest::SHA2.new(256) } }.freeze
Instance Attribute Summary
-
#md5 ⇒ String
The MD5 checksum value of the file.
-
#sha1 ⇒ String
The SHA1 checksum value of the file.
-
#sha256 ⇒ String
The SHA256 checksum value of the file.
-
#size ⇒ Integer
The size of the file in bytes.
Class Method Summary
- .active_algos ⇒ Object
-
.checksum_names_for_type ⇒ Hash<Symbol,String>
Key is type (e.g. :sha1), value is checksum names (e.g. [‘SHA-1’, ‘SHA1’]).
-
.checksum_type_for_name ⇒ Hash<String, Symbol>
Key is checksum name (e.g. MD5), value is checksum type (e.g. :md5).
-
.from_file(pathname, algos_to_use = active_algos) ⇒ Moab::FileSignature
Reads the file once for all requested algorithms, rather than once per algorithm.
Instance Method Summary
-
#==(other) ⇒ Object
(see #eql?).
-
#checksums ⇒ Hash<Symbol,String>
A hash of the checksum data.
-
#complete? ⇒ Boolean
True if the signature contains all three desired checksums.
-
#eql?(other) ⇒ Boolean
Returns true if self and other have comparable fixity data.
-
#fixity ⇒ Hash<Symbol => String>
A hash of fixity data from this signature object.
-
#hash ⇒ Fixnum
Compute a hash-code for the fixity value array.
-
#normalized_signature(pathname) ⇒ FileSignature
The full signature derived from the file, unless the fixity is inconsistent with current values.
-
#set_checksum(type, value) ⇒ void
Set the value of the specified checksum type.
-
#signature_from_file(pathname) ⇒ FileSignature
Deprecated. Use the class method .from_file instead.
Methods inherited from Serializer::Serializable
#array_to_hash, deep_diff, #diff, #initialize, #key, #key_name, #summary, #to_hash, #to_json, #to_yaml, #variable_names, #variables
Constructor Details
This class inherits a constructor from Serializer::Serializable
Instance Attribute Details
#md5 ⇒ String
Returns the MD5 checksum value of the file.
# File 'lib/moab/file_signature.rb', line 51
attribute :md5, String, :on_save => proc { |n| n.nil? ? "" : n.to_s }
#sha1 ⇒ String
Returns the SHA1 checksum value of the file.
# File 'lib/moab/file_signature.rb', line 55
attribute :sha1, String, :on_save => proc { |n| n.nil? ? "" : n.to_s }
#sha256 ⇒ String
Returns the SHA256 checksum value of the file.
# File 'lib/moab/file_signature.rb', line 59
attribute :sha256, String, :on_save => proc { |n| n.nil? ? "" : n.to_s }
#size ⇒ Integer
Returns the size of the file in bytes.
# File 'lib/moab/file_signature.rb', line 47
attribute :size, Integer, :on_save => proc { |n| n.to_s }
Class Method Details
.active_algos ⇒ Object
# File 'lib/moab/file_signature.rb', line 67
def self.active_algos
  Moab::Config.checksum_algos
end
.checksum_names_for_type ⇒ Hash<Symbol,String>
Returns a hash whose key is the checksum type (e.g. :sha1) and whose value is the list of accepted checksum names (e.g. ['SHA-1', 'SHA1']).
# File 'lib/moab/file_signature.rb', line 185
def self.checksum_names_for_type
  {
    md5: ['MD5'],
    sha1: ['SHA-1', 'SHA1'],
    sha256: ['SHA-256', 'SHA256']
  }
end
.checksum_type_for_name ⇒ Hash<String, Symbol>
Returns a hash whose key is the checksum name (e.g. MD5) and whose value is the checksum type (e.g. :md5).
# File 'lib/moab/file_signature.rb', line 194
def self.checksum_type_for_name
  type_for_name = {}
  checksum_names_for_type.each do |type, names|
    names.each { |name| type_for_name[name] = type }
  end
  type_for_name
end
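The inversion above turns the type-to-names table into a name-to-type lookup, which is useful when a checksum label arrives from external metadata. A standalone sketch of the same idea, with the table copied from the listing; the constants and the `checksum_type` helper are hypothetical and not part of Moab:

```ruby
# Copy of the mapping returned by .checksum_names_for_type.
NAMES_FOR_TYPE = {
  md5: ['MD5'],
  sha1: ['SHA-1', 'SHA1'],
  sha256: ['SHA-256', 'SHA256']
}.freeze

# Invert it: every accepted name points back to its type symbol.
TYPE_FOR_NAME = NAMES_FOR_TYPE.each_with_object({}) do |(type, names), lookup|
  names.each { |name| lookup[name] = type }
end

# Hypothetical helper: resolve a label, failing loudly on unknown names.
def checksum_type(label)
  TYPE_FOR_NAME.fetch(label) { raise ArgumentError, "Unknown checksum name '#{label}'" }
end

puts checksum_type('SHA-1')   # prints sha1
puts checksum_type('SHA256')  # prints sha256
```

Note that the lookup is case-sensitive, so a label such as 'sha1' would not match any of the names listed.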
.from_file(pathname, algos_to_use = active_algos) ⇒ Moab::FileSignature
Reads the file once for all requested algorithms, rather than once per algorithm.
# File 'lib/moab/file_signature.rb', line 75
def self.from_file(pathname, algos_to_use = active_algos)
  raise 'Unrecognized algorithm requested' unless algos_to_use.all? { |a| KNOWN_ALGOS.include?(a) }

  signatures = algos_to_use.map { |k| [k, KNOWN_ALGOS[k].call] }.to_h

  pathname.open("r") do |stream|
    while (buffer = stream.read(8192))
      signatures.each_value { |digest| digest.update(buffer) }
    end
  end

  new(signatures.map { |k, digest| [k, digest.hexdigest] }.to_h.merge(size: pathname.size))
end
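The single-pass technique can be tried without the gem. The sketch below mirrors the loop above using only the Ruby standard library; `ALGOS` and `fixity_for` are hypothetical names, the helper returns a plain Hash rather than a FileSignature, and opening the file in binary mode ('rb') is an assumption of this sketch rather than what the listing does:

```ruby
require 'digest'
require 'pathname'
require 'tempfile'

# Same shape as KNOWN_ALGOS: each entry builds a fresh digest object.
ALGOS = {
  md5: proc { Digest::MD5.new },
  sha1: proc { Digest::SHA1.new },
  sha256: proc { Digest::SHA2.new(256) }
}.freeze

# Hypothetical helper: compute the requested digests in one read of the file.
def fixity_for(pathname, algos = ALGOS.keys)
  digests = algos.to_h { |algo| [algo, ALGOS.fetch(algo).call] }
  pathname.open('rb') do |stream|
    # Each 8 KiB chunk is fed to every digest, so the file is read only once.
    while (buffer = stream.read(8192))
      digests.each_value { |digest| digest.update(buffer) }
    end
  end
  digests.transform_values(&:hexdigest).merge(size: pathname.size)
end

Tempfile.create('fixity-demo') do |file|
  file.write('hello world')
  file.flush
  p fixity_for(Pathname.new(file.path), %i[md5 sha1])
end
```

Feeding each buffer to every digest keeps I/O cost constant as algorithms are added; only CPU time grows.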
Instance Method Details
#==(other) ⇒ Object
(see #eql?)
# File 'lib/moab/file_signature.rb', line 143
def ==(other)
  eql?(other)
end
#checksums ⇒ Hash<Symbol,String>
Returns a hash of the checksum data, omitting any checksum that is nil or empty.
# File 'lib/moab/file_signature.rb', line 106
def checksums
  {
    md5: md5,
    sha1: sha1,
    sha256: sha256
  }.reject { |_key, value| value.nil? || value.empty? }
end
#complete? ⇒ Boolean
Returns true if the signature contains all three desired checksums.
# File 'lib/moab/file_signature.rb', line 115
def complete?
  checksums.size == 3
end
#eql?(other) ⇒ Boolean
Returns true if self and other have comparable fixity data.
# File 'lib/moab/file_signature.rb', line 128
def eql?(other)
  return false unless other.respond_to?(:size) && other.respond_to?(:checksums)
  return false if size.to_i != other.size.to_i
  self_checksums = checksums
  other_checksums = other.checksums
  matching_keys = self_checksums.keys & other_checksums.keys
  return false if matching_keys.empty?
  matching_keys.each do |key|
    return false if self_checksums[key] != other_checksums[key]
  end
  true
end
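"Comparable" here means that the sizes match and every checksum type present on both sides agrees; types present on only one side are ignored, and two signatures with no type in common are never equal. A standalone sketch of that rule, where `Sig` and `same_content?` are hypothetical stand-ins and the checksum values are placeholders:

```ruby
# Hypothetical stand-in for FileSignature; only the comparison logic is mirrored.
Sig = Struct.new(:size, :md5, :sha1, :sha256, keyword_init: true) do
  # Checksums that are actually populated.
  def checksums
    { md5: md5, sha1: sha1, sha256: sha256 }.reject { |_k, v| v.nil? || v.empty? }
  end

  # Same rule as #eql?: equal sizes, at least one shared type, all shared types agree.
  def same_content?(other)
    return false if size.to_i != other.size.to_i

    shared = checksums.keys & other.checksums.keys
    return false if shared.empty?

    shared.all? { |key| checksums[key] == other.checksums[key] }
  end
end

full     = Sig.new(size: 11, md5: 'aaa', sha1: 'bbb', sha256: 'ccc')
md5_only = Sig.new(size: 11, md5: 'aaa')
sha_only = Sig.new(size: 11, sha1: 'bbb')

puts full.same_content?(md5_only)      # true: sizes match and the shared md5 agrees
puts md5_only.same_content?(sha_only)  # false: no checksum type in common
```

One consequence worth keeping in mind is that a partial signature can match a full one while two partial signatures of the same file do not match each other.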
#fixity ⇒ Hash<Symbol => String>
Returns a hash of fixity data from this signature object.
# File 'lib/moab/file_signature.rb', line 121
def fixity
  { size: size.to_s }.merge(checksums)
end
#hash ⇒ Fixnum
Returns a hash code for the signature, which the listing derives from the file size alone. Two file instances with the same content will have the same hash code (and will compare as equal using eql?).
# File 'lib/moab/file_signature.rb', line 155
def hash
  @size.to_i
end
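Ruby's Hash uses #hash to pick a bucket and #eql? to confirm a match, so a size-only hash code is sufficient as long as equal signatures always have equal sizes. The sketch below shows that pairing with a hypothetical `SizeKeyedSig` class (simplified to a single checksum); the paths are made-up values:

```ruby
# Hypothetical stand-in for FileSignature with the same hash/eql? pairing.
class SizeKeyedSig
  attr_reader :size, :md5

  def initialize(size, md5)
    @size = size
    @md5 = md5
  end

  # Coarse bucket: only the size contributes.
  def hash
    @size.to_i
  end

  # Fine check inside a bucket.
  def eql?(other)
    other.is_a?(SizeKeyedSig) && size.to_i == other.size.to_i && md5 == other.md5
  end
  alias_method :==, :eql?
end

catalog = {}
catalog[SizeKeyedSig.new(11, 'aaa')] = 'v0001/data/a.txt'
catalog[SizeKeyedSig.new(11, 'zzz')] = 'v0001/data/b.txt' # same bucket, different key

puts catalog[SizeKeyedSig.new(11, 'aaa')] # v0001/data/a.txt
puts catalog.size                         # 2
```

Files of equal size land in the same bucket, so lookups among many same-sized files fall back to repeated eql? calls.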
#normalized_signature(pathname) ⇒ FileSignature
Returns the full signature derived from the file, unless the fixity is inconsistent with current values, in which case an exception is raised.
# File 'lib/moab/file_signature.rb', line 176
def normalized_signature(pathname)
  sig_from_file = FileSignature.new.signature_from_file(pathname)
  return sig_from_file if eql?(sig_from_file)
  # The full signature from file is consistent with current values, or...
  # One or more of the fixity values is inconsistent, so raise an exception
  raise "Signature inconsistent between inventory and file for #{pathname}: #{diff(sig_from_file).inspect}"
end
#set_checksum(type, value) ⇒ void
Sets the value of the specified checksum type. This method returns an undefined value.
# File 'lib/moab/file_signature.rb', line 92
def set_checksum(type, value)
  case type.to_s.downcase.to_sym
  when :md5
    @md5 = value
  when :sha1
    @sha1 = value
  when :sha256
    @sha256 = value
  else
    raise ArgumentError, "Unknown checksum type '#{type}'"
  end
end
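The type argument is normalized with to_s.downcase.to_sym, so strings and symbols in any letter case are accepted, but hyphenated names such as 'SHA-1' are not. A standalone sketch that copies the dispatch into a hypothetical `ChecksumHolder` class to show both outcomes:

```ruby
# Hypothetical stand-in that copies the dispatch in #set_checksum.
class ChecksumHolder
  attr_reader :md5, :sha1, :sha256

  def set_checksum(type, value)
    case type.to_s.downcase.to_sym
    when :md5 then @md5 = value
    when :sha1 then @sha1 = value
    when :sha256 then @sha256 = value
    else raise ArgumentError, "Unknown checksum type '#{type}'"
    end
  end
end

holder = ChecksumHolder.new
holder.set_checksum('MD5', 'aaa') # String, upper case
holder.set_checksum(:sha1, 'bbb') # Symbol
puts holder.md5  # aaa
puts holder.sha1 # bbb

begin
  holder.set_checksum('SHA-1', 'bbb') # the hyphen survives normalization
rescue ArgumentError => e
  puts e.message # Unknown checksum type 'SHA-1'
end
```

Labels in the hyphenated form would need to be translated to a type first, for example through the table returned by .checksum_type_for_name.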
#signature_from_file(pathname) ⇒ FileSignature
Deprecated. This method is a holdover from an earlier version. Use the class method .from_file going forward.
# File 'lib/moab/file_signature.rb', line 164
def signature_from_file(pathname)
  file_signature = self.class.from_file(pathname)
  self.size = file_signature.size
  self.md5 = file_signature.md5
  self.sha1 = file_signature.sha1
  self.sha256 = file_signature.sha256
  self
end