Class: Traject::MarcExtractor
- Inherits: Object
- Defined in: lib/traject/marc_extractor.rb
Overview
MarcExtractor is a class for extracting lists of strings from a MARC::Record, according to specifications. See .parse_string_spec for a description of the string arguments used to specify extraction. See #initialize for options that control extraction.
Examples:
array_of_stuff = MarcExtractor.extract_by_spec(marc_record, "001:245abc:700a")
values = MarcExtractor.extract_by_spec(marc_record, "040a", :seperator => nil)
Instance Attribute Summary
- #marc_record ⇒ Object
Returns the value of attribute marc_record.
- #options ⇒ Object
Returns the value of attribute options.
- #spec_hash ⇒ Object
Returns the value of attribute spec_hash.
Class Method Summary
- .extract_by_spec(marc_record, specification, options = {}) ⇒ Object
Convenience method to construct a MarcExtractor object and run extract on it.
- .parse_string_spec(spec_string) ⇒ Object
Converts from a string marc spec like “245abc:700a” to a nested hash used internally to represent the specification.
Instance Method Summary
- #collect_subfields(field, spec) ⇒ Object
Pass in a marc data field and a hash spec, returns an ARRAY of one or more strings, subfields extracted and processed per spec.
- #control_field?(field) ⇒ Boolean
- #each_matching_line ⇒ Object
Yields a block for every line in source record that matches spec.
- #extract ⇒ Object
Returns array of strings, extracted values.
- #initialize(marc_record, spec_hash, options = {}) ⇒ MarcExtractor
Constructor. Takes a marc record and a hash that’s the output of .parse_string_spec; call #extract to get an array of strings extracted from the record accordingly.
- #matches_indicators(field, spec) ⇒ Object
Takes a marc field and an individual spec hash, {:subfields => array, :indicators => array}; returns true if the field's indicators match the spec.
- #spec_covering_field(field) ⇒ Object
Is there a spec covering extraction from this field? Returns the matching spec hash, or nil. May match 880s against the spec for their original tag, depending on the value of :alternate_script.
Constructor Details
#initialize(marc_record, spec_hash, options = {}) ⇒ MarcExtractor
Takes a marc record and a hash that’s the output of .parse_string_spec. Call #extract to get an array of strings extracted from the record accordingly.
options:
- :seperator: default ‘ ’ (space), what to use to separate subfield values when joining strings. The key really is spelled :seperator in the code. nil => subfield values are returned as separate strings rather than joined.
- :alternate_script: default :include, include linked 880s for tags that match spec. Also:
  - false => do not include.
  - :only => only include linked 880s, not original
# File 'lib/traject/marc_extractor.rb', line 50
def initialize(marc_record, spec_hash, options = {})
  self.options = {
    :seperator => ' ',
    :alternate_script => :include
  }.merge(options)

  # ArgumentError is the standard Ruby exception for a bad argument.
  raise ArgumentError, "second arg to MarcExtractor.new must be a Hash specification object" unless spec_hash.kind_of? Hash

  self.marc_record = marc_record
  self.spec_hash = spec_hash
end
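A minimal sketch of how the option defaults combine with caller-supplied options, using the same Hash#merge pattern as the constructor. DEFAULTS and effective_options are hypothetical names for illustration only; they are not part of Traject.

```ruby
# Hypothetical stand-in for the defaults hash built inside the constructor.
DEFAULTS = { :seperator => ' ', :alternate_script => :include }.freeze

# Caller-supplied keys override the defaults; unspecified keys keep them.
def effective_options(options = {})
  DEFAULTS.merge(options)
end

p effective_options()
p effective_options(:seperator => nil)
p effective_options(:alternate_script => :only)
```

Because merge lets the caller's value win, passing :seperator => nil really does replace the default space with nil, which is what switches off joining.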
Instance Attribute Details
#marc_record ⇒ Object
Returns the value of attribute marc_record.
# File 'lib/traject/marc_extractor.rb', line 15
def marc_record
  @marc_record
end
#options ⇒ Object
Returns the value of attribute options.
# File 'lib/traject/marc_extractor.rb', line 15
def options
  @options
end
#spec_hash ⇒ Object
Returns the value of attribute spec_hash.
# File 'lib/traject/marc_extractor.rb', line 15
def spec_hash
  @spec_hash
end
Class Method Details
.extract_by_spec(marc_record, specification, options = {}) ⇒ Object
Convenience method to construct a MarcExtractor object and run extract on it.
First arg is a marc record.
Second arg is either a string that will be given to parse_string_spec, OR a hash that’s the return value of parse_string_spec.
Third arg is an optional options hash that will be passed as third arg of MarcExtractor constructor.
# File 'lib/traject/marc_extractor.rb', line 28
def self.extract_by_spec(marc_record, specification, options = {})
  # ArgumentError is the standard Ruby exception for a bad argument.
  raise ArgumentError, "first argument must not be nil" if marc_record.nil?

  unless specification.kind_of? Hash
    specification = self.parse_string_spec(specification)
  end

  Traject::MarcExtractor.new(marc_record, specification, options).extract
end
.parse_string_spec(spec_string) ⇒ Object
Converts from a string marc spec like “245abc:700a” to a nested hash used internally to represent the specification.
A String specification is a string of the form:
{tag}{|indicators|}{subfields} separated by colons
tag is three chars (usually but not necessarily numeric), indicators are an optional two chars enclosed in pipes, subfields are an optional list of chars (alphanumeric)
indicator spec must be two chars, but one can be * meaning “don’t care”. Use a space to mean ‘blank’
“245|01|abc65:345abc:700|*5|:800”
Or, for control (fixed) fields (ordinarily fields 001-009), you can include a byte slice specification, but can NOT include subfield or indicator specifications. Plus can use special tag “LDR” for the marc leader. (TODO)
"008[35-37]:LDR[5]"
=> bytes 35-37 inclusive of field 008, and byte 5 of the marc leader.
Returns a nested hash keyed by tags.
{ tag => {
    :subfields => ['a', 'b', '2'], # a list of subfield codes (built as an Array in the source shown here); may be empty or nil
    :indicators => ['1', '0'] # a two-element Array; may be empty or nil, and either element can be nil
  }
}
For byte offsets, :bytes => 12 or :bytes => (7..10)
- subfields and indicators can only be provided for marc data/variable fields
- byte slice can only be provided for marc control fields (generally tags less than 010)
See tests for more examples.
# File 'lib/traject/marc_extractor.rb', line 95
def self.parse_string_spec(spec_string)
  hash = {}

  spec_string.split(":").each do |part|
    if (part =~ /\A([a-zA-Z0-9]{3})(\|([a-z0-9\ \*]{2})\|)?([a-z0-9]*)?\Z/)
      # variable field
      tag, indicators, subfields = $1, $3, $4

      hash[tag] ||= {}

      if subfields
        subfields.each_char do |subfield|
          hash[tag][:subfields] ||= Array.new
          hash[tag][:subfields] << subfield
        end
      end
      if indicators
        hash[tag][:indicators] = [ (indicators[0] if indicators[0] != "*"), (indicators[1] if indicators[1] != "*") ]
      end
    elsif (part =~ /\A([a-zA-Z0-9]{3})(\[(\d+)(-(\d+))?\])\Z/) # "005[4-5]"
      tag, byte1, byte2 = $1, $3, $5
      hash[tag] ||= {}

      if byte1 && byte2
        hash[tag][:bytes] = ((byte1.to_i)..(byte2.to_i))
      elsif byte1
        hash[tag][:bytes] = byte1.to_i
      end
    else
      raise ArgumentError.new("Unrecognized marc extract specification: #{part}")
    end
  end

  return hash
end
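To make the spec-string format concrete, here is a condensed, standalone sketch of the same parsing idea that runs without Traject. It mirrors the two patterns above, but sketch_parse, VARIABLE_FIELD and CONTROL_FIELD are hypothetical names, and this is an illustration rather than the library's implementation.

```ruby
# Hypothetical condensed parser; mirrors the two patterns in the source above.
VARIABLE_FIELD = /\A([a-zA-Z0-9]{3})(\|([a-z0-9\ \*]{2})\|)?([a-z0-9]*)?\z/
CONTROL_FIELD  = /\A([a-zA-Z0-9]{3})\[(\d+)(-(\d+))?\]\z/

def sketch_parse(spec_string)
  spec_string.split(":").each_with_object({}) do |part, hash|
    if (m = VARIABLE_FIELD.match(part))
      tag, indicators, subfields = m[1], m[3], m[4]
      hash[tag] ||= {}
      # Each subfield code is appended; no key is set when none are given.
      subfields.to_s.each_char { |code| (hash[tag][:subfields] ||= []) << code }
      # "*" means "don't care" and is stored as nil.
      hash[tag][:indicators] = indicators.chars.map { |c| c == "*" ? nil : c } if indicators
    elsif (m = CONTROL_FIELD.match(part))
      tag, first, last = m[1], m[2].to_i, m[4]
      hash[tag] ||= {}
      # A single offset is stored as an Integer, a span as an inclusive Range.
      hash[tag][:bytes] = last ? (first..last.to_i) : first
    else
      raise ArgumentError, "Unrecognized marc extract specification: #{part}"
    end
  end
end

p sketch_parse("245|*1|ab:008[35-37]:LDR[5]")
```

Under these assumptions the 245 entry carries :subfields => ["a", "b"] and :indicators => [nil, "1"], the 008 entry carries :bytes => 35..37, and the LDR entry carries :bytes => 5.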
Instance Method Details
#collect_subfields(field, spec) ⇒ Object
Pass in a marc data field and a hash spec, returns an ARRAY of one or more strings, subfields extracted and processed per spec. Takes account of options such as :seperator
# File 'lib/traject/marc_extractor.rb', line 163
def collect_subfields(field, spec)
  subfields = field.subfields.collect do |subfield|
    subfield.value if spec[:subfields].nil? || spec[:subfields].include?(subfield.code)
  end.compact

  return options[:seperator] ? [ subfields.join( options[:seperator]) ] : subfields
end
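The effect of :seperator is easiest to see on a small example. The sketch below assumes plain Structs as stand-ins for MARC::DataField and MARC::Subfield so it runs without the marc gem; SketchSubfield, SketchDataField, TITLE_FIELD and sketch_collect are hypothetical names, and the title data is made up.

```ruby
# Hypothetical stand-ins for MARC::Subfield and MARC::DataField.
SketchSubfield  = Struct.new(:code, :value)
SketchDataField = Struct.new(:tag, :subfields)

# Keep the values of wanted subfields (all of them when the spec names none),
# then either join them into one string or return them separately.
def sketch_collect(field, spec, seperator)
  values = field.subfields.select { |sf|
    spec[:subfields].nil? || spec[:subfields].include?(sf.code)
  }.map(&:value)
  seperator ? [values.join(seperator)] : values
end

TITLE_FIELD = SketchDataField.new("245", [
  SketchSubfield.new("a", "Moby Dick /"),
  SketchSubfield.new("h", "[microform]"),
  SketchSubfield.new("c", "Herman Melville.")
])

p sketch_collect(TITLE_FIELD, { :subfields => ["a", "c"] }, " ")
p sketch_collect(TITLE_FIELD, { :subfields => ["a", "c"] }, nil)
```

With a separator the result is a one-element array holding the joined string; with nil each subfield value comes back as its own element.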
#control_field?(field) ⇒ Boolean
# File 'lib/traject/marc_extractor.rb', line 192
def control_field?(field)
  # should the MARC gem have a more efficient way to do this,
  # define #control_field? on both ControlField and DataField?
  return field.kind_of? MARC::ControlField
end
#each_matching_line ⇒ Object
Yields a block for every line in source record that matches spec. First arg to block is MARC::Field (control or data), second is the hash specification that it matched on. May take account of options such as :alternate_script
# File 'lib/traject/marc_extractor.rb', line 151
def each_matching_line
  self.marc_record.each do |field|
    if (spec = spec_covering_field(field)) && matches_indicators(field, spec)
      yield(field, spec)
    end
  end
end
#extract ⇒ Object
Returns array of strings, extracted values
# File 'lib/traject/marc_extractor.rb', line 133
def extract
  results = []

  self.each_matching_line do |field, spec|
    if control_field?(field)
      results << (spec[:bytes] ? field.value.byteslice(spec[:bytes]) : field.value)
    else
      results.concat collect_subfields(field, spec)
    end
  end

  return results
end
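For control fields, the :bytes value is handed straight to String#byteslice, so offsets are zero-based and a Range is inclusive at both ends. A small sketch with a made-up 008 value; FIELD_008 and sketch_slice are hypothetical names.

```ruby
# A made-up 40-byte 008 value: 35 filler bytes, a language code, 2 more bytes.
FIELD_008 = ("|" * 35) + "eng" + " d"

# Slice when the spec carries :bytes, otherwise return the whole value.
def sketch_slice(value, bytes)
  bytes ? value.byteslice(bytes) : value
end

p sketch_slice(FIELD_008, 35..37) # bytes 35, 36 and 37
p sketch_slice(FIELD_008, 5)      # a single byte
p sketch_slice(FIELD_008, nil)    # no :bytes in the spec, whole value
```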
#matches_indicators(field, spec) ⇒ Object
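Takes a marc field and an individual spec hash, {:subfields => array, :indicators => array}. Returns true if the spec has no :indicators, or if each non-nil indicator in the spec equals the corresponding indicator on the field.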
# File 'lib/traject/marc_extractor.rb', line 199
def matches_indicators(field, spec)
  return true if spec[:indicators].nil?

  return (spec[:indicators][0].nil? || spec[:indicators][0] == field.indicator1) &&
    (spec[:indicators][1].nil? || spec[:indicators][1] == field.indicator2)
end
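A standalone sketch of the indicator check, assuming a plain Struct in place of MARC::DataField. SketchIndField, TITLE_245 and sketch_matches_indicators? are hypothetical names. A nil slot in the spec acts as the "don't care" wildcard produced by * in the string spec.

```ruby
# Hypothetical stand-in for a MARC data field with two indicators.
SketchIndField = Struct.new(:tag, :indicator1, :indicator2)

def sketch_matches_indicators?(field, spec)
  wanted = spec[:indicators]
  return true if wanted.nil? # the spec places no constraint on indicators

  (wanted[0].nil? || wanted[0] == field.indicator1) &&
    (wanted[1].nil? || wanted[1] == field.indicator2)
end

TITLE_245 = SketchIndField.new("245", "1", "0")

p sketch_matches_indicators?(TITLE_245, { :indicators => [nil, "0"] })
p sketch_matches_indicators?(TITLE_245, { :indicators => ["0", nil] })
p sketch_matches_indicators?(TITLE_245, {})
```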
#spec_covering_field(field) ⇒ Object
Is there a spec covering extraction from this field? Returns the matching spec hash, or nil if none applies. In the code shown here, an 880 field is matched against the spec for the original tag named in its subfield 6, unless :alternate_script is false. A non-880 field is matched against the spec for its own tag, unless :alternate_script is :only.
# File 'lib/traject/marc_extractor.rb', line 177
def spec_covering_field(field)
  if field.tag == "880" && options[:alternate_script] != false
    # Guard first: an 880 without a subfield 6 names no original tag.
    linkage = field["6"]
    return nil if linkage.nil?

    # pull out the spec for corresponding original marc tag this 880 corresponds to
    # Due to bug in jruby https://github.com/jruby/jruby/issues/886 , we need
    # to do this weird encode gymnastics, which fixes it for mysterious reasons.
    orig_field = linkage.encode(linkage.encoding).byteslice(0, 3)
    self.spec_hash[orig_field]
  elsif options[:alternate_script] != :only
    self.spec_hash[field.tag]
  end
end
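The three :alternate_script settings can be compared side by side with a standalone sketch. It assumes a Struct whose linkage member stands in for subfield 6 of an 880, whose first three characters name the original tag. SketchLinkedField, SKETCH_SPEC, ORIGINAL_245, PARALLEL_880 and sketch_spec_for are hypothetical names, and the linkage value is made up.

```ruby
# Hypothetical stand-in for a MARC field; linkage plays the role of subfield 6.
SketchLinkedField = Struct.new(:tag, :linkage)

def sketch_spec_for(field, spec_hash, alternate_script)
  if field.tag == "880" && alternate_script != false
    # Look up the spec under the original tag that this 880 parallels.
    field.linkage && spec_hash[field.linkage[0, 3]]
  elsif alternate_script != :only
    spec_hash[field.tag]
  end
end

SKETCH_SPEC  = { "245" => { :subfields => ["a"] } }
ORIGINAL_245 = SketchLinkedField.new("245", nil)
PARALLEL_880 = SketchLinkedField.new("880", '245-01/$1')

[:include, false, :only].each do |setting|
  p [setting,
     sketch_spec_for(ORIGINAL_245, SKETCH_SPEC, setting),
     sketch_spec_for(PARALLEL_880, SKETCH_SPEC, setting)]
end
```

In this sketch :include matches both fields, false matches only the original 245, and :only matches only the 880.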