Class: Traject::Marc4JReader
- Inherits: Object
- Includes: Enumerable
- Defined in: lib/traject/marc4j_reader.rb
Overview
Uses Marc4J to read the MARC records, but translates them to ruby-marc MARC::Record objects before delivering them; Marc4J stays inside the black box.
The main reason to use it is the ability to transcode from MARC8. Records it delivers are ALWAYS in UTF-8; they are transcoded if needed.
The hope is that it also gives some performance benefit.
Uses the Marc4J MarcPermissiveStreamReader for binary input, in either permissive or non-permissive mode according to settings. Uses the Marc4J MarcXmlReader for XML.
NOTE: If you aren't reading binary records encoded in MARC8, you may find the pure-Ruby Traject::MarcReader faster; the extra step of reading with Marc4J and then translating to a ruby-marc MARC::Record adds some overhead.
Settings:
- marc_source.type: serialization type. Default 'binary', also 'xml' (TODO: json/marc-in-json).
- marc4j_reader.permissive: default true; set to false to turn off permissive reading. Used as the value of the 'permissive' arg of the MarcPermissiveStreamReader constructor. Only used for 'binary'.
- marc4j_reader.source_encoding: Only used for 'binary', otherwise always UTF-8. String, one of the values MarcPermissiveStreamReader accepts:
  * BESTGUESS (tries to use the MARC leader and believe it, I think)
  * ISO8859_1
  * UTF-8
  * MARC8
  Default 'BESTGUESS', but MARC records in the wild are so often wrong here that setting it explicitly is recommended. (Records will ALWAYS be transcoded to UTF-8 on the way out. We insist.)
- marc4j_reader.jar_dir: Path to a directory containing the Marc4J jar file to use. All .jar files in the directory will be loaded. If unset, uses the marc4j.jar bundled with traject.
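As an illustration of the settings above, here is a minimal sketch of a settings hash for binary MARC8 input, together with a small sanity check. The helper `check_marc4j_settings` and the constant `ALLOWED_SOURCE_ENCODINGS` are hypothetical names invented for this example and are not part of traject; the allowed values are simply the ones listed in this section.

```ruby
# Hypothetical helper, not part of traject: checks a plain settings Hash
# against the values documented in the Settings list.
ALLOWED_SOURCE_ENCODINGS = %w[BESTGUESS ISO8859_1 UTF-8 MARC8].freeze

def check_marc4j_settings(settings)
  problems = []

  # Fall back to the documented defaults when a key is absent.
  type = settings.fetch("marc_source.type", "binary")
  unless %w[binary xml].include?(type)
    problems << "unknown marc_source.type: #{type}"
  end

  encoding = settings.fetch("marc4j_reader.source_encoding", "BESTGUESS")
  unless ALLOWED_SOURCE_ENCODINGS.include?(encoding)
    problems << "unknown marc4j_reader.source_encoding: #{encoding}"
  end

  problems
end

# Settings for binary records known to be encoded in MARC8.
marc8_settings = {
  "marc_source.type"              => "binary",
  "marc4j_reader.permissive"      => true,
  "marc4j_reader.source_encoding" => "MARC8"
}

puts check_marc4j_settings(marc8_settings).inspect
puts check_marc4j_settings({ "marc4j_reader.source_encoding" => "LATIN1" }).inspect
```

Setting the source encoding explicitly, as in `marc8_settings`, follows the recommendation above rather than relying on BESTGUESS.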
Instance Attribute Summary
- #input_stream ⇒ Object (readonly)
  Returns the value of attribute input_stream.
- #settings ⇒ Object (readonly)
  Returns the value of attribute settings.
Instance Method Summary
- #convert_marc4j_to_rubymarc(marc4j) ⇒ Object
- #create_marc_reader! ⇒ Object
- #each ⇒ Object
- #ensure_marc4j_loaded! ⇒ Object
  Loads Marc4J unless it appears to already be loaded.
- #initialize(input_stream, settings) ⇒ Marc4JReader (constructor)
  A new instance of Marc4JReader.
- #input_type ⇒ Object
- #internal_reader ⇒ Object
- #logger ⇒ Object
Constructor Details
#initialize(input_stream, settings) ⇒ Marc4JReader
Returns a new instance of Marc4JReader.
# File 'lib/traject/marc4j_reader.rb', line 45
def initialize(input_stream, settings)
  @settings = Traject::Indexer::Settings.new settings
  @input_stream = input_stream
  ensure_marc4j_loaded!
end
Instance Attribute Details
#input_stream ⇒ Object (readonly)
Returns the value of attribute input_stream.
# File 'lib/traject/marc4j_reader.rb', line 43
def input_stream
  @input_stream
end
#settings ⇒ Object (readonly)
Returns the value of attribute settings.
# File 'lib/traject/marc4j_reader.rb', line 43
def settings
  @settings
end
Instance Method Details
#convert_marc4j_to_rubymarc(marc4j) ⇒ Object
# File 'lib/traject/marc4j_reader.rb', line 112
def convert_marc4j_to_rubymarc(marc4j)
  rmarc = MARC::Record.new
  rmarc.leader = marc4j.getLeader.marshal

  marc4j.getControlFields.each do |marc4j_control|
    rmarc.append( MARC::ControlField.new(marc4j_control.getTag(), marc4j_control.getData ) )
  end

  marc4j.getDataFields.each do |marc4j_data|
    rdata = MARC::DataField.new( marc4j_data.getTag, marc4j_data.getIndicator1.chr, marc4j_data.getIndicator2.chr )

    marc4j_data.getSubfields.each do |subfield|
      # We assume Marc21, skip corrupted data
      # if subfield.getCode is more than 255, subsequent .chr
      # would raise.
      if subfield.getCode > 255
        logger.warn("Marc4JReader: Corrupted MARC data, record id #{marc4j.getControlNumber}, field #{marc4j_data.tag}, corrupt subfield code byte #{subfield.getCode}. Skipping subfield, but continuing with record.")
        next
      end

      rsubfield = MARC::Subfield.new(subfield.getCode.chr, subfield.getData)
      rdata.append rsubfield
    end

    rmarc.append rdata
  end

  return rmarc
end
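The guard on subfield codes above 255 exists because Integer#chr raises for such values when no encoding is given. The following is a small standalone sketch of that behavior in plain Ruby, assuming Ruby's default configuration (Encoding.default_internal unset); `subfield_code_to_s` is a hypothetical helper written for this example, not a traject method.

```ruby
# Hypothetical helper (not part of traject): convert a numeric subfield
# code to a one-character String, or nil when it is above 255.
def subfield_code_to_s(code)
  return nil if code > 255
  code.chr
end

valid   = subfield_code_to_s(97)   # an ordinary single-byte code
corrupt = subfield_code_to_s(300)  # treated as corrupt and skipped

puts valid.inspect
puts corrupt.inspect

# Without the guard, Integer#chr raises for the same value.
begin
  300.chr
  unguarded = :no_error
rescue RangeError => e
  unguarded = e.class
end

puts unguarded
```

Returning nil here stands in for the `next` in the real method, which skips the corrupt subfield and continues with the rest of the record.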
#create_marc_reader! ⇒ Object
# File 'lib/traject/marc4j_reader.rb', line 74
def create_marc_reader!
  case input_type
  when "binary"
    permissive = settings["marc4j_reader.permissive"].to_s == "true"

    # #to_inputstream turns our ruby IO into a Java InputStream
    # third arg means 'convert to UTF-8, yes'
    MarcPermissiveStreamReader.new(input_stream.to_inputstream, permissive, true, settings["marc4j_reader.source_encoding"])
  when "xml"
    MarcXmlReader.new(input_stream.to_inputstream)
  else
    raise IllegalArgument.new("Unrecgonized marc_source.type: #{input_type}")
  end
end
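The permissive flag is derived by comparing the setting's string form to "true", so both the boolean true and the string "true" enable permissive mode and every other value disables it. A minimal sketch of that comparison on plain hashes follows; `permissive?` is a hypothetical helper name. Note that with a plain Hash a missing key yields false, so the documented default of true has to be supplied before this point, and this sketch does not model where that happens.

```ruby
# Hypothetical helper mirroring the comparison used in create_marc_reader!.
def permissive?(settings)
  settings["marc4j_reader.permissive"].to_s == "true"
end

results = {
  boolean_true:  permissive?({ "marc4j_reader.permissive" => true }),
  string_true:   permissive?({ "marc4j_reader.permissive" => "true" }),
  boolean_false: permissive?({ "marc4j_reader.permissive" => false }),
  other_string:  permissive?({ "marc4j_reader.permissive" => "yes" }),
  missing_key:   permissive?({})  # nil.to_s is "", so this is false
}

results.each { |label, value| puts "#{label}: #{value}" }
```

This form is convenient when settings arrive as strings, for example from a command line, since "true" and true are treated alike.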
#each ⇒ Object
# File 'lib/traject/marc4j_reader.rb', line 89
def each
  while (internal_reader.hasNext)
    begin
      marc4j = internal_reader.next
      rubymarc = convert_marc4j_to_rubymarc(marc4j)
    rescue Exception =>e
      msg = "MARC4JReader: Error reading MARC, fatal, re-raising"
      if marc4j
        msg += "\n 001 id: #{marc4j.getControlNumber}"
      end
      msg += "\n #{Traject::Util.(e)}"
      logger.fatal msg
      raise e
    end

    yield rubymarc
  end
end
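Because the class includes Enumerable and defines #each over a hasNext/next style reader, the usual Enumerable methods become available on it. The sketch below shows that pattern in plain Ruby with stand-in classes: `FakeJavaIterator` and `SketchReader` are hypothetical names invented for this example, and upcasing a string stands in for the Marc4J to ruby-marc conversion.

```ruby
# Stand-in for a Java-style iterator exposing hasNext and next.
class FakeJavaIterator
  def initialize(items)
    @items = items.dup
  end

  def hasNext
    !@items.empty?
  end

  def next
    @items.shift
  end
end

# Hypothetical reader following the same shape as Marc4JReader#each:
# pull from the underlying iterator, convert, then yield.
class SketchReader
  include Enumerable

  def initialize(iterator)
    @iterator = iterator
  end

  def each
    while @iterator.hasNext
      raw = @iterator.next
      yield raw.upcase  # placeholder for the record conversion step
    end
  end
end

reader = SketchReader.new(FakeJavaIterator.new(%w[rec1 rec2 rec3]))

first_two = reader.first(2)  # Enumerable#first stops iterating early
remaining = reader.to_a      # only what has not been consumed yet

puts first_two.inspect
puts remaining.inspect
```

In this sketch the underlying iterator is consumed as it is read, so a second pass only sees what is left. A reader over an input stream would presumably behave the same way, which is worth keeping in mind before calling more than one Enumerable method on the same instance.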
#ensure_marc4j_loaded! ⇒ Object
Loads Marc4J unless it appears to already be loaded.
Loads from the location given in settings if set, otherwise from the bundled vendor location.
Will java_import MarcPermissiveStreamReader and MarcXmlReader so you have those available as un-namespaced classes.
# File 'lib/traject/marc4j_reader.rb', line 59
def ensure_marc4j_loaded!
  unless defined?(MarcPermissiveStreamReader) && defined?(MarcXmlReader)
    Traject::Util.require_marc4j_jars(settings)
  end
end
#input_type ⇒ Object
# File 'lib/traject/marc4j_reader.rb', line 69
def input_type
  # maybe later add some guessing somehow
  settings["marc_source.type"]
end
#internal_reader ⇒ Object
# File 'lib/traject/marc4j_reader.rb', line 65
def internal_reader
  @internal_reader ||= create_marc_reader!
end
#logger ⇒ Object
# File 'lib/traject/marc4j_reader.rb', line 108
def logger
  @logger ||= (settings[:logger] || Yell.new(STDERR, :level => "gt.fatal")) # null logger)
end