Class: Traject::Marc4JReader

Inherits:
Object
Includes:
Enumerable
Defined in:
lib/traject/marc4j_reader.rb

Overview

Uses Marc4J to read the MARC records, but translates them to ruby-marc MARC::Record objects before delivering them; Marc4J stays inside the black box.

This is one way to get the ability to transcode from MARC8. Records it delivers are ALWAYS in UTF-8, and will be transcoded if needed.

We also hope it gives some performance benefit.

Uses the Marc4J MarcPermissiveStreamReader for binary input, in permissive or non-permissive mode according to settings. Uses the Marc4J MarcXmlReader for XML.

NOTE: If you aren’t reading binary records encoded in MARC8, you may find the pure-ruby Traject::MarcReader faster; the extra step of reading with Marc4J and then translating to a ruby MARC::Record adds some overhead.

Settings:

  • marc_source.type: serialization type. Default ‘binary’; ‘xml’ is also supported (TODO: json/marc-in-json).

  • marc4j_reader.permissive: default true; set to false to turn off permissive reading. Used as the value of the 'permissive' argument of the MarcPermissiveStreamReader constructor. Only used for ‘binary’.

  • marc4j_reader.source_encoding: only used for ‘binary’; otherwise the input is always read as UTF-8. A string, one of the values MarcPermissiveStreamReader accepts:

    * BESTGUESS (tries to use the MARC leader and believe it, I think)
    * ISO8859_1
    * UTF-8
    * MARC8

    Default 'BESTGUESS', but the encoding declared by MARC records in the wild is wrong often enough that setting this explicitly is recommended. Whatever the source encoding, records will ALWAYS be transcoded to UTF-8 on the way out.

  • marc4j_reader.jar_dir: path to a directory containing the Marc4J jar file to use. All .jar files in the directory will be loaded. If unset, uses the marc4j.jar bundled with traject.
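
As an illustration of the settings above, here is a minimal sketch of a settings hash for binary input known to be MARC8-encoded. The keys and the accepted encoding names come from the list above; the `ACCEPTED_ENCODINGS` constant and the `check_source_encoding` helper are hypothetical names used only to make the sketch self-checking, and are not part of traject.

```ruby
# Hypothetical helper, not part of traject. The encoding names are the ones
# listed above as accepted by MarcPermissiveStreamReader.
ACCEPTED_ENCODINGS = %w[BESTGUESS ISO8859_1 UTF-8 MARC8].freeze

def check_source_encoding(settings)
  encoding = settings["marc4j_reader.source_encoding"]
  unless ACCEPTED_ENCODINGS.include?(encoding)
    raise ArgumentError, "unexpected source_encoding: #{encoding.inspect}"
  end
  encoding
end

# Settings for binary input whose encoding is known, so BESTGUESS is avoided.
marc8_settings = {
  "marc_source.type"              => "binary",
  "marc4j_reader.permissive"      => true,
  "marc4j_reader.source_encoding" => "MARC8"
}

check_source_encoding(marc8_settings)
```

A hash like this is the kind of value passed as the second argument of the constructor, alongside an IO opened on the MARC file.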
    

Constructor Details

#initialize(input_stream, settings) ⇒ Marc4JReader

Returns a new instance of Marc4JReader.



# File 'lib/traject/marc4j_reader.rb', line 45

def initialize(input_stream, settings)
  @settings     = Traject::Indexer::Settings.new settings
  @input_stream = input_stream

  ensure_marc4j_loaded!
end

Instance Attribute Details

#input_stream ⇒ Object (readonly)

Returns the value of attribute input_stream.



# File 'lib/traject/marc4j_reader.rb', line 43

def input_stream
  @input_stream
end

#settings ⇒ Object (readonly)

Returns the value of attribute settings.



# File 'lib/traject/marc4j_reader.rb', line 43

def settings
  @settings
end

Instance Method Details

#convert_marc4j_to_rubymarc(marc4j) ⇒ Object



# File 'lib/traject/marc4j_reader.rb', line 112

def convert_marc4j_to_rubymarc(marc4j)
  rmarc = MARC::Record.new
  rmarc.leader = marc4j.getLeader.marshal

  marc4j.getControlFields.each do |marc4j_control|
    rmarc.append( MARC::ControlField.new(marc4j_control.getTag(), marc4j_control.getData )  )
  end

  marc4j.getDataFields.each do |marc4j_data|
    rdata = MARC::DataField.new(  marc4j_data.getTag,  marc4j_data.getIndicator1.chr, marc4j_data.getIndicator2.chr )

    marc4j_data.getSubfields.each do |subfield|

      # We assume Marc21, skip corrupted data
      # if subfield.getCode is more than 255, subsequent .chr
      # would raise.
      if subfield.getCode > 255
        logger.warn("Marc4JReader: Corrupted MARC data, record id #{marc4j.getControlNumber}, field #{marc4j_data.tag}, corrupt subfield code byte #{subfield.getCode}. Skipping subfield, but continuing with record.")
        next
      end

      rsubfield = MARC::Subfield.new(subfield.getCode.chr, subfield.getData)
      rdata.append rsubfield
    end

    rmarc.append rdata
  end

  return rmarc
end
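
The guard on `subfield.getCode` is there because Integer#chr, called without an encoding argument, raises a RangeError for values above 255. A minimal sketch of that behaviour in plain Ruby, without Marc4J; `safe_code_char` is a hypothetical helper name used only for illustration.

```ruby
# Hypothetical helper: returns the one-character subfield code for a code
# point, or nil when the value is outside the single-byte range that
# Integer#chr accepts without an encoding argument.
def safe_code_char(code_point)
  return nil if code_point > 255
  code_point.chr
end

ordinary = safe_code_char(97)    # an ordinary MARC21 subfield code, "a"
corrupt  = safe_code_char(8217)  # nil; 8217.chr itself would raise RangeError
```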

#create_marc_reader! ⇒ Object



# File 'lib/traject/marc4j_reader.rb', line 74

def create_marc_reader!
  case input_type
  when "binary"
    permissive = settings["marc4j_reader.permissive"].to_s == "true"

    # #to_inputstream turns our ruby IO into a Java InputStream
    # third arg means 'convert to UTF-8, yes'
    MarcPermissiveStreamReader.new(input_stream.to_inputstream, permissive, true, settings["marc4j_reader.source_encoding"])
  when "xml"
    MarcXmlReader.new(input_stream.to_inputstream)
  else
    raise ArgumentError.new("Unrecognized marc_source.type: #{input_type}")
  end
end
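
The `.to_s == "true"` comparison lets the permissive setting arrive either as a boolean or as a string. A minimal sketch of that coercion in plain Ruby; `permissive_flag` is a hypothetical name. The sketch assumes the documented default of true is filled in before this point, since an unset (nil) value coerces to false here.

```ruby
# Hypothetical helper mirroring the comparison used in create_marc_reader!.
def permissive_flag(value)
  value.to_s == "true"
end

from_boolean = permissive_flag(true)     # boolean true is accepted
from_string  = permissive_flag("true")   # so is the string "true"
turned_off   = permissive_flag("false")  # any other value means non-permissive
unset        = permissive_flag(nil)      # nil.to_s is "", so this is false
```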

#each ⇒ Object



# File 'lib/traject/marc4j_reader.rb', line 89

def each
  while (internal_reader.hasNext)
    begin
      marc4j = internal_reader.next
      rubymarc = convert_marc4j_to_rubymarc(marc4j)
    rescue Exception => e
      msg = "MARC4JReader: Error reading MARC, fatal, re-raising"
      if marc4j
        msg += "\n    001 id: #{marc4j.getControlNumber}"
      end
      msg += "\n    #{Traject::Util.exception_to_log_message(e)}"
      logger.fatal msg
      raise e
    end

    yield rubymarc
  end
end
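
Because the class includes Enumerable and defines #each, the usual Enumerable methods become available on a reader. A minimal sketch of the same pattern without Marc4J: `FakeJavaIterator` and `WrappingReader` are hypothetical stand-ins for a Java-style reader exposing hasNext/next and for the wrapping reader class.

```ruby
# Hypothetical stand-in for a Java-style iterator (hasNext / next).
class FakeJavaIterator
  def initialize(items)
    @items = items.dup
  end

  def hasNext
    !@items.empty?
  end

  def next
    @items.shift
  end
end

# Hypothetical wrapper: defining #each and including Enumerable is enough
# to get map, select, first and the rest.
class WrappingReader
  include Enumerable

  def initialize(iterator)
    @iterator = iterator
  end

  def each
    while @iterator.hasNext
      yield @iterator.next
    end
  end
end

reader = WrappingReader.new(FakeJavaIterator.new(%w[rec1 rec2 rec3]))
ids = reader.map { |rec| rec.upcase }
```

In this sketch the wrapper consumes its underlying iterator, so a second pass yields nothing; a reader over an input stream would be expected to behave the same way.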

#ensure_marc4j_loaded! ⇒ Object

Loads Marc4J unless it appears to already be loaded.

Will load from the marc4j_reader.jar_dir setting if given, otherwise from the bundled vendor location.

Will java_import MarcPermissiveStreamReader and MarcXmlReader so you have those available as un-namespaced classes.



# File 'lib/traject/marc4j_reader.rb', line 59

def ensure_marc4j_loaded!
  unless defined?(MarcPermissiveStreamReader) && defined?(MarcXmlReader)
    Traject::Util.require_marc4j_jars(settings)
  end
end

#input_type ⇒ Object



# File 'lib/traject/marc4j_reader.rb', line 69

def input_type
  # maybe later add some guessing somehow
  settings["marc_source.type"]
end

#internal_reader ⇒ Object



# File 'lib/traject/marc4j_reader.rb', line 65

def internal_reader
  @internal_reader ||= create_marc_reader!
end
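
The `||=` idiom means the underlying reader is created lazily, on first use, and reused afterwards. A minimal sketch of that memoization in plain Ruby; `LazyHolder`, `build_reader` and `build_count` are hypothetical names for illustration only.

```ruby
# Hypothetical class showing lazy, one-time construction with ||=.
class LazyHolder
  attr_reader :build_count

  def initialize
    @build_count = 0
  end

  def internal_reader
    # The right-hand side runs only while @internal_reader is nil or false.
    @internal_reader ||= build_reader
  end

  private

  def build_reader
    @build_count += 1
    Object.new
  end
end

holder = LazyHolder.new
first  = holder.internal_reader
second = holder.internal_reader
```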

#logger ⇒ Object



# File 'lib/traject/marc4j_reader.rb', line 108

def logger
  # Fall back to an effectively null logger when none is configured.
  @logger ||= (settings[:logger] || Yell.new(STDERR, :level => "gt.fatal"))
end