Class: MARC::Reader

Inherits: Object

Includes: Enumerable

Defined in: lib/marc/reader.rb

Overview

A class for reading MARC binary (ISO 2709) files.

Character Encoding

In ruby 1.9+, ruby tags all strings with expected character encodings. If illegal bytes for that character encoding are encountered in certain operations, ruby will raise an exception. If a String is incorrectly tagged with the wrong character encoding, that makes it fairly likely an illegal byte for the specified encoding will be encountered.
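The effect of a wrong encoding tag can be seen with plain ruby Strings, independent of MARC::Reader. The snippet below is only an illustrative sketch: the sample bytes and variable names are made up for the example.

```ruby
# The byte 0xE9 is "é" in ISO-8859-1, but it is not a complete UTF-8 sequence.
raw = "caf\xE9".b

as_utf8   = raw.dup.force_encoding("UTF-8")       # wrong tag for these bytes
as_latin1 = raw.dup.force_encoding("ISO-8859-1")  # correct tag for these bytes

utf8_valid   = as_utf8.valid_encoding?    # false
latin1_valid = as_latin1.valid_encoding?  # true

# Operations that must interpret the bytes raise on the mis-tagged String.
raised = begin
  as_utf8.encode("UTF-16LE")
  nil
rescue Encoding::InvalidByteSequenceError => e
  e.class
end

# With the correct tag, the same bytes transcode cleanly.
transcoded = as_latin1.encode("UTF-8")
```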

So when reading binary MARC data with the MARC::Reader, it’s important that you let it know the expected encoding:

MARC::Reader.new("path/to/file.mrc", :external_encoding => "UTF-8")

If you leave off ‘external_encoding’, it will use the ruby environment Encoding.default_external, which is usually UTF-8 but may depend on your environment.

Even if you expect your data to be (eg) UTF-8, it may include bad/illegal bytes. By default MARC::Reader will leave these in the produced Strings, which will probably raise an exception later in your program. Better to catch this early, and ask MARC::Reader to raise immediately on illegal bytes:

MARC::Reader.new("path/to/file.mrc", :external_encoding => "UTF-8", 
  :validate_encoding => true)

Alternately, you can have MARC::Reader replace illegal bytes with the Unicode Replacement Character, or with a string of your choice (including the empty string, meaning just omit the bad bytes):

MARC::Reader.new("path/to/file.mrc", :external_encoding => "UTF-8", 
   :invalid => :replace)
MARC::Reader.new("path/to/file.mrc", :external_encoding => "UTF-8", 
   :invalid => :replace, :replace => "")
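When no transcoding is requested, the set_encoding source listed further down this page performs the replacement with ruby's own String#scrub. The following sketch shows what that call does to a String on its own; the sample bytes are invented for illustration.

```ruby
# A UTF-8 tagged String with one illegal byte (0xE9) in the middle.
bad = "caf\xE9!".b.force_encoding("UTF-8")

# Default replacement for a Unicode encoding is U+FFFD.
with_default = bad.scrub

# A custom replacement string; the empty string simply drops the bad byte.
with_question = bad.scrub("?")
with_nothing  = bad.scrub("")
```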

If you supply an :external_encoding argument, MARC::Reader will always assume that encoding – if you leave it off, MARC::Reader will use the encoding tagged on any input you pass in, such as Strings or File handles.

# marc data will have same encoding as string.encoding:
MARC::Reader.decode( string )

# Same, values will have encoding of string.encoding:
MARC::Reader.new(StringIO.new(string)) 

# data values will have cp866 encoding, per external_encoding of
# File object passed in
MARC::Reader.new(File.new("myfile.marc", "r:cp866"))

# explicitly tell MARC::Reader the encoding
MARC::Reader.new("myfile.marc", :external_encoding => "cp866")
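The encoding that MARC::Reader picks up from an IO-like argument is whatever that object reports from external_encoding (see the constructor source below). A small stdlib-only sketch of what StringIO and File report; the temporary file and sample text are assumptions made for the example.

```ruby
require "stringio"
require "tempfile"

# A StringIO reports the encoding of the String it wraps.
latin1_text = "abc".encode("ISO-8859-1")
string_io_encoding = StringIO.new(latin1_text).external_encoding

# A File reports the encoding named in its open mode.
file_encoding = Tempfile.create("sample") do |tmp|
  File.open(tmp.path, "r:cp866") { |f| f.external_encoding }
end
```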

MARC-8

The legacy MARC-8 encoding needs to be handled differently, because there is no built-in support in ruby for MARC-8.

You can specify “MARC-8” as an external encoding. This triggers a transcode to UTF-8 (NFC-normalized) for the internal ruby strings.

MARC::Reader.new("marc8.mrc", :external_encoding => "MARC-8")
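For readers unfamiliar with NFC, the sketch below uses ruby's built-in String#unicode_normalize to show what the normalization means. It does not call the MARC-8 transcoder; the sample string is an assumption chosen for illustration.

```ruby
# "e" followed by a combining acute accent: two code points.
decomposed = "e\u0301"

# NFC composes them into the single precomposed code point U+00E9.
composed = decomposed.unicode_normalize(:nfc)

decomposed_length = decomposed.length
composed_length   = composed.length
```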

For external_encoding “MARC-8”, :validate_encoding is always true; there’s no way to ignore bad bytes in MARC-8 when transcoding to unicode. However, just as with other encodings, the `:invalid => :replace` and `:replace => "string"` options can be used to replace bad bytes instead of raising.

If you want your MARC-8 to be transcoded internally to something other than UTF-8, you can use the :internal_encoding option which works with any encoding in MARC::Reader.

MARC::Reader.new("marc8.mrc", 
  :external_encoding => "MARC-8", 
  :internal_encoding => "UTF-16LE")

If you want to read in MARC-8 without transcoding, leaving the internal Strings in MARC-8, the only way to do that is with ruby’s ‘binary’ (aka “ASCII-8BIT”) encoding, since ruby has no knowledge of MARC-8. This will work:

MARC::Reader.new("marc8.mrc", :external_encoding => "binary")

Please note that MARC::Reader does not currently have any facilities for guessing encoding from MARC21 leader byte 9; that byte is ignored.
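If you want to act on leader byte 9 yourself, one possible approach is to inspect it on the raw record before decoding. In MARC 21, a blank in that position means MARC-8 and "a" means UCS/Unicode. The helper name below is hypothetical and not part of ruby-marc.

```ruby
# Hypothetical helper: maps MARC 21 leader position 9 to an encoding name
# suitable for :external_encoding. Returns nil for any other value.
def guess_external_encoding(raw_record)
  case raw_record.byteslice(9, 1)
  when "a" then "UTF-8"
  when " " then "MARC-8"
  end
end

unicode_leader = "00068nam a2200049 a 4500"
marc8_leader   = "00068nam  2200049 a 4500"

unicode_guess = guess_external_encoding(unicode_leader)
marc8_guess   = guess_external_encoding(marc8_leader)
```

The result could then be passed along, for example as `MARC::Reader.decode(raw, :external_encoding => guess)`, with a fallback of your choosing when the helper returns nil.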

Complete Encoding Options

These options can all be used on MARC::Reader.new or MARC::Reader.decode to specify external encoding, ask for a transcode to a different encoding on read, or validate or replace bad bytes in source.

:external_encoding: What encoding to consider the MARC record’s values to be in. This option takes precedence over the File handle or String argument’s encodings.
:internal_encoding: Ask MARC::Reader to transcode to this encoding in memory after reading the file in.
:validate_encoding: If you pass in `true`, MARC::Reader will promise to raise an Encoding::InvalidByteSequenceError if there are illegal bytes in the source for the :external_encoding. There is a performance penalty for this check. Without this option, an exception may or _may not_ be raised, and whether an exception is raised (or what class the exception has) may change in future ruby-marc versions without warning.
:invalid: Just like String#encode, set to :replace and any bytes in source data illegal for the source encoding will be replaced with the unicode replacement character (when in unicode encodings), or else ‘?’. Overrides :validate_encoding. This can help you sanitize your input and avoid ruby “invalid UTF-8 byte” exceptions later.
:replace: Just like String#encode, combine with `:invalid => :replace` to set your own replacement string for invalid bytes. You may use the empty string to simply eliminate invalid bytes.
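Since :invalid and :replace are modelled on String#encode, it may help to see those options on String#encode itself. This sketch uses only core ruby and a made-up sample; it shows the transcoding case, where the set_encoding source below hands the options to String#encode.

```ruby
# UTF-8 tagged source with one illegal byte (0xE9) before the "!".
bad = "caf\xE9!".b.force_encoding("UTF-8")

# Transcode to UTF-16LE, replacing the illegal byte; then back to UTF-8
# only so the results are easy to compare.
default_replaced = bad.encode("UTF-16LE", invalid: :replace).encode("UTF-8")
custom_replaced  = bad.encode("UTF-16LE", invalid: :replace, replace: "?").encode("UTF-8")

# Without :invalid => :replace, the transcode raises.
error_class = begin
  bad.encode("UTF-16LE")
  nil
rescue Encoding::InvalidByteSequenceError => e
  e.class
end
```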

Warning on ruby File’s own :internal_encoding, and unsafe transcoding from ruby

Be careful with using an explicit File object with the File’s own :internal_encoding set: it can cause ruby to transcode your data before MARC::Reader gets it, changing the byte count and making the MARC record unreadable in some cases. This applies to Encoding.default_internal too!

# May in some cases result in unreadable marc and an exception 
MARC::Reader.new(  File.new("marc_in_cp866.mrc", "r:cp866:utf-8") )

# May in some cases result in unreadable marc and an exception
Encoding.default_internal = "utf-8"
MARC::Reader.new(  File.new("marc_in_cp866.mrc", "r:cp866") )

# However this should be safe:
MARC::Reader.new(  "marc_in_cp866.mrc", :external_encoding => "cp866")

# And this should be safe, if you do want to transcode:
MARC::Reader.new(  "marc_in_cp866.mrc", :external_encoding => "cp866",
   :internal_encoding => "utf-8")

# And this should ALWAYS be safe, with or without an internal_encoding
MARC::Reader.new( File.new("marc_in_cp866.mrc", "r:binary:binary"),
   :external_encoding => "cp866",
   :internal_encoding => "utf-8")
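The reason early transcoding is unsafe is that binary MARC addresses fields by byte offsets and lengths, and transcoding changes how many bytes a value occupies. A minimal stdlib-only illustration, with sample bytes chosen for the example:

```ruby
# Three Cyrillic letters in cp866: one byte each.
cp866_text = "\x8F\xE0\xA8".b.force_encoding("cp866")

# The same three letters in UTF-8: two bytes each.
utf8_text = cp866_text.encode("UTF-8")

bytes_before = cp866_text.bytesize
bytes_after  = utf8_text.bytesize
```

Any directory entry computed against the cp866 bytes would point at the wrong place once ruby has already expanded the data to UTF-8.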

jruby note

In the past, jruby encoding-related bugs have caused problems with our encoding treatments. See for example: jira.codehaus.org/browse/JRUBY-6637

We recommend using the latest version of jruby, and at least jruby 1.7.6.

Direct Known Subclasses

ForgivingReader

Class Method Summary

.decode(marc, params = {}) ⇒ Object

A static method for turning raw MARC data in transmission format into a MARC::Record object.
.set_encoding(str, params) ⇒ Object

Input passed in probably has ‘binary’ encoding.

Instance Method Summary

#decode(marc) ⇒ Object

Decodes the given string into a MARC::Record object.
#each ⇒ Object

Supports iteration: for record in reader; print record; end.
#each_raw ⇒ Object

Iterates over each record as a raw String, rather than a decoded MARC::Record.
#initialize(file, options = {}) ⇒ Reader

The constructor, which you may pass either a path or an IO-like object.

Constructor Details

#initialize(file, options = {}) ⇒ `Reader`

The constructor, which you may pass either a path:

reader = MARC::Reader.new('marc.dat')

or, if it’s more convenient, a File object:

fh = File.new('marc.dat')
reader = MARC::Reader.new(fh)

or really any object that responds to read(n):

# marc is a string with a bunch of records in it
reader = MARC::Reader.new(StringIO.new(marc))

If your data have non-standard control fields in them (e.g., Aleph’s ‘FMT’), you need to add them explicitly to the MARC::ControlField.control_tags Set object:

MARC::ControlField.control_tags << 'FMT'

Also, if your data is in an encoding other than ASCII or UTF-8 (for example, when reading RUSMARC data) and you use ruby 1.9 or later, you can specify the source data encoding with an option:

reader = MARC::Reader.new('marc.dat', :external_encoding => 'cp866')

or you can pass an IO opened in the corresponding encoding:

reader = MARC::Reader.new(File.new('marc.dat', 'r:cp866'))

# File 'lib/marc/reader.rb', line 192

def initialize(file, options = {})      
  @encoding_options = {}
  # all can be nil
  [:internal_encoding, :external_encoding, :invalid, :replace, :validate_encoding].each do |key|
    @encoding_options[key] = options[key] if options.has_key?(key)
  end
        
  if file.is_a?(String)        
    @handle = File.new(file)
  elsif file.respond_to?("read", 5)
    @handle = file
  else
    raise ArgumentError, "must pass in path or file"
  end
  
  if (! @encoding_options[:external_encoding] ) && @handle.respond_to?(:external_encoding)
    # use file encoding only if we didn't already have an explicit one,
    # explicit one takes precedence. 
    #
    # Note, please don't use ruby's own internal_encoding transcode
    # with binary marc data, the transcode can mess up the byte count
    # and make it unreadable. 
    @encoding_options[:external_encoding] ||= @handle.external_encoding
  end

  # Only pull in the MARC8 translation if we need it, since it's really big
  if @encoding_options[:external_encoding]  == "MARC-8"
    require 'marc/marc8/to_unicode' unless defined? MARC::Marc8::ToUnicode
  end

end

Class Method Details

.decode(marc, params = {}) ⇒ `Object`

A static method for turning raw MARC data in transmission format into a MARC::Record object. The first argument is a String; options include:

[:external_encoding]  encoding of MARC record data values
[:forgiving]          if true, split field data on the end-of-field
                      delimiter instead of using the directory's byte
                      offsets, which tolerates certain kinds of bad MARC.

Raises:

(MARC::Exception)

# File 'lib/marc/reader.rb', line 293

def self.decode(marc, params={})
  if params.has_key?(:encoding)
    $stderr.puts "DEPRECATION WARNING: MARC::Reader.decode :encoding option deprecated, please use :external_encoding"
    params[:external_encoding] = params.delete(:encoding)
  end
  
  if (! params.has_key? :external_encoding ) && marc.respond_to?(:encoding)
    # If no forced external_encoding given, respect the encoding
    # declared on the string passed in. 
    params[:external_encoding] = marc.encoding
  end
  # And now that we've recorded the current encoding, we force
  # to binary encoding, because we're going to be doing byte arithmetic,
  # and want to avoid byte-vs-char confusion. 
  marc.force_encoding("binary") if marc.respond_to?(:force_encoding)
  
  record = Record.new()
  record.leader = marc[0..LEADER_LENGTH-1]

  # where the field data starts
  base_address = record.leader[12..16].to_i

  # get the byte offsets from the record directory
  directory = marc[LEADER_LENGTH..base_address-1]

  raise MARC::Exception.new("invalid directory in record") if directory == nil

  # the number of fields in the record corresponds to
  # how many directory entries there are
  num_fields = directory.length / DIRECTORY_ENTRY_LENGTH

  # when operating in forgiving mode we just split on end of
  # field instead of using calculated byte offsets from the
  # directory
  if params[:forgiving]        
    marc_field_data = marc[base_address..-1]
    # It won't let us do the split on bad utf8 data, but
    # we haven't yet set the 'proper' encoding or used
    # our correction/replace options. So call it binary for now.
    marc_field_data.force_encoding("binary") if marc_field_data.respond_to?(:force_encoding)
    
    all_fields = marc_field_data.split(END_OF_FIELD)
  else
    mba =  marc.bytes.to_a
  end

  0.upto(num_fields-1) do |field_num|

    # pull the directory entry for a field out
    entry_start = field_num * DIRECTORY_ENTRY_LENGTH
    entry_end = entry_start + DIRECTORY_ENTRY_LENGTH
    entry = directory[entry_start..entry_end]

    # extract the tag
    tag = entry[0..2]

    # get the actual field data
    # if we were told to be forgiving we just use the
    # next available chunk of field data that we
    # split apart based on the END_OF_FIELD
    field_data = ''
    if params[:forgiving]
      field_data = all_fields.shift()

    # otherwise we actually use the byte offsets in
    # directory to figure out what field data to extract
    else
      length = entry[3..6].to_i
      offset = entry[7..11].to_i
      field_start = base_address + offset
      field_end = field_start + length - 1
      field_data = mba[field_start..field_end].pack("c*")
    end

    # remove end of field
    field_data.delete!(END_OF_FIELD)
    
    # add a control field or data field
    if MARC::ControlField.control_tag?(tag)
      field_data = MARC::Reader.set_encoding( field_data , params)
      record.append(MARC::ControlField.new(tag,field_data))
    else
      field = MARC::DataField.new(tag)

      # get all subfields
      subfields = field_data.split(SUBFIELD_INDICATOR)

      # must have at least 2 elements (indicators, and 1 subfield)
      # TODO some sort of logging?
      next if subfields.length() < 2

      # get indicators
      indicators = MARC::Reader.set_encoding( subfields.shift(), params)
      field.indicator1 = indicators[0,1]
      field.indicator2 = indicators[1,1]

      # add each subfield to the field
      subfields.each() do |data|
        data = MARC::Reader.set_encoding( data, params )
        subfield = MARC::Subfield.new(data[0,1],data[1..-1])
        field.append(subfield)
      end

      # add the field to the record
      record.append(field)
    end
  end

  return record
end
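To make the byte arithmetic in the non-forgiving path easier to follow, here is a self-contained sketch that builds a tiny ISO 2709 record by hand and then walks its directory the same way. The helper names (build_record, parse_directory) are hypothetical and the sketch does not use ruby-marc; the constants carry the standard ISO 2709 values.

```ruby
END_OF_FIELD  = "\x1E".b
END_OF_RECORD = "\x1D".b
LEADER_LENGTH = 24
DIRECTORY_ENTRY_LENGTH = 12

# Hypothetical helper: serialize { tag => field data } into one raw record.
def build_record(fields)
  directory = String.new # empty binary String
  body = String.new
  fields.each do |tag, data|
    field = data.b + END_OF_FIELD
    # directory entry: 3-char tag, 4-digit length, 5-digit offset
    directory << format("%s%04d%05d", tag, field.bytesize, body.bytesize)
    body << field
  end
  directory << END_OF_FIELD
  base_address = LEADER_LENGTH + directory.bytesize
  total_length = base_address + body.bytesize + END_OF_RECORD.bytesize
  leader = format("%05dnam a22%05d a 4500", total_length, base_address)
  leader + directory + body + END_OF_RECORD
end

# Hypothetical helper: return [tag, data] pairs using directory byte offsets.
def parse_directory(raw)
  raw = raw.b # byte arithmetic, so avoid byte-vs-char confusion
  base_address = raw.byteslice(12, 5).to_i
  directory = raw.byteslice(LEADER_LENGTH, base_address - LEADER_LENGTH - 1)
  (0...directory.bytesize).step(DIRECTORY_ENTRY_LENGTH).map do |start|
    entry  = directory.byteslice(start, DIRECTORY_ENTRY_LENGTH)
    tag    = entry[0..2]
    length = entry[3..6].to_i
    offset = entry[7..11].to_i
    [tag, raw.byteslice(base_address + offset, length).delete(END_OF_FIELD)]
  end
end

raw = build_record("001" => "12345", "245" => "10\x1FaA title")
parsed = parse_directory(raw)
```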

.set_encoding(str, params) ⇒ `Object`

Input passed in probably has ‘binary’ encoding. We’ll set it to the proper encoding and, depending on settings, optionally:

* check for valid encoding
  - raise if not valid
  - or replace bad bytes with replacement chars if not valid
* transcode from external_encoding to internal_encoding

Special case for encoding “MARC-8”: it will be transcoded to UTF-8 (then further transcoded to internal_encoding, if set). For “MARC-8”, validate_encoding is always true; there’s no way to ignore bad bytes.

Params options:

* external_encoding: what encoding the input is expected to be in  
* validate_encoding: if true, will raise if the input has bytes invalid for the encoding
* invalid:  if set to :replace, will replace bad bytes with replacement
            chars instead of raising. 
* replace: Set replacement char for use with 'invalid', otherwise defaults
           to unicode replacement char, or question mark.

# File 'lib/marc/reader.rb', line 424

def self.set_encoding(str, params)
  if str.respond_to?(:force_encoding)
    if params[:external_encoding]
      if params[:external_encoding] == "MARC-8"
        transcode_params = [:invalid, :replace].each_with_object({}) { |k, hash| hash[k] = params[k] if params.has_key?(k) }
        str = MARC::Marc8::ToUnicode.new.transcode(str, transcode_params)
      else
        str = str.force_encoding(params[:external_encoding])
      end
    end     
        
    # If we're transcoding anyway, pass our invalid/replace options
    # on to String#encode, which will take care of them -- or raise
    # with illegal bytes without :invalid=>:replace. 
    #
    # If we're NOT transcoding, we need to use our own pure-ruby
    # implementation to do invalid byte replacements. OR to raise
    # a predictable exception iff :validate_encoding, otherwise
    # for performance we won't check, and you may or may not
    # get an exception from inside ruby-marc, and it may change
    # in future implementations. 
    if params[:internal_encoding]
      if RUBY_VERSION >= '3.0'
        str = str.encode(params[:internal_encoding], **params)
      else
        str = str.encode(params[:internal_encoding], params)
      end
    elsif (params[:invalid] || params[:replace] || (params[:validate_encoding] == true))

      if params[:validate_encoding] == true && ! str.valid_encoding?
        raise  Encoding::InvalidByteSequenceError.new("invalid byte in string for source encoding #{str.encoding.name}")
      end
      if params[:invalid] == :replace
        str = str.scrub(params[:replace])
      end
      
     end          
   end
   return str
end

Instance Method Details

#decode(marc) ⇒ `Object`

Decodes the given string into a MARC::Record object.

Wraps the class method MARC::Reader.decode, using the encoding options of the MARC::Reader instance.



# File 'lib/marc/reader.rb', line 282

def decode(marc)
  return MARC::Reader.decode(marc, @encoding_options)
end

#each ⇒ `Object`

Supports iteration:

for record in reader
  print record
end

# File 'lib/marc/reader.rb', line 228

def each
  unless block_given?
    return self.enum_for(:each)
  else
    self.each_raw do |raw|
      record = self.decode(raw)
      yield record
    end
  end
end
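The pattern used here (yield each item, or return an Enumerator when no block is given) is what lets a class that includes Enumerable be iterated lazily or chained. A minimal sketch with a hypothetical WordReader class standing in for the reader:

```ruby
# Hypothetical stand-in for a reader: yields words instead of MARC records.
class WordReader
  include Enumerable

  def initialize(text)
    @text = text
  end

  def each
    # Without a block, hand back an Enumerator over this same method.
    return enum_for(:each) unless block_given?

    @text.split.each { |word| yield word }
  end
end

reader = WordReader.new("alpha beta gamma")

upcased   = reader.map(&:upcase)  # Enumerable methods build on #each
first_two = reader.each.first(2)  # the Enumerator form, no block given
```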

#each_raw ⇒ `Object`

Iterates over each record as a raw String, rather than a decoded MARC::Record.

This allows for handling encoding exceptions per record (e.g. to log which record caused the error):

reader = MARC::Reader.new("marc_with_some_bad_records.dat",
                              :external_encoding => "UTF-8",
                              :validate_encoding => true)
reader.each_raw do |raw|
  begin
    record = reader.decode(raw)
  rescue Encoding::InvalidByteSequenceError => e
    record = MARC::Reader.decode(raw, :external_encoding => "UTF-8",
                                      :invalid => :replace)
    warn e.message, record
  end
end

If no block is given, an enumerator is returned.

# File 'lib/marc/reader.rb', line 259

def each_raw
  unless block_given?
    return self.enum_for(:each_raw)
  else
    while rec_length_s = @handle.read(5)
      # make sure the record length looks like an integer
      rec_length_i = rec_length_s.to_i
      if rec_length_i == 0
        raise MARC::Exception.new("invalid record length: #{rec_length_s}")
      end

      # get the raw MARC21 for a record back from the file
      # using the record length
      raw = rec_length_s + @handle.read(rec_length_i-5)
      yield raw
    end
  end
end
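The loop above relies on each record beginning with its own total length as five ASCII digits. The same framing can be tried out on a StringIO with made-up records. The method name below is hypothetical, and unlike the listing above, this sketch also rejects any declared length under 5.

```ruby
require "stringio"

# Hypothetical helper: yields each length-prefixed chunk read from io.
def each_length_prefixed(io)
  return enum_for(:each_length_prefixed, io) unless block_given?

  while (prefix = io.read(5))
    length = prefix.to_i
    raise ArgumentError, "invalid record length: #{prefix.inspect}" if length < 5

    # The declared length includes the 5-byte prefix itself.
    yield prefix + io.read(length - 5).to_s
  end
end

io = StringIO.new("00010hello00008abc")
chunks = each_length_prefixed(io).to_a
```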