Class: Modsulator

Inherits:
Object
Defined in:
lib/modsulator.rb

Overview

The main class for the MODSulator API, which lets you work with metadata spreadsheets and MODS XML.

Constant Summary

NAMESPACE =

We define our own namespace for <xmlDocs>

'http://library.stanford.edu/xmlDocs'

Constructor Details

#initialize(file, filename, options = {}) ⇒ Modsulator

Both a file and a filename are required because, in the API that uses this class, the two exist separately. If both :template_string and :template_file are given, :template_string takes precedence. If neither is specified, the gem’s built-in XML template is used.

Parameters:

  • file (File)

    Input spreadsheet file.

  • filename (String)

    The filename for the input spreadsheet.

  • options (Hash) (defaults to: {})

Options Hash (options):

  • :template_file (String)

    The full path to the desired template file. The constructor reads its contents into template_xml, so the file is expected to contain an XML template.

  • :template_string (String)

    The template contents as a string.



# File 'lib/modsulator.rb', line 31

def initialize(file, filename, options = {})
  @file = file
  @filename = filename

  @rows = ModsulatorSheet.new(@file, @filename).rows

  if options[:template_string]
    @template_xml = options[:template_string]
  elsif options[:template_file]
    @template_xml = File.read(options[:template_file])
  else
    @template_xml = File.read(File.expand_path('../modsulator/modsulator_template.xml', __FILE__))
  end
end
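
The template selection in the constructor can be sketched in isolation. This is a minimal illustration, not part of the gem: resolve_template and its default_xml argument are hypothetical names, with default_xml standing in for the contents of the built-in template file.

```ruby
# Hypothetical stand-alone helper mirroring the constructor's precedence:
# :template_string first, then :template_file, then the built-in default.
# `default_xml` is an assumption standing in for the gem's bundled template.
def resolve_template(options, default_xml)
  if options[:template_string]
    options[:template_string]
  elsif options[:template_file]
    File.read(options[:template_file])
  else
    default_xml
  end
end

builtin = '<mods>[[ti1:title]]</mods>'
puts resolve_template({ template_string: '<mods/>' }, builtin)
puts resolve_template({}, builtin)
```

Because :template_string is checked first, the file named by :template_file is never read when both options are supplied.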

Instance Attribute Details

#file ⇒ Object (readonly)

Returns the value of attribute file.



# File 'lib/modsulator.rb', line 20

def file
  @file
end

#rows ⇒ Object (readonly)

Returns the value of attribute rows.



# File 'lib/modsulator.rb', line 20

def rows
  @rows
end

#template_xml ⇒ Object (readonly)

Returns the value of attribute template_xml.



# File 'lib/modsulator.rb', line 20

def template_xml
  @template_xml
end

Class Method Details

.get_template_spreadsheet ⇒ Object

Returns the template spreadsheet that’s built into this gem.

Returns:

  • The template spreadsheet, in binary form.



# File 'lib/modsulator.rb', line 187

def self.get_template_spreadsheet
  IO.read(File.expand_path('../modsulator/modsulator_template.xlsx', __FILE__), mode: 'rb')
end

Instance Method Details

#convert_rows ⇒ String

Generates an XML document with one <mods> entry per input row. Example output:

<xmlDocs datetime="2015-03-23 09:22:11AM" sourceFile="FitchMLK-v1.xlsx">
     <xmlDoc id="descMetadata" objectId="druid:aa111aa1111">
         <mods ... >
             :
         </mods>
     </xmlDoc>
     <xmlDoc id="descMetadata" objectId="druid:aa222aa2222">
         <mods ... >
             :
         </mods>
     </xmlDoc>
</xmlDocs>

Returns:

  • (String)

    An XML string containing all the <mods> documents within a nested structure as shown in the example.



# File 'lib/modsulator.rb', line 62

def convert_rows
  time_stamp = Time.now.strftime('%Y-%m-%d %I:%M:%S%p')
  header = "<xmlDocs xmlns=\"#{NAMESPACE}\" datetime=\"#{time_stamp}\" sourceFile=\"#{@filename}\">"
  full_doc = Nokogiri::XML(header)
  root = full_doc.root

  @rows.each do |row|
    mods_xml_doc = row_to_xml(row)
    sub_doc = full_doc.create_element('xmlDoc', { id: 'descMetadata', objectId: "#{row['druid']}" })
    sub_doc.add_child(mods_xml_doc.root)
    root.add_child(sub_doc)
  end

  full_doc.to_s
end
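
The datetime attribute in the example output follows from the strftime pattern used in the method. As a small sketch (xml_docs_timestamp is a hypothetical helper name, and the fixed times are chosen only for illustration; the method itself uses Time.now):

```ruby
# Hypothetical helper wrapping the same format string used by convert_rows.
# %I is the zero-padded 12-hour clock and %p is the AM/PM marker.
def xml_docs_timestamp(time)
  time.strftime('%Y-%m-%d %I:%M:%S%p')
end

puts xml_docs_timestamp(Time.new(2015, 3, 23, 9, 22, 11))  # 2015-03-23 09:22:11AM
puts xml_docs_timestamp(Time.new(2015, 3, 23, 15, 5, 0))   # 2015-03-23 03:05:00PM
```

Since the pattern uses a 12-hour clock, the AM/PM suffix is needed to tell morning and afternoon values apart.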

#generate_normalized_mods(output_directory) ⇒ Void

Generates normalized (Stanford) MODS XML, writing output to files.

Parameters:

  • output_directory (String)

    The directory where output files should be stored.

Returns:

  • (Void)


# File 'lib/modsulator.rb', line 140

def generate_normalized_mods(output_directory)
  # Write one XML file per data row in the input spreadsheet
  rows.each do |row|
    sourceid = row['sourceId']
    output_filename = output_directory + '/' + sourceid + '.xml'

    mods_doc = row_to_xml(row)
    File.open(output_filename, 'w') { |fh| fh.puts(mods_doc.root.to_s) }
  end
end
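
The one-file-per-row naming scheme can be tried without the gem. The sketch below is an assumption-laden stand-in: write_rows is a hypothetical helper, the XML body is a placeholder rather than normalized MODS, and File.join is used in place of string concatenation.

```ruby
require 'tmpdir'

# Hypothetical helper: writes one XML file per row, named after the row's
# 'sourceId' value, and returns the paths written. The file body here is a
# placeholder; the real method writes the normalized MODS document.
def write_rows(rows, output_directory)
  rows.map do |row|
    path = File.join(output_directory, "#{row['sourceId']}.xml")
    File.write(path, "<mods><!-- #{row['sourceId']} --></mods>\n")
    path
  end
end

Dir.mktmpdir do |dir|
  paths = write_rows([{ 'sourceId' => 'item-001' }, { 'sourceId' => 'item-002' }], dir)
  puts paths.map { |p| File.basename(p) }.inspect
end
```

In the method as listed, the filename is built with String#+, so a row without a sourceId value would raise a TypeError rather than produce a file.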

#generate_xml(metadata_row) ⇒ String

Generates an XML string for a given row in a spreadsheet.

Parameters:

  • metadata_row (Hash)

    A single row in a MODS metadata spreadsheet, as provided by the ModsulatorSheet#rows method.

Returns:

  • (String)

    XML template, with data from the row substituted in.



# File 'lib/modsulator.rb', line 103

def generate_xml(metadata_row)
  manifest_row = metadata_row

  # XML escape all of the entries in the manifest row so they won't break the XML. This will turn any '<' into &lt;
  # and international characters into their corresponding code point etc.
  manifest_row.each do |k, v|
    next unless v
    v = transform_whitespace_markup(v) if v.instance_of?(String) && has_whitespace_markup?(v)
    manifest_row[k] = Nokogiri::XML::Text.new(v.to_s, Nokogiri::XML('')).to_s
  end

  # Enable access with symbol or string keys
  manifest_row = manifest_row.with_indifferent_access

  # Run the XML template through ERB. This creates a new ERB object from the template XML,
  # NOT creating a separate thread, and omitting newlines for lines ending with '%>'
  template     = ERB.new(template_xml, nil, '>')

  # ERB.result() actually computes the template. This just passes the top level binding.
  generated_xml = template.result(binding)

  # The manifest_row is a hash, with column names as the key.
  # In the template, as a convenience we allow users to put specific column placeholders inside
  # double brackets: "blah [[column_name]] blah".
  # Here we replace those placeholders with the corresponding value
  # from the manifest row.
  manifest_row.each { |k, v| generated_xml.gsub!("[[#{k}]]", v.to_s.strip) }

  generated_xml
end
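
The double-bracket substitution described in the comments above can be illustrated with the standard library alone. This is a rough sketch under stated assumptions: fill_placeholders is a hypothetical name, ERB::Util.html_escape stands in for the Nokogiri text-node escaping, and the column name ti1:title is only an example.

```ruby
require 'erb'

# Hypothetical helper: renders the template through ERB, then replaces each
# [[column_name]] placeholder with the escaped, stripped value from the row.
# ERB::Util.html_escape is a stand-in for the Nokogiri-based escaping.
def fill_placeholders(template, row)
  xml = ERB.new(template, trim_mode: '>').result(binding)
  row.each do |key, value|
    xml = xml.gsub("[[#{key}]]", ERB::Util.html_escape(value.to_s.strip))
  end
  xml
end

template = '<titleInfo><title>[[ti1:title]]</title></titleInfo>'
row = { 'ti1:title' => ' Fish & Chips <draft> ' }
puts fill_placeholders(template, row)
# <titleInfo><title>Fish &amp; Chips &lt;draft&gt;</title></titleInfo>
```

Escaping before substitution keeps characters such as & and < in spreadsheet cells from producing malformed XML.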

#has_whitespace_markup?(str) ⇒ Boolean

Checks whether or not a string contains any <br> or <p> markup.

Parameters:

  • str (String)

    Any string.

Returns:

  • (Boolean)

    true if the given string contains paragraph or line break HTML markup, false otherwise.



# File 'lib/modsulator.rb', line 83

def has_whitespace_markup?(str)
  str.match('<br>') || str.match('<br/>') || str.match('<p>') || str.match('<p/>')
end

#row_to_xml(row) ⇒ Object

Converts a single data row into a normalized MODS XML document.

Parameters:

  • row

    A single row in a MODS metadata spreadsheet, as provided by the ModsulatorSheet#rows method.

Returns:

  • An instance of Nokogiri::XML::Document that holds a normalized MODS XML instance.



# File 'lib/modsulator.rb', line 168

def row_to_xml(row)

  # Generate an XML string, then remove any text carried over from the template
  mods_xml = generate_xml(row)
  mods_xml.gsub!(/\[\[[^\]]+\]\]/, '')

  # Remove empty tags from when e.g. <[[sn1:p2:type]]> does not get filled in when [[sn1:p2:type]] has no value in the source spreadsheet
  mods_xml.gsub!(/<\s[^>]+><\/>/, '')

  mods_xml_doc = Nokogiri::XML(mods_xml)
  normalizer = Normalizer.new
  normalizer.normalize_document(mods_xml_doc.root)
  return mods_xml_doc
end
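
The first cleanup step, removing placeholders that no spreadsheet column filled in, can be tried on its own. The helper name strip_unfilled_placeholders is hypothetical; the regular expression is the one from the listing above.

```ruby
# Hypothetical helper isolating the first gsub in row_to_xml: any remaining
# [[...]] placeholder is removed from the generated XML string.
def strip_unfilled_placeholders(xml)
  xml.gsub(/\[\[[^\]]+\]\]/, '')
end

puts strip_unfilled_placeholders('<title>Report</title><subTitle>[[ti1:subTitle]]</subTitle>')
# <title>Report</title><subTitle></subTitle>
```

The element left empty by this step is still present in the string at this point.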

#transform_whitespace_markup(str) ⇒ String

Transforms HTML paragraph and line break markup tags to newline characters. This should be run before escaping any XML characters.

Parameters:

  • str (String)

    String to transform.

Returns:

  • (String)

    The given string, with a single newline character substituted for line break tags and two consecutive newline characters substituted for paragraph tags.



# File 'lib/modsulator.rb', line 94

def transform_whitespace_markup(str)
  str.gsub(/<br\/>/, '\n').gsub(/<br>/, '\n').gsub(/<p>/, '\n\n').gsub(/<p\/>/, '\n\n')
end
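
One Ruby detail is worth checking when reading the listing above: a single-quoted '\n' is a backslash followed by the letter n, not a line feed. The sketch below copies the substitution chain into a free-standing method to show what it returns as written; whether another part of the gem later turns this marker into an actual newline is not shown in this section.

```ruby
# Same substitution chain as the listing above, as a free-standing method.
def transform_whitespace_markup(str)
  str.gsub(/<br\/>/, '\n').gsub(/<br>/, '\n').gsub(/<p>/, '\n\n').gsub(/<p\/>/, '\n\n')
end

result = transform_whitespace_markup('one<br>two')
p result                  # "one\\ntwo" (backslash and n, two characters)
p result.include?("\n")   # false: no line feed character is present
```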

#validate_headers(spreadsheet_headers) ⇒ Array<String>

Checks that every header in the spreadsheet has a corresponding entry in the XML template.

Parameters:

  • spreadsheet_headers (Array<String>)

    A list of all the headers in the spreadsheet.

Returns:

  • (Array<String>)

    A list of spreadsheet headers that did not appear in the XML template. This list will be empty if all the headers were present.



# File 'lib/modsulator.rb', line 157

def validate_headers(spreadsheet_headers)
  spreadsheet_headers.reject do |header|
    header.nil? || header == 'sourceId' || template_xml.include?(header)
  end
end
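
The header check can be exercised without a spreadsheet. In this sketch, missing_headers is a hypothetical free-standing version of the method that takes the template as an argument, and the column names are made up for illustration.

```ruby
# Hypothetical stand-alone version of validate_headers: returns the headers
# that do not occur anywhere in the template text. nil headers and the
# 'sourceId' column are always treated as valid, as in the method above.
def missing_headers(spreadsheet_headers, template_xml)
  spreadsheet_headers.reject do |header|
    header.nil? || header == 'sourceId' || template_xml.include?(header)
  end
end

template = '<title>[[ti1:title]]</title><note>[[no1:note]]</note>'
headers = ['sourceId', 'ti1:title', nil, 'xx9:unknown']
p missing_headers(headers, template)  # ["xx9:unknown"]
```

The check is a plain substring test against the whole template, so a header such as title would also pass here because it appears inside an element name.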