Class: Modsulator

Inherits:
Object
Extended by:
Deprecation
Defined in:
lib/modsulator.rb

Overview

The main class for the MODSulator API, which lets you work with metadata spreadsheets and MODS XML.

Constant Summary

NAMESPACE = 'http://library.stanford.edu/xmlDocs'

The XML namespace that this gem defines for the <xmlDocs> wrapper element.

Constructor Details

#initialize(file, filename, options = {}) ⇒ Modsulator

Both a file and a filename are required because the API that uses this class keeps the two separately. If :template_string is given it takes precedence over :template_file. If neither :template_file nor :template_string is specified, the gem’s built-in XML template is used.

Parameters:

  • file (File)

    Input spreadsheet file.

  • filename (String)

    The filename for the input spreadsheet.

  • options (Hash) (defaults to: {})

Options Hash (options):

  • :template_file (String)

    The full path to the desired template file (an XML template, read as text).

  • :template_string (String)

    The template contents as a string.



# File 'lib/modsulator.rb', line 33

def initialize(file, filename, options = {})
  @file = file
  @filename = filename

  @rows = ModsulatorSheet.new(@file, @filename).rows

  if options[:template_string]
    @template_xml = options[:template_string]
  elsif options[:template_file]
    @template_xml = File.read(options[:template_file])
  else
    @template_xml = File.read(File.expand_path('../modsulator/modsulator_template.xml', __FILE__))
  end
end
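
The template precedence described above can be sketched in isolation. This is a minimal illustration only: `resolve_template` and its `default_xml` argument are hypothetical names standing in for the constructor logic and the gem's built-in template.

```ruby
require 'tempfile'

# Hypothetical helper mirroring the constructor's precedence:
# :template_string wins over :template_file; otherwise a default is used.
def resolve_template(options, default_xml)
  if options[:template_string]
    options[:template_string]
  elsif options[:template_file]
    File.read(options[:template_file])
  else
    default_xml
  end
end

default_xml = '<mods>built-in</mods>' # stand-in for the gem's bundled template

Tempfile.create(['template', '.xml']) do |f|
  f.write('<mods>from file</mods>')
  f.flush

  # Both options given: the string is used and the file is never read.
  puts resolve_template({ template_string: '<mods>inline</mods>', template_file: f.path }, default_xml)
  # Only the file given: its contents are used.
  puts resolve_template({ template_file: f.path }, default_xml)
end

# Neither given: fall back to the default.
puts resolve_template({}, default_xml)
```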

Instance Attribute Details

#file ⇒ Object (readonly)

Returns the value of attribute file.



# File 'lib/modsulator.rb', line 22

def file
  @file
end

#rows ⇒ Object (readonly)

Returns the value of attribute rows.



# File 'lib/modsulator.rb', line 22

def rows
  @rows
end

#template_xml ⇒ Object (readonly)

Returns the value of attribute template_xml.



# File 'lib/modsulator.rb', line 22

def template_xml
  @template_xml
end

Class Method Details

.get_template_spreadsheet ⇒ String

Returns the template spreadsheet that’s built into this gem.

Returns:

  • (String)

    The template spreadsheet, in binary form.



# File 'lib/modsulator.rb', line 190

def get_template_spreadsheet
  IO.read(File.expand_path('../modsulator/modsulator_template.xlsx', __FILE__), mode: 'rb')
end

.template_spreadsheet_path ⇒ String

This can be used by modsulator-rails-app so we can do:

send_file Modsulator.template_spreadsheet_path

which is more memory efficient than:

render body: Modsulator.get_template_spreadsheet

Returns:

  • (String)

    The path to the spreadsheet template.



# File 'lib/modsulator.rb', line 200

def template_spreadsheet_path
  File.expand_path('../modsulator/modsulator_template.xlsx', __FILE__)
end

Instance Method Details

#convert_rows ⇒ String

Generates an XML document with one <mods> entry per input row. Example output:

<xmlDocs datetime="2015-03-23 09:22:11AM" sourceFile="FitchMLK-v1.xlsx">
     <xmlDoc id="descMetadata" objectId="druid:aa111aa1111">
         <mods ... >
             :
         </mods>
     </xmlDoc>
     <xmlDoc id="descMetadata" objectId="druid:aa222aa2222">
         <mods ... >
             :
         </mods>
     </xmlDoc>
</xmlDocs>

Returns:

  • (String)

    An XML string containing all the <mods> documents within a nested structure as shown in the example.



# File 'lib/modsulator.rb', line 64

def convert_rows
  time_stamp = Time.now.strftime('%Y-%m-%d %I:%M:%S%p')
  header = "<xmlDocs xmlns=\"#{NAMESPACE}\" datetime=\"#{time_stamp}\" sourceFile=\"#{@filename}\">"
  full_doc = Nokogiri::XML(header)
  root = full_doc.root

  @rows.each do |row|
    mods_xml_doc = row_to_xml(row)
    sub_doc = full_doc.create_element('xmlDoc', { id: 'descMetadata', objectId: "#{row['druid']}" })
    sub_doc.add_child(mods_xml_doc.root)
    root.add_child(sub_doc)
  end

  full_doc.to_s
end
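
The datetime attribute in the example output comes from the strftime format used above. As a small sketch, the format string below is copied from convert_rows, while the helper name `xml_docs_timestamp` and the sample times are made up for illustration:

```ruby
# Format string taken from convert_rows; %I is the zero-padded 12-hour clock
# and %p is the AM/PM marker, so the suffix is needed to disambiguate.
def xml_docs_timestamp(time)
  time.strftime('%Y-%m-%d %I:%M:%S%p')
end

puts xml_docs_timestamp(Time.new(2015, 3, 23, 9, 22, 11))  # 2015-03-23 09:22:11AM
puts xml_docs_timestamp(Time.new(2015, 3, 23, 21, 22, 11)) # 2015-03-23 09:22:11PM
```

Because the hour is on a 12-hour clock, two timestamps twelve hours apart differ only in the AM/PM suffix.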

#generate_normalized_mods(output_directory) ⇒ Void

Generates normalized (Stanford) MODS XML, writing output to files.

Parameters:

  • output_directory (String)

    The directory where output files should be stored.

Returns:

  • (Void)


# File 'lib/modsulator.rb', line 142

def generate_normalized_mods(output_directory)
  # Write one XML file per data row in the input spreadsheet
  rows.each do |row|
    sourceid = row['sourceId']
    output_filename = output_directory + '/' + sourceid + '.xml'

    mods_doc = row_to_xml(row)
    File.open(output_filename, 'w') { |fh| fh.puts(mods_doc.root.to_s) }
  end
end
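
The one-file-per-row behaviour can be tried without the gem. In this sketch, `fake_mods_for` is a hypothetical stand-in for row_to_xml and `write_rows` is an illustrative name; only the naming scheme (sourceId plus .xml inside the output directory) is taken from the method above.

```ruby
require 'tmpdir'

# Hypothetical stand-in for row_to_xml: wraps the title in a <mods> element.
def fake_mods_for(row)
  "<mods><title>#{row['title']}</title></mods>"
end

# Writes one file per row, named after the row's sourceId, and returns the paths.
def write_rows(rows, output_directory)
  rows.map do |row|
    path = File.join(output_directory, "#{row['sourceId']}.xml")
    File.write(path, fake_mods_for(row) + "\n")
    path
  end
end

Dir.mktmpdir do |dir|
  rows = [
    { 'sourceId' => 'item-1', 'title' => 'First' },
    { 'sourceId' => 'item-2', 'title' => 'Second' }
  ]
  write_rows(rows, dir).each { |path| puts File.basename(path) }
end
```

Rows without a sourceId are not handled in this sketch; in the method above, a nil sourceId would raise when it is concatenated into the filename.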

#generate_xml(metadata_row) ⇒ String

Generates an XML string for a given row in a spreadsheet.

Parameters:

  • metadata_row (Hash)

    A single row in a MODS metadata spreadsheet, as provided by the ModsulatorSheet#rows method.

Returns:

  • (String)

    XML template, with data from the row substituted in.



# File 'lib/modsulator.rb', line 105

def generate_xml(metadata_row)
  manifest_row = metadata_row

  # XML escape all of the entries in the manifest row so they won't break the XML. This will turn any '<' into &lt;
  # and international characters into their corresponding code point etc.
  manifest_row.each do |k, v|
    next unless v
    v = transform_whitespace_markup(v) if v.instance_of?(String) && has_whitespace_markup?(v)
    manifest_row[k] = Nokogiri::XML::Text.new(v.to_s, Nokogiri::XML('')).to_s
  end

  # Enable access with symbol or string keys
  manifest_row = manifest_row.with_indifferent_access

  # Run the XML template through ERB. This creates a new ERB object from the template XML,
  # NOT creating a separate thread, and omitting newlines for lines ending with '%>'
  template     = ERB.new(template_xml, nil, '>')

  # ERB.result() actually computes the template. This just passes the top level binding.
  generated_xml = template.result(binding)

  # The manifest_row is a hash, with column names as the key.
  # In the template, as a convenience we allow users to put specific column placeholders inside
  # double brackets: "blah [[column_name]] blah".
  # Here we replace those placeholders with the corresponding value
  # from the manifest row.
  manifest_row.each { |k, v| generated_xml.gsub!("[[#{k}]]", v.to_s.strip) }

  generated_xml
end
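
The escape-then-substitute flow can be sketched with the standard library alone. This is an approximation under stated assumptions: `fill_template` is a hypothetical name, ERB::Util.html_escape stands in for the Nokogiri text-node escaping, and the keyword form `trim_mode: '>'` is used in place of the positional arguments shown above.

```ruby
require 'erb'

# Sketch of the two-stage fill: escape each value, run the template through ERB,
# then replace [[column]] placeholders with the escaped, stripped values.
def fill_template(template_xml, row)
  # Assumption: html_escape approximates the XML escaping done via Nokogiri.
  escaped = row.transform_values { |v| ERB::Util.html_escape(v.to_s) }

  rendered = ERB.new(template_xml, trim_mode: '>').result(binding)

  escaped.each do |key, value|
    # Block form so backslashes in the value are not treated as replacement escapes.
    rendered = rendered.gsub("[[#{key}]]") { value.strip }
  end
  rendered
end

template = '<titleInfo><title>[[ti1:title]]</title></titleInfo>'
puts fill_template(template, { 'ti1:title' => ' Cats & Dogs ' })
# <titleInfo><title>Cats &amp; Dogs</title></titleInfo>
```

Placeholders whose column is missing from the row are left in place by this step.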

#has_whitespace_markup?(str) ⇒ Boolean

Checks whether or not a string contains any <br> or <p> markup.

Parameters:

  • str (String)

    Any string.

Returns:

  • (Boolean)

    true if the given string contains paragraph or line break HTML markup, false otherwise.



# File 'lib/modsulator.rb', line 85

def has_whitespace_markup?(str)
  str.match('<br>') || str.match('<br/>') || str.match('<p>') || str.match('<p/>')
end
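
As written, the method returns whatever String#match returns: a MatchData object when one of the tags is found and nil otherwise. That works in conditionals but is not a strict true or false. A hypothetical variant (the name `whitespace_markup?` is not part of the gem) that checks the same four tags with one regular expression and returns a strict Boolean might look like this:

```ruby
# Hypothetical variant: one regex covering <br>, <br/>, <p> and <p/>,
# using match? so the result is exactly true or false.
def whitespace_markup?(str)
  str.match?(%r{<(?:br|p)/?>})
end

p whitespace_markup?('line one<br/>line two') # true
p whitespace_markup?('<p>paragraph')          # true
p whitespace_markup?('plain <b>bold</b>')     # false
```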

#row_to_xml(row) ⇒ Object

Converts a single data row into a normalized MODS XML document.

Parameters:

  • row

    A single row in a MODS metadata spreadsheet, as provided by the ModsulatorSheet#rows method.

Returns:

  • An instance of Nokogiri::XML::Document that holds a normalized MODS XML instance.



# File 'lib/modsulator.rb', line 170

def row_to_xml(row)

  # Generate an XML string, then remove any text carried over from the template
  mods_xml = generate_xml(row)
  mods_xml.gsub!(/\[\[[^\]]+\]\]/, '')

  # Remove empty tags from when e.g. <[[sn1:p2:type]]> does not get filled in when [[sn1:p2:type]] has no value in the source spreadsheet
  mods_xml.gsub!(/<\s[^>]+><\/>/, '')

  mods_xml_doc = Nokogiri::XML(mods_xml)
  normalizer = Stanford::Mods::Normalizer.new
  normalizer.normalize_document(mods_xml_doc.root)
  return mods_xml_doc
end
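
The first cleanup step, removing placeholders that were never filled in, can be shown on its own. The regular expression is the one used above; the helper name `strip_unfilled_placeholders` and the sample XML are illustrative only.

```ruby
# Removes any remaining [[...]] placeholder text carried over from the template.
def strip_unfilled_placeholders(xml)
  xml.gsub(/\[\[[^\]]+\]\]/, '')
end

puts strip_unfilled_placeholders('<note>[[no1:note]]</note><title>Kept</title>')
# <note></note><title>Kept</title>
```

The now-empty element is still present after this step; any further cleanup is left to the later steps of the method.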

#transform_whitespace_markup(str) ⇒ String

Transforms HTML paragraph and line break markup tags to newline characters. This should be run before escaping any XML characters.

Parameters:

  • str (String)

    String to transform.

Returns:

  • (String)

    The given string, with a single newline character substituted for line break tags and two consecutive newline characters substituted for paragraph tags.



# File 'lib/modsulator.rb', line 96

def transform_whitespace_markup(str)
  str.gsub(/<br\/>/, '\n').gsub(/<br>/, '\n').gsub(/<p>/, '\n\n').gsub(/<p\/>/, '\n\n')
end
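
One Ruby detail is worth keeping in mind when reading the source above: a single-quoted '\n' is a two-character string (a backslash followed by n), while a double-quoted "\n" is a single newline character. The sketch below is a hypothetical variant, `markup_to_newlines`, that produces actual newline characters as the description states. Whether the library relies on the literal form in a later processing step is not shown here.

```ruby
# Hypothetical variant that emits real newline characters:
# one for line break tags, two for paragraph tags.
def markup_to_newlines(str)
  str.gsub(%r{<br/?>}, "\n").gsub(%r{<p/?>}, "\n\n")
end

p markup_to_newlines('one<br>two<p>three') # "one\ntwo\n\nthree"

# Single versus double quotes:
p '\n'.length # 2
p "\n".length # 1
```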

#validate_headers(spreadsheet_headers) ⇒ Array<String>

Checks that every header in the spreadsheet has a corresponding entry in the XML template.

Parameters:

  • spreadsheet_headers (Array<String>)

    A list of all the headers in the spreadsheet

Returns:

  • (Array<String>)

    A list of spreadsheet headers that did not appear in the XML template. This list will be empty if all the headers were present.



# File 'lib/modsulator.rb', line 159

def validate_headers(spreadsheet_headers)
  spreadsheet_headers.reject do |header|
    header.nil? || header == 'sourceId' || template_xml.include?(header)
  end
end
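
The header check can be exercised against a small template string. In this sketch, `missing_headers` is a hypothetical standalone version that takes the template as an argument instead of reading the template_xml attribute; the header names are made up.

```ruby
# Standalone version of the check: returns headers that do not occur in the template.
# nil headers and 'sourceId' are always accepted.
def missing_headers(spreadsheet_headers, template_xml)
  spreadsheet_headers.reject do |header|
    header.nil? || header == 'sourceId' || template_xml.include?(header)
  end
end

template = '<title>[[ti1:title]]</title><note>[[no1:note]]</note>'
p missing_headers(['sourceId', 'ti1:title', nil, 'ti1:titel'], template)
# ["ti1:titel"]
```

Because the test is a plain substring match, a header passes whenever its text occurs anywhere in the template, not only inside a [[...]] placeholder.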