Class: Modsulator
Overview
The main class for the MODSulator API, which lets you work with metadata spreadsheets and MODS XML.
Constant Summary
- NAMESPACE =
We define our own namespace for <xmlDocs>
'http://library.stanford.edu/xmlDocs'
Instance Attribute Summary
-
#file ⇒ Object
readonly
Returns the value of attribute file.
-
#rows ⇒ Object
readonly
Returns the value of attribute rows.
-
#template_xml ⇒ Object
readonly
Returns the value of attribute template_xml.
Class Method Summary
-
.get_template_spreadsheet ⇒ String
Returns the template spreadsheet that’s built into this gem.
-
.template_spreadsheet_path ⇒ String
Returns the path to the template spreadsheet that's built into this gem. modsulator-rails-app can use it with send_file Modsulator.template_spreadsheet_path, which is more memory efficient than render body: Modsulator.get_template_spreadsheet.
Instance Method Summary
-
#convert_rows ⇒ String
Generates an XML document with one <mods> entry per input row.
-
#generate_normalized_mods(output_directory) ⇒ Void
Generates normalized (Stanford) MODS XML, writing output to files.
-
#generate_xml(metadata_row) ⇒ String
Generates an XML string for a given row in a spreadsheet.
-
#has_whitespace_markup?(str) ⇒ Boolean
Checks whether or not a string contains any <br> or <p> markup.
-
#initialize(file, filename, options = {}) ⇒ Modsulator
constructor
Both a file and a filename are required because the API that uses this class keeps the file and its filename separately.
-
#row_to_xml(row) ⇒ Object
Converts a single data row into a normalized MODS XML document.
-
#transform_whitespace_markup(str) ⇒ String
Transforms HTML paragraph and line break markup tags to newline characters.
-
#validate_headers(spreadsheet_headers) ⇒ Array<String>
Checks that all the headers in the spreadsheet have a corresponding entry in the XML template.
Constructor Details
#initialize(file, filename, options = {}) ⇒ Modsulator
Both a file and a filename are required because the API that uses this class keeps the file and its filename separately. Note that if neither :template_file nor :template_string is specified, the gem's built-in XML template is used.
# File 'lib/modsulator.rb', line 33
def initialize(file, filename, options = {})
  @file = file
  @filename = filename

  @rows = ModsulatorSheet.new(@file, @filename).rows
  if options[:template_string]
    @template_xml = options[:template_string]
  elsif options[:template_file]
    @template_xml = File.read(options[:template_file])
  else
    @template_xml = File.read(File.expand_path('../modsulator/modsulator_template.xml', __FILE__))
  end
end
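The precedence among the template options can be tried out in isolation. The following is a minimal sketch, not the gem's code: resolve_template is a hypothetical helper, and temporary files stand in for both a user-supplied template and the built-in one.

```ruby
require 'tempfile'

# Hypothetical helper mirroring the precedence in the constructor:
# :template_string wins over :template_file, which wins over the built-in file.
def resolve_template(options, builtin_path)
  if options[:template_string]
    options[:template_string]
  elsif options[:template_file]
    File.read(options[:template_file])
  else
    File.read(builtin_path)
  end
end

# Stand-ins for a user-supplied template file and the gem's built-in template.
custom = Tempfile.new(['custom', '.xml'])
custom.write('<mods>custom</mods>')
custom.close

builtin = Tempfile.new(['builtin', '.xml'])
builtin.write('<mods>builtin</mods>')
builtin.close

both_given   = { template_string: '<mods>inline</mods>', template_file: custom.path }
from_string  = resolve_template(both_given, builtin.path)
from_file    = resolve_template({ template_file: custom.path }, builtin.path)
from_default = resolve_template({}, builtin.path)

puts from_string, from_file, from_default
```

When both options are supplied, the string is used and the file is never read.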
Instance Attribute Details
#file ⇒ Object (readonly)
Returns the value of attribute file.
# File 'lib/modsulator.rb', line 22
def file
  @file
end
#rows ⇒ Object (readonly)
Returns the value of attribute rows.
# File 'lib/modsulator.rb', line 22
def rows
  @rows
end
#template_xml ⇒ Object (readonly)
Returns the value of attribute template_xml.
# File 'lib/modsulator.rb', line 22
def template_xml
  @template_xml
end
Class Method Details
.get_template_spreadsheet ⇒ String
Returns the template spreadsheet that’s built into this gem.
# File 'lib/modsulator.rb', line 190
def get_template_spreadsheet
  IO.read(File.expand_path('../modsulator/modsulator_template.xlsx', __FILE__), mode: 'rb')
end
.template_spreadsheet_path ⇒ String
This can be used by modsulator-rails-app so we can do:
send_file Modsulator.template_spreadsheet_path
which is more memory efficient than:
render body: Modsulator.get_template_spreadsheet
# File 'lib/modsulator.rb', line 200
def template_spreadsheet_path
  File.expand_path('../modsulator/modsulator_template.xlsx', __FILE__)
end
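A path like this is commonly resolved with File.expand_path relative to the source file. The sketch below shows how such a relative path resolves, assuming a POSIX filesystem; the install location is made up for illustration.

```ruby
# Assumed install location, for illustration only.
source_file = '/gems/modsulator/lib/modsulator.rb'

# The leading '..' steps from the file up to its directory, so the result
# points into a 'modsulator' directory that sits next to modsulator.rb.
path = File.expand_path('../modsulator/modsulator_template.xlsx', source_file)
puts path
```

Because only a path string is returned, a caller such as send_file can stream the file from disk instead of holding the whole spreadsheet in memory.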
Instance Method Details
#convert_rows ⇒ String
Generates an XML document with one <mods> entry per input row. Example output:
<xmlDocs datetime="2015-03-23 09:22:11AM" sourceFile="FitchMLK-v1.xlsx">
<xmlDoc id="descMetadata" objectId="druid:aa111aa1111">
<mods ... >
:
</mods>
</xmlDoc>
<xmlDoc id="descMetadata" objectId="druid:aa222aa2222">
<mods ... >
:
</mods>
</xmlDoc>
</xmlDocs>
# File 'lib/modsulator.rb', line 64
def convert_rows
  time_stamp = Time.now.strftime('%Y-%m-%d %I:%M:%S%p')
  header = "<xmlDocs xmlns=\"#{NAMESPACE}\" datetime=\"#{time_stamp}\" sourceFile=\"#{@filename}\">"
  full_doc = Nokogiri::XML(header)
  root = full_doc.root

  @rows.each do |row|
    mods_xml_doc = row_to_xml(row)
    sub_doc = full_doc.create_element('xmlDoc', { id: 'descMetadata', objectId: "#{row['druid']}" })
    sub_doc.add_child(mods_xml_doc.root)
    root.add_child(sub_doc)
  end

  full_doc.to_s
end
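The datetime attribute in the example output follows from the strftime pattern used above. As a small sketch, the hypothetical helper xml_docs_header below builds the same opening tag but takes the time as an argument, so the result is reproducible; it does not involve Nokogiri.

```ruby
NAMESPACE = 'http://library.stanford.edu/xmlDocs'

# Hypothetical helper: builds the opening <xmlDocs> tag with the same format
# string as convert_rows, for a caller-supplied time and file name.
def xml_docs_header(time, filename)
  time_stamp = time.strftime('%Y-%m-%d %I:%M:%S%p') # 12-hour clock plus AM/PM
  "<xmlDocs xmlns=\"#{NAMESPACE}\" datetime=\"#{time_stamp}\" sourceFile=\"#{filename}\">"
end

header = xml_docs_header(Time.new(2015, 3, 23, 9, 22, 11), 'FitchMLK-v1.xlsx')
puts header
```

Note that the generated tag also carries the xmlns attribute, which the abbreviated example output omits.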
#generate_normalized_mods(output_directory) ⇒ Void
Generates normalized (Stanford) MODS XML, writing output to files.
# File 'lib/modsulator.rb', line 142
def generate_normalized_mods(output_directory)
  # Write one XML file per data row in the input spreadsheet
  rows.each do |row|
    sourceid = row['sourceId']
    output_filename = output_directory + '/' + sourceid + '.xml'

    mods_doc = row_to_xml(row)
    File.open(output_filename, 'w') { |fh| fh.puts(mods_doc.root.to_s) }
  end
end
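The file naming scheme can be illustrated without the gem. This is a sketch under stated assumptions: the rows are invented, and placeholder text is written where the real method writes the normalized MODS root element.

```ruby
require 'tmpdir'

# Hypothetical rows; only the 'sourceId' column matters for naming the files.
rows = [{ 'sourceId' => 'fitch-001' }, { 'sourceId' => 'fitch-002' }]

written = []
Dir.mktmpdir do |output_directory|
  rows.each do |row|
    # Same naming rule as the listing: <output_directory>/<sourceId>.xml
    output_filename = output_directory + '/' + row['sourceId'] + '.xml'

    # Placeholder content standing in for mods_doc.root.to_s
    File.open(output_filename, 'w') { |fh| fh.puts("<mods><!-- #{row['sourceId']} --></mods>") }
    written << File.basename(output_filename)
  end
end

puts written
```

Since the name is built by string concatenation, a row without a sourceId value would raise a TypeError at that line, so every row needs that column filled in.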
#generate_xml(metadata_row) ⇒ String
Generates an XML string for a given row in a spreadsheet.
# File 'lib/modsulator.rb', line 105
def generate_xml(metadata_row)
  manifest_row = metadata_row

  # XML escape all of the entries in the manifest row so they won't break the XML. This will turn any '<' into &lt;
  # and international characters into their corresponding code point etc.
  manifest_row.each do |k, v|
    next unless v
    v = transform_whitespace_markup(v) if v.instance_of?(String) && has_whitespace_markup?(v)
    manifest_row[k] = Nokogiri::XML::Text.new(v.to_s, Nokogiri::XML('')).to_s
  end

  # Enable access with symbol or string keys
  manifest_row = manifest_row.with_indifferent_access

  # Run the XML template through ERB. This creates a new ERB object from the template XML,
  # NOT creating a separate thread, and omitting newlines for lines ending with '%>'
  template = ERB.new(template_xml, nil, '>')

  # ERB.result() actually computes the template. This just passes the top level binding.
  # NOTE: the name generated_xml is assumed; the identifier is missing from the extracted listing.
  generated_xml = template.result(binding)

  # The manifest_row is a hash, with column names as the key.
  # In the template, as a convenience we allow users to put specific column placeholders inside
  # double brackets: "blah [[column_name]] blah".
  # Here we replace those placeholders with the corresponding value
  # from the manifest row.
  manifest_row.each { |k, v| generated_xml.gsub!("[[#{k}]]", v.to_s.strip) }

  generated_xml
end
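The escape, render, and substitute sequence can be sketched with the standard library alone. In this illustration, fill_template is a hypothetical stand-in for generate_xml, ERB::Util.html_escape is swapped in for the Nokogiri text-node escaping, and the keyword form of the ERB trim mode is used; the column name and template are invented.

```ruby
require 'erb'

# Hypothetical stand-in for generate_xml that needs only the standard library.
def fill_template(template_xml, metadata_row)
  # Escape every non-nil value so characters like '<' and '&' cannot break the XML
  # (ERB::Util.html_escape is an assumption replacing the gem's Nokogiri call).
  manifest_row = metadata_row.transform_values { |v| v.nil? ? nil : ERB::Util.html_escape(v.to_s) }

  # trim_mode '>' omits the newline after any line that ends with '%>'.
  # The template can read manifest_row through the binding.
  generated = ERB.new(template_xml, trim_mode: '>').result(binding)

  # Replace [[column_name]] placeholders with the escaped, stripped values.
  manifest_row.each { |k, v| generated.gsub!("[[#{k}]]", v.to_s.strip) }
  generated
end

template = "<% if manifest_row['ti1:title'] %>\n<title>[[ti1:title]]</title>\n<% end %>\n"
xml = fill_template(template, { 'ti1:title' => ' Fish & Chips <fresh> ' })
puts xml
```

The values are escaped before substitution, which is why any markup-to-newline conversion has to happen earlier, while the raw tags are still present.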
#has_whitespace_markup?(str) ⇒ Boolean
Checks whether or not a string contains any <br> or <p> markup.
# File 'lib/modsulator.rb', line 85
def has_whitespace_markup?(str)
  str.match('<br>') || str.match('<br/>') || str.match('<p>') || str.match('<p/>')
end
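The check matches four exact tag spellings. The sketch below copies the listed expression into a standalone method so its behavior on a few invented inputs can be inspected.

```ruby
# Standalone copy of the listed check, for experimentation.
def has_whitespace_markup?(str)
  str.match('<br>') || str.match('<br/>') || str.match('<p>') || str.match('<p/>')
end

plain      = has_whitespace_markup?('One line')
with_break = has_whitespace_markup?('Line one<br>Line two')
spaced     = has_whitespace_markup?('Line one<br />Line two') # space before the slash
uppercase  = has_whitespace_markup?('Line one<BR>Line two')

puts [plain, with_break, spaced, uppercase].map { |r| r ? 'markup' : 'none' }
```

The method returns a MatchData object or nil rather than a strict true or false, which is sufficient for use in a condition. Variants such as <br /> or uppercase tags are not among the four spellings and so are not detected.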
#row_to_xml(row) ⇒ Object
Converts a single data row into a normalized MODS XML document.
# File 'lib/modsulator.rb', line 170
def row_to_xml(row)
  # Generate an XML string, then remove any text carried over from the template
  mods_xml = generate_xml(row)
  mods_xml.gsub!(/\[\[[^\]]+\]\]/, '')

  # Remove empty tags from when e.g. <[[sn1:p2:type]]> does not get filled in when [[sn1:p2:type]] has no value in the source spreadsheet
  mods_xml.gsub!(/<\s[^>]+><\/>/, '')

  mods_xml_doc = Nokogiri::XML(mods_xml)
  normalizer = Stanford::Mods::Normalizer.new
  normalizer.normalize_document(mods_xml_doc.root)

  return mods_xml_doc
end
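The two cleanup substitutions can be followed on a small string. This sketch applies the same regular expressions to an invented template fragment in which neither placeholder received a value; parsing and normalization are left out.

```ruby
# Hypothetical template output where no value was supplied for either placeholder.
mods_xml = +'<subject><[[sn1:p2:type]] authority="lcsh">[[sn1:p2:value]]</[[sn1:p2:type]]></subject>'

# Step 1: drop any [[...]] placeholders that were left unfilled.
mods_xml.gsub!(/\[\[[^\]]+\]\]/, '')
after_placeholders = mods_xml.dup

# Step 2: drop the nameless tag pair that step 1 leaves behind.
mods_xml.gsub!(/<\s[^>]+><\/>/, '')

puts after_placeholders
puts mods_xml
```

Step 1 leaves a tag with attributes but no name, which is not well-formed XML; step 2 removes it before the string is parsed. The second pattern requires whitespace right after the opening angle bracket, so it targets nameless tags that still carry attributes.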
#transform_whitespace_markup(str) ⇒ String
Transforms HTML paragraph and line break markup tags to newline characters. This should be run before escaping any XML characters.
# File 'lib/modsulator.rb', line 96
def transform_whitespace_markup(str)
  str.gsub(/<br\/>/, '\n').gsub(/<br>/, '\n').gsub(/<p>/, '\n\n').gsub(/<p\/>/, '\n\n')
end
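One detail of the listed expression is worth checking by hand: the replacement strings are single-quoted. The sketch below copies the expression and compares it with a hypothetical double-quoted variant; whether a later step converts the result is not shown on this page.

```ruby
# Standalone copy of the listed expression. In a single-quoted Ruby string,
# '\n' is a backslash followed by the letter n, not a newline character.
def transform_whitespace_markup(str)
  str.gsub(/<br\/>/, '\n').gsub(/<br>/, '\n').gsub(/<p>/, '\n\n').gsub(/<p\/>/, '\n\n')
end

listed = transform_whitespace_markup('a<br>b')

# Hypothetical variant with a double-quoted replacement, which inserts a real newline.
variant = 'a<br>b'.gsub(/<br>/, "\n")

puts listed.length
puts variant.length
```

The listed form therefore yields the two-character sequence backslash and n in place of each tag, while the variant yields a single newline character.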
#validate_headers(spreadsheet_headers) ⇒ Array<String>
Checks that all the headers in the spreadsheet have a corresponding entry in the XML template. Returns the headers that have no such entry.
# File 'lib/modsulator.rb', line 159
def validate_headers(spreadsheet_headers)
  spreadsheet_headers.reject do |header|
    header.nil? || header == 'sourceId' || template_xml.include?(header)
  end
end
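The filtering rule can be exercised on its own. In this sketch, unknown_headers is a hypothetical stand-in that takes the template as an argument instead of reading template_xml, and the headers and template are invented.

```ruby
# Hypothetical stand-in for validate_headers with the template passed in.
def unknown_headers(spreadsheet_headers, template_xml)
  spreadsheet_headers.reject do |header|
    # Keep only headers that are present, are not 'sourceId', and do not occur in the template.
    header.nil? || header == 'sourceId' || template_xml.include?(header)
  end
end

template = '<titleInfo><title>[[ti1:title]]</title></titleInfo>'

missing   = unknown_headers([nil, 'sourceId', 'ti1:title', 'ti1:subTitle'], template)
substring = unknown_headers(['title'], template)

puts missing.inspect
puts substring.inspect
```

An empty result means every header was accepted. Because the test is a plain substring search, a header such as title is accepted whenever that text occurs anywhere in the template, not only inside a [[...]] placeholder.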