Class: Modsulator
Overview
The main class for the MODSulator API, which lets you work with metadata spreadsheets and MODS XML.
Constant Summary
- NAMESPACE = 'http://library.stanford.edu/xmlDocs'
We define our own namespace for <xmlDocs>.
Instance Attribute Summary
-
#file ⇒ Object
readonly
Returns the value of attribute file.
-
#rows ⇒ Object
readonly
Returns the value of attribute rows.
-
#template_xml ⇒ Object
readonly
Returns the value of attribute template_xml.
Class Method Summary
-
.get_template_spreadsheet ⇒ String
Returns the template spreadsheet that’s built into this gem.
-
.template_spreadsheet_path ⇒ String
Returns the path to the built-in template spreadsheet, so that modsulator-rails-app can call send_file Modsulator.template_spreadsheet_path, which is more memory efficient than render body: Modsulator.get_template_spreadsheet.
Instance Method Summary
-
#convert_rows ⇒ String
Generates an XML document with one <mods> entry per input row.
-
#generate_normalized_mods(output_directory) ⇒ Void
Generates normalized (Stanford) MODS XML, writing output to files.
-
#generate_xml(metadata_row) ⇒ String
Generates an XML string for a given row in a spreadsheet.
-
#has_whitespace_markup?(str) ⇒ Boolean
Checks whether or not a string contains any <br> or <p> markup.
-
#initialize(file, filename, options = {}) ⇒ Modsulator
constructor
Both a file and a filename are required because the API that uses this class holds the file and its filename separately.
-
#row_to_xml(row) ⇒ Object
Converts a single data row into a normalized MODS XML document.
-
#transform_whitespace_markup(str) ⇒ String
Transforms HTML paragraph and line break markup tags to newline characters.
-
#validate_headers(spreadsheet_headers) ⇒ Array<String>
Checks that all the headers in the spreadsheet have a corresponding entry in the XML template.
Constructor Details
#initialize(file, filename, options = {}) ⇒ Modsulator
Both a file and a filename are required because the API that uses this class holds the file and its filename separately. If :template_string is given, it is used as the template; otherwise :template_file is read. If neither is specified, the gem’s built-in XML template is used.
# File 'lib/modsulator.rb', line 24
def initialize(file, filename, options = {})
  @file = file
  @filename = filename

  @rows = ModsulatorSheet.new(@file, @filename).rows

  if options[:template_string]
    @template_xml = options[:template_string]
  elsif options[:template_file]
    @template_xml = File.read(options[:template_file])
  else
    @template_xml = File.read(File.expand_path('../modsulator/modsulator_template.xml', __FILE__))
  end
end
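The precedence among the template options can be sketched in isolation. The helper below is a hypothetical stand-in (resolve_template is not part of the gem) that mirrors the branch order of the constructor: :template_string wins over :template_file, and the default is used only when neither is given.

```ruby
require 'tempfile'

# Hypothetical helper, not part of Modsulator: mirrors the constructor's
# branch order for choosing the template XML.
def resolve_template(options, default_xml)
  if options[:template_string]
    options[:template_string]
  elsif options[:template_file]
    File.read(options[:template_file])
  else
    default_xml
  end
end

Tempfile.create('template') do |f|
  f.write('<from_file/>')
  f.flush

  # Both options given: the string wins and the file is never read.
  puts resolve_template({ template_string: '<from_string/>', template_file: f.path }, '<built_in/>')
  # Only the file given: its contents are used.
  puts resolve_template({ template_file: f.path }, '<built_in/>')
  # Nothing given: fall back to the default.
  puts resolve_template({}, '<built_in/>')
end
```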
Instance Attribute Details
#file ⇒ Object (readonly)
Returns the value of attribute file.
# File 'lib/modsulator.rb', line 13
def file
  @file
end
#rows ⇒ Object (readonly)
Returns the value of attribute rows.
# File 'lib/modsulator.rb', line 13
def rows
  @rows
end
#template_xml ⇒ Object (readonly)
Returns the value of attribute template_xml.
# File 'lib/modsulator.rb', line 13
def template_xml
  @template_xml
end
Class Method Details
.get_template_spreadsheet ⇒ String
Returns the template spreadsheet that’s built into this gem.
# File 'lib/modsulator.rb', line 181
def get_template_spreadsheet
  IO.read(File.expand_path('../modsulator/modsulator_template.xlsx', __FILE__), mode: 'rb')
end
.template_spreadsheet_path ⇒ String
This can be used by modsulator-rails-app so we can do:
send_file Modsulator.template_spreadsheet_path
which is more memory efficient than:
render body: Modsulator.get_template_spreadsheet
# File 'lib/modsulator.rb', line 191
def template_spreadsheet_path
  File.expand_path('../modsulator/modsulator_template.xlsx', __FILE__)
end
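Both class methods locate the spreadsheet relative to the library's own source file. A minimal sketch of that path idiom follows; the install location is made up for illustration and stands in for __FILE__.

```ruby
# Hypothetical install location, used only for illustration.
source_file = '/opt/gems/modsulator/lib/modsulator.rb'

# File.expand_path treats its second argument as a directory, so the leading
# '..' steps from the source file up to the directory that contains it.
path = File.expand_path('../modsulator/modsulator_template.xlsx', source_file)
puts path
```

Returning this path lets the caller stream the file from disk, whereas get_template_spreadsheet reads the whole spreadsheet into a string first.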
Instance Method Details
#convert_rows ⇒ String
Generates an XML document with one <mods> entry per input row. Example output:
<xmlDocs datetime="2015-03-23 09:22:11AM" sourceFile="FitchMLK-v1.xlsx">
<xmlDoc id="descMetadata" objectId="druid:aa111aa1111">
<mods ... >
:
</mods>
</xmlDoc>
<xmlDoc id="descMetadata" objectId="druid:aa222aa2222">
<mods ... >
:
</mods>
</xmlDoc>
</xmlDocs>
# File 'lib/modsulator.rb', line 55
def convert_rows
  time_stamp = Time.now.strftime('%Y-%m-%d %I:%M:%S%p')
  header = "<xmlDocs xmlns=\"#{NAMESPACE}\" datetime=\"#{time_stamp}\" sourceFile=\"#{@filename}\">"
  full_doc = Nokogiri::XML(header)
  root = full_doc.root

  @rows.each do |row|
    mods_xml_doc = row_to_xml(row)
    sub_doc = full_doc.create_element('xmlDoc', { id: 'descMetadata', objectId: "#{row['druid']}" })
    sub_doc.add_child(mods_xml_doc.root)
    root.add_child(sub_doc)
  end

  full_doc.to_s
end
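The datetime attribute in the example output comes from the strftime pattern used above. The sketch below rebuilds only the opening tag with a fixed time so the format is easy to see; xml_docs_header and XML_DOCS_NS are hypothetical names introduced here, not part of the gem.

```ruby
# Same value as the NAMESPACE constant documented above, under a local name.
XML_DOCS_NS = 'http://library.stanford.edu/xmlDocs'

# Hypothetical helper: builds the opening <xmlDocs> tag the way convert_rows
# does, but accepts the time as an argument so the result is reproducible.
def xml_docs_header(filename, now = Time.now)
  time_stamp = now.strftime('%Y-%m-%d %I:%M:%S%p')
  %(<xmlDocs xmlns="#{XML_DOCS_NS}" datetime="#{time_stamp}" sourceFile="#{filename}">)
end

puts xml_docs_header('FitchMLK-v1.xlsx', Time.new(2015, 3, 23, 9, 22, 11))
```

Because %I is the 12-hour clock, a row converted at 21:22:11 would be stamped 09:22:11PM.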
#generate_normalized_mods(output_directory) ⇒ Void
Generates normalized (Stanford) MODS XML, writing output to files.
# File 'lib/modsulator.rb', line 133
def generate_normalized_mods(output_directory)
  # Write one XML file per data row in the input spreadsheet
  rows.each do |row|
    sourceid = row['sourceId']
    output_filename = output_directory + '/' + sourceid + '.xml'

    mods_doc = row_to_xml(row)
    File.open(output_filename, 'w') { |fh| fh.puts(mods_doc.root.to_s) }
  end
end
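The one-file-per-row pattern can be tried without a spreadsheet. In this sketch, write_rows is a hypothetical stand-in for the loop above, and the block plays the role of row_to_xml by returning the XML string to write for each row.

```ruby
require 'tmpdir'

# Hypothetical stand-in for the per-row loop: writes one <sourceId>.xml file
# per row and returns the paths that were written.
def write_rows(rows, output_directory)
  rows.map do |row|
    output_filename = File.join(output_directory, "#{row['sourceId']}.xml")
    File.open(output_filename, 'w') { |fh| fh.puts(yield(row)) }
    output_filename
  end
end

rows = [{ 'sourceId' => 'item-001' }, { 'sourceId' => 'item-002' }]

Dir.mktmpdir do |dir|
  written = write_rows(rows, dir) { |row| "<mods><!-- #{row['sourceId']} --></mods>" }
  puts written.map { |path| File.basename(path) }
end
```

Note that the method shown above concatenates row['sourceId'] directly, so a row without a sourceId value raises a TypeError there; the output directory must also already exist.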
#generate_xml(metadata_row) ⇒ String
Generates an XML string for a given row in a spreadsheet.
# File 'lib/modsulator.rb', line 96
def generate_xml(metadata_row)
  manifest_row = metadata_row

  # XML escape all of the entries in the manifest row so they won't break the XML. This will turn any '<' into &lt;
  # and international characters into their corresponding code point etc.
  manifest_row.each do |k, v|
    next unless v
    v = transform_whitespace_markup(v) if v.instance_of?(String) && has_whitespace_markup?(v)
    manifest_row[k] = Nokogiri::XML::Text.new(v.to_s, Nokogiri::XML('')).to_s
  end

  # Enable access with symbol or string keys
  manifest_row = manifest_row.with_indifferent_access

  # Run the XML template through ERB. This creates a new ERB object from the template XML,
  # NOT creating a separate thread, and omitting newlines for lines ending with '%>'
  template = ERB.new(template_xml, nil, '>')

  # ERB.result() actually computes the template. This just passes the top level binding.
  generated_xml = template.result(binding)

  # The manifest_row is a hash, with column names as the key.
  # In the template, as a convenience we allow users to put specific column placeholders inside
  # double brackets: "blah [[column_name]] blah".
  # Here we replace those placeholders with the corresponding value
  # from the manifest row.
  manifest_row.each { |k, v| generated_xml.gsub!("[[#{k}]]", v.to_s.strip) }
  generated_xml
end
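The escape, render, substitute sequence can be approximated with the standard library alone. This is a sketch under stated assumptions, not the gem's implementation: fill_template and XML_ESCAPES are hypothetical names, a small gsub table stands in for Nokogiri's text escaping, and ERB is called with the trim_mode: keyword (accepted since Ruby 2.6) rather than the positional form shown above.

```ruby
require 'erb'

# Minimal escaping table standing in for Nokogiri::XML::Text (assumption).
XML_ESCAPES = { '&' => '&amp;', '<' => '&lt;', '>' => '&gt;' }.freeze

# Hypothetical helper: escape each cell, run the template through ERB, then
# replace every [[column]] placeholder with the stripped cell value.
def fill_template(template_xml, row)
  escaped = row.transform_values { |v| v.to_s.gsub(/[&<>]/, XML_ESCAPES) }

  # '>' omits the newline after lines that end with '%>'.
  xml = ERB.new(template_xml, trim_mode: '>').result(binding)

  # Block form, so backslashes in a cell are never read as back-references.
  escaped.each { |k, v| xml.gsub!("[[#{k}]]") { v.strip } }
  xml
end

template = '<mods><title>[[title]]</title><note>[[note]]</note></mods>'
puts fill_template(template, 'title' => ' Fish & Chips ', 'note' => 'a < b')
```

The sample template contains no ERB tags, so the ERB pass leaves it unchanged here; escaping happens before substitution so that cell contents cannot introduce markup into the result.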
#has_whitespace_markup?(str) ⇒ Boolean
Checks whether or not a string contains any <br> or <p> markup.
# File 'lib/modsulator.rb', line 76
def has_whitespace_markup?(str)
  str.match('<br>') || str.match('<br/>') || str.match('<p>') || str.match('<p/>')
end
#row_to_xml(row) ⇒ Object
Converts a single data row into a normalized MODS XML document.
# File 'lib/modsulator.rb', line 161
def row_to_xml(row)
  # Generate an XML string, then remove any text carried over from the template
  mods_xml = generate_xml(row)
  mods_xml.gsub!(/\[\[[^\]]+\]\]/, '')

  # Remove empty tags from when e.g. <[[sn1:p2:type]]> does not get filled in when [[sn1:p2:type]] has no value in the source spreadsheet
  mods_xml.gsub!(/<\s[^>]+><\/>/, '')

  mods_xml_doc = Nokogiri::XML(mods_xml)
  normalizer = Stanford::Mods::Normalizer.new
  normalizer.normalize_document(mods_xml_doc.root)

  return mods_xml_doc
end
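The two cleanup patterns are easiest to follow on a small input. The fragment below is made up for illustration; it assumes a template in which a placeholder supplies an element name, which is the case the second pattern targets.

```ruby
# Made-up fragment: one unfilled value placeholder and one element whose
# name was supposed to come from a placeholder.
fragment = '<mods><title>[[ti1:title]]</title>' \
           '<[[sn1:p2:type]] authority="lcsh"></[[sn1:p2:type]]></mods>'

# Step 1: drop every leftover [[...]] placeholder.
without_placeholders = fragment.gsub(/\[\[[^\]]+\]\]/, '')
puts without_placeholders

# Step 2: drop the nameless tag pairs that step 1 leaves behind.
cleaned = without_placeholders.gsub(/<\s[^>]+><\/>/, '')
puts cleaned
```

The second pattern requires whitespace right after the opening angle bracket, so it only matches nameless tags that still carry attributes.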
#transform_whitespace_markup(str) ⇒ String
Transforms HTML paragraph and line break markup tags to newline characters. This should be run before escaping any XML characters.
# File 'lib/modsulator.rb', line 87
def transform_whitespace_markup(str)
  str.gsub(/<br\/>/, '\n').gsub(/<br>/, '\n').gsub(/<p>/, '\n\n').gsub(/<p\/>/, '\n\n')
end
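The behavior of the two whitespace helpers can be checked with free-function copies of the bodies shown above (whitespace_markup? and transform_markup are names chosen here). One detail worth seeing in action: the replacement strings are single-quoted, so as written the result appears to contain the two-character sequence backslash-n rather than a real line feed. Whether a later step converts that sequence is not shown in this listing.

```ruby
# Copy of the has_whitespace_markup? body; returns MatchData or nil,
# which callers treat as truthy or falsy.
def whitespace_markup?(str)
  str.match('<br>') || str.match('<br/>') || str.match('<p>') || str.match('<p/>')
end

# Copy of the transform_whitespace_markup body, single quotes included.
def transform_markup(str)
  str.gsub(/<br\/>/, '\n').gsub(/<br>/, '\n').gsub(/<p>/, '\n\n').gsub(/<p\/>/, '\n\n')
end

input = 'line one<br/>line two<p>new paragraph'
output = transform_markup(input)

puts output
puts "contains a real line feed? #{output.include?("\n")}"
puts "markup detected in input?  #{!whitespace_markup?(input).nil?}"
```

Only the four exact spellings are recognized: variants such as a br tag with a space before the slash, uppercase tags, or closing p tags pass through unchanged.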
#validate_headers(spreadsheet_headers) ⇒ Array<String>
Checks that all the headers in the spreadsheet have a corresponding entry in the XML template. Returns the headers that do not.
# File 'lib/modsulator.rb', line 150
def validate_headers(spreadsheet_headers)
  spreadsheet_headers.reject do |header|
    header.nil? || header == 'sourceId' || template_xml.include?(header)
  end
end
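A short sketch of what the check returns, using a free-function version (unknown_headers is a hypothetical name) and a made-up one-field template in place of the gem's built-in one:

```ruby
# Hypothetical free-function version of the check; the template is passed in
# rather than read from an instance.
def unknown_headers(spreadsheet_headers, template_xml)
  spreadsheet_headers.reject do |header|
    header.nil? || header == 'sourceId' || template_xml.include?(header)
  end
end

# Made-up template with a single placeholder.
template = '<mods><titleInfo><title>[[ti1:title]]</title></titleInfo></mods>'

# 'ti1:titel' is a deliberate misspelling; nil models an empty header cell.
headers = ['sourceId', 'ti1:title', nil, 'ti1:titel']
p unknown_headers(headers, template)
```

An empty result means every header was found. Because the test is a plain substring match against the whole template, a header such as title would also pass against this template even though no [[title]] placeholder exists.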