Class: Normalizer
- Inherits:
- Object
  - Object
  - Normalizer
- Defined in:
- lib/modsulator/normalizer.rb
Overview
This class provides methods to normalize MODS XML according to the Stanford guidelines.
Constant Summary
- LINEFEED = '&#10;'
Linefeed character entity reference
- LONE_DATE_XPATH = '//mods:originInfo/mods:dateCreated[1][not(following-sibling::*[1][self::mods:dateCreated])] | //mods:originInfo/mods:dateIssued[1][not(following-sibling::*[1][self::mods:dateIssued])]'
Select all single <dateCreated> and <dateIssued> fields
- DATE_CREATED_ISSUED_XPATH = '//mods:dateCreated | //mods:dateIssued'
Select all <dateCreated> and <dateIssued> fields
- MODS_NAMESPACE = 'http://www.loc.gov/mods/v3'
The official MODS namespace, courtesy of the Library of Congress
- LINEFEED_XPATH = '//abstract | //tableOfContents | //note'
Selects <abstract>, <tableOfContents> and <note> when no namespace is present
- LINEFEED_XPATH_NAMESPACED = '//ns:abstract | //ns:tableOfContents | //ns:note'
Selects <abstract>, <tableOfContents> and <note> when a namespace is present
Instance Method Summary
-
#clean_date_values(nodes) ⇒ Void
Sometimes there are spurious decimal digits within the date fields.
-
#clean_linefeeds(node_list) ⇒ Void
Given the root of an XML document, replaces linefeeds inside <tableOfContents>, <abstract> and <note> nodes with &#10;: \n, \r, <br> and <br/> are each replaced by a single &#10;; <p> is replaced by two &#10;; </p> is removed; \r\n is replaced by a single &#10;. Any tags not listed above are removed.
-
#clean_text(s) ⇒ String
Cleans up the text of a node.
-
#exceptional?(node) ⇒ Boolean
Checks if a node has attributes that we make exceptions for.
-
#normalize_document(root) ⇒ Void
Deprecated. Use normalize_mods_document instead.
-
#normalize_mods_document(root) ⇒ Void
Normalizes the given MODS XML document according to the Stanford guidelines.
-
#normalize_xml_string(xml_string) ⇒ String
Normalizes the given XML document string according to the Stanford guidelines.
-
#remove_empty_attributes(node) ⇒ Void
Removes empty attributes from a given node.
-
#remove_empty_nodes(node) ⇒ Void
Removes empty nodes from an XML tree.
-
#substitute_linefeeds(node) ⇒ String
Recursive helper method for #clean_linefeeds to do string substitution.
-
#trim_text(node) ⇒ Void
Removes leading and trailing spaces from a node.
Instance Method Details
#clean_date_values(nodes) ⇒ Void
Sometimes there are spurious decimal digits within the date fields. This method removes a trailing decimal part (a period followed by one or more digits) from the content of <dateCreated> and <dateIssued>.
    # File 'lib/modsulator/normalizer.rb', line 180
    def clean_date_values(nodes)
      nodes.each do |current_node|
        current_node.content = current_node.content.sub(/(.*)\.\d+$/, '\1')
      end
    end
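To make the effect of the substitution concrete, here is a minimal sketch that applies the same regular expression to plain strings instead of XML nodes. The helper name `strip_decimal_suffix` and the sample values are assumptions made for illustration; they are not part of the library.

```ruby
# Hypothetical helper: applies the regex from clean_date_values to a plain string.
def strip_decimal_suffix(value)
  # Drops one trailing ".<digits>" group; any other value is returned unchanged.
  value.sub(/(.*)\.\d+$/, '\1')
end

puts strip_decimal_suffix('1861.0')      # prints 1861
puts strip_decimal_suffix('1861-04-12')  # prints 1861-04-12 (no trailing decimal part)
puts strip_decimal_suffix('ca. 1900')    # prints ca. 1900 (period not followed directly by digits)
```

Only a period immediately followed by digits at the end of the value is affected, so ordinary punctuation inside a date string is left alone.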
#clean_linefeeds(node_list) ⇒ Void
Given the root of an XML document, replaces linefeed characters inside <tableOfContents>, <abstract> and <note> XML nodes with &#10;.
-
\n, \r, <br> and <br/> are each replaced by a single &#10;.
-
<p> is replaced by two &#10;.
-
</p> is removed.
-
\r\n is replaced by a single &#10;.
Any tags not listed above are removed. MODS 3.5 does not allow for anything other than text inside these three nodes.
    # File 'lib/modsulator/normalizer.rb', line 91
    def clean_linefeeds(node_list)
      node_list.each do |current_node|
        new_text = substitute_linefeeds(current_node)
        current_node.children.remove
        current_node.content = new_text
      end
    end
#clean_text(s) ⇒ String
Cleans up the text of a node:
-
Removes extra whitespace at the beginning and end.
-
Collapses any run of consecutive whitespace within the string into a single space.
    # File 'lib/modsulator/normalizer.rb', line 107
    def clean_text(s)
      return nil if s.nil? || s == ''
      # Non-bang methods are used because gsub! and strip! return nil when they change nothing.
      s.gsub(/\s+/, ' ').strip
    end
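One point worth keeping in mind for this kind of cleanup: Ruby's String#gsub! and String#strip! return nil when they leave the string unchanged, so chaining them is fragile. The sketch below shows that behaviour next to a non-mutating variant. The name `squeeze_whitespace` is hypothetical and used only for this illustration.

```ruby
# String#strip! and String#gsub! return nil when the receiver is not modified.
p 'already clean'.dup.strip!         # => nil
p 'nospaces'.dup.gsub!(/\s+/, ' ')   # => nil

# Hypothetical non-mutating variant: returns a String for any non-empty input.
def squeeze_whitespace(s)
  return nil if s.nil? || s.empty?
  s.gsub(/\s+/, ' ').strip
end

p squeeze_whitespace("  two   words \n")   # => "two words"
```

The non-mutating form also leaves the caller's string untouched, which is usually the less surprising choice for a method that returns its result.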
#exceptional?(node) ⇒ Boolean
Checks if a node has attributes that we make exceptions for. There are two such exceptions.
-
A “collection” attribute with the value “yes” on a typeOfResource tag.
-
A “manuscript” attribute with the value “yes” on a typeOfResource tag.
Nodes that fall under either of these exceptions should not be deleted, even if they have no content.
    # File 'lib/modsulator/normalizer.rb', line 36
    def exceptional?(node)
      return false unless node != nil

      tag = node.name
      attributes = node.attributes

      return false if(attributes.empty?)

      attributes.each do |key, value|
        if(tag == 'typeOfResource')
          # Note that according to the MODS schema, any other value than 'yes' for these attributes is invalid
          if((key == 'collection' && value.to_s.downcase == 'yes') ||
             (key == 'manuscript' && value.to_s.downcase == 'yes'))
            return true
          end
        end
      end
      return false
    end
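The rule can be tried out without an XML parser. The following sketch restates the check over a tag name and a plain Hash of attributes; `keep_when_empty?` and `KEPT_ATTRIBUTES` are hypothetical names introduced here, and a real Nokogiri node exposes its attributes differently.

```ruby
# Assumed stand-in for the two exempt attributes described above.
KEPT_ATTRIBUTES = %w[collection manuscript].freeze

# Hypothetical helper: true when an empty element should still be kept.
def keep_when_empty?(tag, attributes)
  return false unless tag == 'typeOfResource'

  attributes.any? do |key, value|
    KEPT_ATTRIBUTES.include?(key) && value.to_s.downcase == 'yes'
  end
end

p keep_when_empty?('typeOfResource', 'collection' => 'yes')   # => true
p keep_when_empty?('typeOfResource', 'manuscript' => 'Yes')   # => true (comparison ignores case)
p keep_when_empty?('typeOfResource', 'usage' => 'primary')    # => false
p keep_when_empty?('note', 'collection' => 'yes')             # => false (wrong element)
```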
#normalize_document(root) ⇒ Void
Deprecated. Use normalize_mods_document instead.
Normalizes the given MODS XML document according to the Stanford guidelines.
    # File 'lib/modsulator/normalizer.rb', line 210
    def normalize_document(root)
      normalize_mods_document(root)
    end
#normalize_mods_document(root) ⇒ Void
Normalizes the given MODS XML document according to the Stanford guidelines.
    # File 'lib/modsulator/normalizer.rb', line 190
    def normalize_mods_document(root)
      node_list = []
      if(root.namespace.nil?)
        node_list = root.xpath(LINEFEED_XPATH)
      else
        node_list = root.xpath(LINEFEED_XPATH_NAMESPACED, 'ns' => root.namespace.href)
      end
      clean_linefeeds(node_list) # Do this before deleting <br> and <p> with remove_empty_nodes()

      remove_empty_attributes(root)
      remove_empty_nodes(root)
      trim_text(root)
      clean_date_values(root.xpath(DATE_CREATED_ISSUED_XPATH, 'mods' => MODS_NAMESPACE))
    end
#normalize_xml_string(xml_string) ⇒ String
Normalizes the given XML document string according to the Stanford guidelines.
    # File 'lib/modsulator/normalizer.rb', line 219
    def normalize_xml_string(xml_string)
      doc = Nokogiri::XML(xml_string)
      normalize_document(doc.root)
      doc.to_s
    end
#remove_empty_attributes(node) ⇒ Void
Removes empty attributes from a given node.
    # File 'lib/modsulator/normalizer.rb', line 117
    def remove_empty_attributes(node)
      children = node.children
      attributes = node.attributes

      attributes.each do |key, value|
        node.remove_attribute(key) if(value.to_s.strip.empty?)
      end

      children.each do |c|
        remove_empty_attributes(c)
      end
    end
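The emptiness test is the part that is easy to misread: an attribute counts as empty when its value is blank after stripping whitespace, not only when it is the empty string. A small sketch over a plain Hash, with `drop_blank_attributes` as a hypothetical helper name:

```ruby
# Hypothetical helper: returns a copy of the attribute Hash without blank values.
def drop_blank_attributes(attributes)
  attributes.reject { |_name, value| value.to_s.strip.empty? }
end

cleaned = drop_blank_attributes('type' => 'personal', 'authority' => '', 'lang' => '  ')
p cleaned.keys   # => ["type"]
```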
#remove_empty_nodes(node) ⇒ Void
Removes empty nodes from an XML tree. See #exceptional? for nodes that are kept even if empty.
    # File 'lib/modsulator/normalizer.rb', line 135
    def remove_empty_nodes(node)
      children = node.children

      if(node.text?)
        if(node.to_s.strip.empty?)
          node.remove
        else
          return
        end
      elsif(children.length > 0)
        children.each do |c|
          remove_empty_nodes(c)
        end
      end

      if(!exceptional?(node) && (node.children.length == 0))
        node.remove
      end
    end
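Because children are visited before their parent is checked, an element whose children all turn out to be empty is removed as well. The toy sketch below shows that bottom-up pruning on a simple Struct-based tree rather than a Nokogiri document. `Element` and `prune` are hypothetical names, and the sketch deliberately leaves out the #exceptional? exemption.

```ruby
# Toy tree: an element has a name and children; children are Elements or Strings.
Element = Struct.new(:name, :children)

# Hypothetical helper: returns a pruned copy of the node, or nil if nothing is left.
# Children are pruned first, so a parent that held only empty children goes too.
def prune(node)
  return (node.strip.empty? ? nil : node) if node.is_a?(String)

  kept = node.children.map { |child| prune(child) }.compact
  kept.empty? ? nil : Element.new(node.name, kept)
end

tree = Element.new('mods', [
  Element.new('titleInfo', [Element.new('title', ['Map of Oslo'])]),
  Element.new('subject', [Element.new('topic', ["  \n"])]),
  Element.new('note', [])
])

pruned = prune(tree)
p pruned.children.map(&:name)   # => ["titleInfo"]
```

Here <subject> disappears even though it has a child, because that child contains only whitespace.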
#substitute_linefeeds(node) ⇒ String
Recursive helper method for #clean_linefeeds to do string substitution.
    # File 'lib/modsulator/normalizer.rb', line 60
    def substitute_linefeeds(node)
      new_text = String.new

      # If we substitute in '&#10;' by itself, Nokogiri interprets that and then prints '&amp;#10;' when printing the document later. This
      # is an ugly way to add linefeed characters in a way that we at least get well-formatted output in the end.
      if(node.text?)
        new_text = node.content.gsub(/\r\n/, Nokogiri::HTML(LINEFEED).text)
                               .gsub(/\n/, Nokogiri::HTML(LINEFEED).text)
                               .gsub(/\r/, Nokogiri::HTML(LINEFEED).text)
                               .gsub('\\n', Nokogiri::HTML(LINEFEED).text)
      else
        if(node.node_name == 'br')
          new_text += Nokogiri::HTML(LINEFEED).text
        elsif(node.node_name == 'p')
          new_text += Nokogiri::HTML(LINEFEED).text + Nokogiri::HTML(LINEFEED).text
        end

        node.children.each do |c|
          new_text += substitute_linefeeds(c)
        end
      end
      return new_text
    end
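The text-node branch can be illustrated on plain strings. This sketch assumes that the decoded linefeed entity is an ordinary "\n" character, and it folds the four substitutions into a single pass. `normalize_newlines` and `NEWLINE` are hypothetical names, not library API.

```ruby
# Assumption: decoding the linefeed entity yields a plain "\n".
NEWLINE = "\n"

# Hypothetical helper mirroring the text-node branch: CRLF, LF, CR and a
# literal backslash-n sequence each become one newline.
def normalize_newlines(text)
  # \r\n is listed first so that a CRLF pair yields one newline, not two.
  text.gsub(/\r\n|\n|\r|\\n/, NEWLINE)
end

p normalize_newlines("a\r\nb\rc\\nd")   # => "a\nb\nc\nd"
```

The last alternative matches the two-character sequence backslash plus n, which can show up when escaped text has been pasted into a field.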
#trim_text(node) ⇒ Void
Removes leading and trailing spaces from a node.
    # File 'lib/modsulator/normalizer.rb', line 162
    def trim_text(node)
      children = node.children

      if(node.text?)
        node.parent.content = node.text.strip
      else
        children.each do |c|
          trim_text(c)
        end
      end
    end