Module: MetadataHelper

Overview

Helper module for getting keyword-searchable terms from an OpenURL's author and title.

OpenURLs have some commonly agreed upon metadata elements. This module sorts through that metadata and exposes what we need through a simpler interface. The values are specifically constructed from the citation to work well as keyword searches in other services.

Also includes some helpful methods for getting identifiers out in a convenient way, regardless of the non-standard ways they may have been stored.

Methods included from MarcHelper

#add_856_links, #edition_statement, #get_title, #get_years, #gmd_values, #service_type_for_856, #should_skip_856_link?, #strip_gmd

Class Method Details

.title_is_serial?(rft) ⇒ Boolean

Look at weird bad OpenURLs, use heuristics to see if the 'title' probably represents a journal rather than a book. A guess at best, based on the bad data we've seen, sigh.

Returns:

  • (Boolean)

# File 'app/mixin_logic/metadata_helper.rb', line 344

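def title_is_serial?(rft)
  # Not explicitly a book or dissertation, no book title given, and at
  # least one positive sign of a serial (genre, jtitle, or a bare ISSN).
  (rft.format != "book" && rft.format != "dissertation") &&
    rft.metadata["btitle"].blank? &&
    (%w{journal article}.include?(rft.metadata["genre"]) ||
      rft.metadata["jtitle"].present? ||
      (rft.metadata["genre"].blank? && rft.metadata["issn"].present?))
end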
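
To make the heuristic easier to follow, here is a minimal plain-Ruby sketch of the same three conditions. `SerialCheckReferent`, `blank_value?` and `serial_guess?` are hypothetical names invented for this illustration (a real Umlaut Referent and ActiveSupport's `blank?` would be used in practice), so treat it as an approximation rather than the module's actual code.

```ruby
# Hypothetical stand-in for an Umlaut Referent: a format plus a metadata hash.
SerialCheckReferent = Struct.new(:format, :metadata)

# Plain-Ruby stand-in for ActiveSupport's blank?
def blank_value?(value)
  value.nil? || value.to_s.strip.empty?
end

# Same three-part heuristic as title_is_serial?, without Rails helpers.
def serial_guess?(rft)
  md = rft.metadata
  !%w[book dissertation].include?(rft.format) &&
    blank_value?(md["btitle"]) &&
    (%w[journal article].include?(md["genre"]) ||
      !blank_value?(md["jtitle"]) ||
      (blank_value?(md["genre"]) && !blank_value?(md["issn"])))
end

issn_only = SerialCheckReferent.new(nil, { "title" => "Nature", "issn" => "0028-0836" })
book      = SerialCheckReferent.new("book", { "btitle" => "Moby Dick" })
bare      = SerialCheckReferent.new(nil, { "title" => "Moby Dick" })

puts serial_guess?(issn_only) # true: no genre, but an ISSN is present
puts serial_guess?(book)      # false: format is explicitly "book"
puts serial_guess?(bare)      # false: nothing suggests a serial
```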

Instance Method Details

#get_doi(rft) ⇒ Object


# File 'app/mixin_logic/metadata_helper.rb', line 263

def get_doi(rft)
  return get_identifier(:info, "doi", rft)
end

#get_epage(rft) ⇒ Object

Uses `epage` if present; otherwise tries to parse the end page out of `pages`, falling back to the whole `pages` value.


# File 'app/mixin_logic/metadata_helper.rb', line 329

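def get_epage(rft)
  if rft.metadata['epage'].present?
    return rft.metadata['epage']
  elsif rft.metadata['pages'] =~ /\A.*\- *(.*) *\Z/
    # everything after the last hyphen in a range like "123-145"
    return $1
  elsif rft.metadata['pages'].present?
    return rft.metadata['pages']
  else
    return nil
  end
end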

#get_gpo_item_nums(rft) ⇒ Object

Returns an array, possibly empty.


# File 'app/mixin_logic/metadata_helper.rb', line 272

def get_gpo_item_nums(rft)
  # Stored in an info:gpo uri, which is technically illegal but used by OCLC
  ids = get_identifier(:info, "gpo", rft, :multiple => true)
  # Remove the uri part. 
  return ids.collect {|id| id.sub(/^info:gpo\//, '')  }
end

#get_identifier(type, sub_scheme, referent, options = {}) ⇒ Object

oclcnum and lccn are both supposed to be stored as identifiers with an info: uri (info:oclcnum/#, info:lccn/#). But SFX sometimes stores them in the referent metadata instead: rft.lccn, rft.oclcnum.

On the other hand, isbn and issn can legitimately be included in referent metadata or as a urn.

This method will find you an identifier across multiple places.

Parameters:

  • type: :urn or :info
  • sub_scheme: "lccn", "oclcnum", "isbn", "issn", or anything else that could be found in a urn, an info uri, or the referent metadata
  • referent: an Umlaut Referent object
  • options: pass :multiple => true to get an array of all matching identifiers (possibly empty) instead of just the first

Returns nil if no identifier is found, otherwise the bare identifier (not currently formatted into a urn/uri; an option for that should maybe be added).

Raises:

  • (Exception)

# File 'app/mixin_logic/metadata_helper.rb', line 180

def get_identifier(type, sub_scheme, referent, options = {} )
  options[:multiple] ||= false
  
  raise Exception.new("type must be :urn or :info") unless type == :urn or type == :info

  prefix = case type
             when :info then "info:#{sub_scheme}/"
             when :urn  then "urn:#{sub_scheme}:"
           end
  
  bare_identifier = nil
  identifiers = referent.identifiers.collect {|id| $1 if id =~ /^#{prefix}(.*)/}.compact

  if ( identifiers.blank? &&  ['lccn', 'oclcnum', 'isbn', 'issn', 'doi', 'pmid'].include?(sub_scheme) )
    # try the referent metadata
    from_rft = referent.metadata[sub_scheme]
    identifiers = [from_rft] unless from_rft.blank?
  end

  if ( options[:multiple])
    return identifiers
  elsif ( identifiers[0].blank? )
    return nil
  else
    return identifiers[0]
  end        
  
end
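
The lookup order described above (identifiers first, referent metadata as a fallback) can be sketched without Rails as follows. `IdReferent`, `METADATA_FALLBACKS` and `find_identifier` are hypothetical names for this illustration, and the keyword argument stands in for the `:multiple` option; it is a simplified sketch, not the method itself.

```ruby
# Hypothetical stand-in for an Umlaut Referent.
IdReferent = Struct.new(:identifiers, :metadata)

# Sub-schemes that are also looked up in the referent metadata.
METADATA_FALLBACKS = %w[lccn oclcnum isbn issn doi pmid].freeze

def find_identifier(type, sub_scheme, referent, multiple: false)
  raise ArgumentError, "type must be :urn or :info" unless [:urn, :info].include?(type)

  prefix = type == :info ? "info:#{sub_scheme}/" : "urn:#{sub_scheme}:"

  # 1. Look for identifiers carrying the expected prefix and strip it off.
  found = referent.identifiers
                  .select { |id| id.start_with?(prefix) }
                  .map { |id| id[prefix.length..-1] }

  # 2. Nothing found: try the referent metadata for the known sub-schemes.
  if found.empty? && METADATA_FALLBACKS.include?(sub_scheme)
    from_metadata = referent.metadata[sub_scheme].to_s
    found = [from_metadata] unless from_metadata.empty?
  end

  return found if multiple

  first = found.first
  (first.nil? || first.empty?) ? nil : first
end

rft = IdReferent.new(["info:oclcnum/12345", "info:doi/10.1000/182"],
                     { "lccn" => "n78-890351" })

p find_identifier(:info, "oclcnum", rft)             # "12345"
p find_identifier(:info, "lccn", rft)                # "n78-890351" (from metadata)
p find_identifier(:urn, "isbn", rft)                 # nil
p find_identifier(:info, "gpo", rft, multiple: true) # []
```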

#get_isbn(rft) ⇒ Object

Gets the ISBN, and also strips anything that is sometimes included with the 'isbn' value but is not part of the ISBN, such as '(paperback)'.


# File 'app/mixin_logic/metadata_helper.rb', line 252

def get_isbn(rft)
  isbn = get_identifier(:urn, "isbn", rft)
  isbn = isbn.gsub(/[^\dX]/, '') if isbn
  return nil if isbn.blank?
  return isbn
end
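
As a rough illustration of the cleanup step, the sketch below applies the same `gsub` to a few sample strings. `clean_isbn` is a hypothetical standalone helper, and the sample values are made up. Note that the pattern keeps every digit and uppercase X, so digits in trailing text (a volume number, say) would survive as well.

```ruby
# Hypothetical standalone version of the cleanup done in get_isbn.
def clean_isbn(raw)
  return nil if raw.nil?
  cleaned = raw.gsub(/[^\dX]/, '') # keep only digits and uppercase X
  cleaned.empty? ? nil : cleaned
end

p clean_isbn("0-13-110362-8 (paperback)") # "0131103628"
p clean_isbn("080442957X (pbk.)")         # "080442957X"
p clean_isbn("n/a")                       # nil
```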

#get_issn(rft) ⇒ Object

Gets an ISSN and checks that it looks like one (four digits, an optional hyphen, three digits, then a digit or X); otherwise returns nil. So it will return an ISSN-shaped string (NOT the empty string) or nil.


# File 'app/mixin_logic/metadata_helper.rb', line 221

def get_issn(rft)
  issn = rft.metadata['issn']
  issn = nil unless issn =~ /\d{4}(-)?\d{3}(\d|X)/
  return issn
end
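
A small sketch of what the pattern accepts, using a hypothetical `issn_or_nil` helper on plain strings. Because the pattern is not anchored, a value with surrounding text also passes and is returned unchanged; whether that matters depends on the data you see.

```ruby
# Same pattern as in get_issn.
ISSN_PATTERN = /\d{4}(-)?\d{3}(\d|X)/

# Hypothetical standalone version: returns the value if it contains an
# ISSN-shaped run, otherwise nil.
def issn_or_nil(value)
  ISSN_PATTERN.match?(value.to_s) ? value : nil
end

p issn_or_nil("0028-0836")      # "0028-0836"
p issn_or_nil("1234567X")       # "1234567X"
p issn_or_nil("ISSN 0028-0836") # "ISSN 0028-0836" (pattern is unanchored)
p issn_or_nil("n/a")            # nil
p issn_or_nil(nil)              # nil
```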

#get_lccn(rft) ⇒ Object

Finds and normalizes an LCCN. If multiple LCCNs are in the record, returns the first one.


# File 'app/mixin_logic/metadata_helper.rb', line 211

def get_lccn(rft)
  lccn = get_identifier(:info, "lccn", rft)
  
  lccn = normalize_lccn(lccn)
  
  return lccn
end

#get_month(rft) ⇒ Object


# File 'app/mixin_logic/metadata_helper.rb', line 304

def get_month(rft)
  if rft.metadata['date'] =~ /\d\d\d\d\-(\d\d?)/
    # month digits following a four-digit year, e.g. "2009-03-15"
    return $1
  elsif rft.metadata['month']
    # some link generators use an illegal 'month' parameter
    return rft.metadata['month']
  else
    return nil
  end
end
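
For illustration, here is the same precedence (month parsed from `date` first, then the non-standard `month` key) applied to plain hashes. `month_for` is a hypothetical helper, and the hashes stand in for a referent's metadata.

```ruby
# Hypothetical standalone version of the get_month logic.
def month_for(metadata)
  # String#[] with a capture index returns the captured month or nil.
  metadata["date"].to_s[/\d{4}-(\d\d?)/, 1] || metadata["month"]
end

p month_for({ "date" => "2009-03-15" })             # "03"
p month_for({ "date" => "2009-3" })                 # "3"
p month_for({ "date" => "2009", "month" => "Mar" }) # "Mar"
p month_for({ "date" => "2009" })                   # nil
```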

#get_oclcnum(rft) ⇒ Object


# File 'app/mixin_logic/metadata_helper.rb', line 259

def get_oclcnum(rft)
  return get_identifier(:info, "oclcnum", rft)    
end

#get_pmid(rft) ⇒ Object


# File 'app/mixin_logic/metadata_helper.rb', line 267

def get_pmid(rft)
  return get_identifier(:info, "pmid", rft)
end

#get_search_creator(rft) ⇒ Object

Chooses the best available creator for searching: `aulast` if present, otherwise `au`. Returns nil if neither is available.


# File 'app/mixin_logic/metadata_helper.rb', line 135

def get_search_creator(rft)
  # Just make one call to create metadata hash
  metadata = rft.metadata
  # Identify dc.creator query. Prefer aulast alone if available.
  creator = nil
  
  creator = metadata['aulast'] unless metadata['aulast'].blank?
  creator = metadata['au'] if creator.blank?
  # FIXME if capital letters are next to each other should we insert a space?
  #   Should we assume capitals next to each other are initials?
  #   Maybe only if we use au? 
  #   Logic like this makes refactoring to use Referent.to_citation less useful.
  
  # FIXME strip out commas from creator if we use au?

  return nil if creator.blank?
  
  return creator
end

#get_search_terms(rft) ⇒ Object

DEPRECATED: not flexible enough; you really need to custom-fit the search terms for your given target. Accepts a referent and returns a hash of common metadata elements, choosing the element appropriate for the format and the best one available for searching. A wrapper around the other methods.


# File 'app/mixin_logic/metadata_helper.rb', line 18

def get_search_terms(rft)
  title = get_search_title(rft)
  creator = get_search_creator(rft)    
  
  # returns a hash of values so that more keys can be added
  # and not break services that use this module
  return {:title => title, :creator => creator}
end

#get_search_title(rft, options = {}) ⇒ Object

Chooses the best available title for the format (see #raw_search_title) and normalizes it (see #normalize_title).


# File 'app/mixin_logic/metadata_helper.rb', line 121

def get_search_title(rft, options = {})
  #defaults
  options = {:remove_all_parens => true,
             :subtitle_on_semicolon => true,
             :remove_subtitle => true,
             :remove_punctuation => true}.merge(options)

  title = raw_search_title(rft)
  
  return normalize_title(title, options)
  
end

#get_spage(rft) ⇒ Object

Uses `spage` if present; otherwise tries to parse the start page out of `pages`, falling back to the whole `pages` value.


# File 'app/mixin_logic/metadata_helper.rb', line 316

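def get_spage(rft)
  if rft.metadata['spage'].present?
    return rft.metadata['spage']
  elsif rft.metadata['pages'] =~ /\A *(.*?) *\-.*\Z/
    # everything before the first hyphen in a range like "123-145"
    return $1
  elsif rft.metadata['pages'].present?
    return rft.metadata['pages']
  else
    return nil
  end
end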
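
The two page-range patterns used by #get_spage and #get_epage can be tried on plain strings like this. `split_pages` is a hypothetical helper for illustration only; the real methods additionally fall back to the whole `pages` value when there is no hyphen.

```ruby
# Patterns copied from get_spage and get_epage.
SPAGE_PATTERN = /\A *(.*?) *\-.*\Z/ # lazily captures up to the first hyphen
EPAGE_PATTERN = /\A.*\- *(.*) *\Z/  # greedily skips to the last hyphen

# Hypothetical helper: returns [start_page, end_page], nil where unparseable.
def split_pages(pages)
  [pages[SPAGE_PATTERN, 1], pages[EPAGE_PATTERN, 1]]
end

p split_pages("123 - 145") # ["123", "145"]
p split_pages("S12-S19")   # ["S12", "S19"]
p split_pages("42")        # [nil, nil]
```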

#get_sudoc(rft) ⇒ Object


# File 'app/mixin_logic/metadata_helper.rb', line 279

def get_sudoc(rft)
  # Don't forget to unescape the sudoc that was escaped to make it a uri!
  
  # Option 1: In a technically illegal but oh well info:sudoc uri
  
  sudoc = get_identifier(:info, "sudoc", rft)
  sudoc = CGI.unescape(sudoc) if sudoc

  # Option 2: rsinger's purl for sudoc. http://dilettantes.code4lib.org/2009/03/a-uri-scheme-for-sudocs/    
  unless sudoc
    sudoc = rft.identifiers.collect {|id| $1 if id =~ /^http:\/\/purl.org\/NET\/sudoc\/(.*)$/}.compact.slice(0)
    sudoc = CGI.unescape(sudoc) if sudoc
  end

  return sudoc
end
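
A minimal sketch of the two lookup options and the unescaping step, working on a plain array of identifier strings. `sudoc_from` is a hypothetical helper and the sample SuDoc numbers are made up for illustration.

```ruby
require "cgi"

# Hypothetical standalone version: prefer an info:sudoc uri, then the purl
# form, and unescape whichever is found.
def sudoc_from(identifiers)
  escaped =
    identifiers.map { |id| id[%r{\Ainfo:sudoc/(.*)}, 1] }.compact.first ||
    identifiers.map { |id| id[%r{\Ahttp://purl\.org/NET/sudoc/(.*)\z}, 1] }.compact.first
  escaped && CGI.unescape(escaped)
end

p sudoc_from(["info:sudoc/Y%204.F%2076%2F2%3AAf%208%2F12"])   # "Y 4.F 76/2:Af 8/12"
p sudoc_from(["http://purl.org/NET/sudoc/I%2019.3%3A1565"])   # "I 19.3:1565"
p sudoc_from(["info:oclcnum/12345"])                          # nil
```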

#get_top_level_creator(rft) ⇒ Object


# File 'app/mixin_logic/metadata_helper.rb', line 155

def get_top_level_creator(rft)
  # If it's a non-journal thing, add the author if we have an aulast (preferred) or au. 
  # But wait--if it's a book _part_, don't include the author name, since
  # it _might_ just be the author of the part, not of the book. 
  unless (rft.format == "journal" ||
            ( rft.format == "book" &&  ! rft.metadata['atitle'].blank?))
    return get_search_creator(rft)
  end
  return nil
end

#get_year(rft) ⇒ Object


# File 'app/mixin_logic/metadata_helper.rb', line 296

def get_year(rft)
  # Some link generators use an illegal 'year' parameter    
  if (date = (rft['date'] || rft['year']))
    return date[0,4]
  end
  return nil
end

#normalize_lccn(lccn) ⇒ Object

Some normalization. See: info-uri.info/registry/OAIHandler?verb=GetRecord&metadataPrefix=reg&identifier=info:lccn/ Doesn't validate right now, only normalizes. TBD: raise an exception if the string is invalid.


# File 'app/mixin_logic/metadata_helper.rb', line 231

def normalize_lccn(lccn)
  if ( lccn )
    # remove whitespace
    lccn = lccn.gsub(/\s/, '')
    # remove any forward slashes and anything after them
    lccn = lccn.sub(/\/.*$/, '')
    # pad anything after a hyphen before removing hyphen, if necessary
    lccn = lccn.sub(/-(.*)/) do |match_str| 
      if $1.length < 6 
        ("0" * (6 - $1.length)) + $1 
      else
        $1
      end
    end
  end
  return lccn
end
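
The three normalization steps (drop whitespace, drop a slash and everything after it, zero-pad the part after the hyphen to six digits and drop the hyphen) can be condensed into the following sketch. `normalize_lccn_sketch` is a hypothetical name, and `rjust` replaces the explicit padding arithmetic; the sample values are illustrative.

```ruby
# Hypothetical compact version of normalize_lccn.
def normalize_lccn_sketch(lccn)
  return nil if lccn.nil?
  lccn.gsub(/\s/, "")                         # remove whitespace
      .sub(%r{/.*$}, "")                      # remove first slash and the rest
      .sub(/-(.*)/) { $1.rjust(6, "0") }      # pad serial to 6 digits, drop hyphen
end

p normalize_lccn_sketch("n78-890351")          # "n78890351"
p normalize_lccn_sketch("n78-89035")           # "n78089035"
p normalize_lccn_sketch("85-2 ")               # "85000002"
p normalize_lccn_sketch(" 79-139101 /AC/r932") # "79139101"
p normalize_lccn_sketch("2001-000002")         # "2001000002"
```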

#normalize_title(arg_title, options = {}) ⇒ Object

A utility method to 'normalize' a title, for use when trying to match a title from one place against records in another database. It lowercases and removes punctuation, and also strips out a number of other things that may otherwise result in false negatives. Exactly what works best depends on the particular data you are working with, so you need to experiment. MANY options are offered, and the defaults are somewhat sensible. Much of this logic specifically accounts for titles that may have been generated from MARC. Will never return the empty string; will sometimes return nil.


# File 'app/mixin_logic/metadata_helper.rb', line 38

def normalize_title(arg_title, options = {})
  # Defaults, overridden by any caller-supplied options (including an
  # explicit false, which `||= true` would silently discard).
  options = {:rstrip_parens => true,
             :remove_all_parens => true,
             :strip_gmd => true,
             :subtitle_on_semicolon => false,
             :remove_subtitle => false,
             :normalize_ampersand => true,
             :remove_punctuation => true,
             # Even if you're removing other punctuation, keep the apostrophes?
             :keep_apostrophes => false}.merge(options)
  
  return nil if arg_title.nil?
  title = arg_title.clone
  
  return nil if title.blank?

  # Sometimes titles given in the OpenURL have some additional stuff
  # in parens at the end, that messes up the search and isn't really
  # part of the title. Eliminate!
  title.gsub!(/\([^)]*\)\s*$/, '') if options[:rstrip_parens]
  # Or, not even just at the end, but anywhere! 
  title.gsub!(/\([^)]*\)/, '') if options[:remove_all_parens]

  # Remove things in brackets, part of an AACR2 GMD that's made it in.
  # replace with ':' so we can keep track of the fact that everything
  # that came afterwards was a sub-title like thing. 
  title = strip_gmd(title) if options[:strip_gmd]
  
  # There seems to be some cataloging/metadata disagreement about when to
  # use ';' for a subtitle instead of ':'. Normalize to ':'.
  title.sub!(/[\;]/, ':') if options[:subtitle_on_semicolon]

  title.sub!(/\:(.*)$/, '') if options[:remove_subtitle]
  
  # Change ampersands to 'and' for consistency, we see it both ways.
  title.gsub!(/\&/, ' and ') if options[:normalize_ampersand]
    
  # remove non-alphanumeric, excluding apostrophe
  title.gsub!(/[^[[:alnum:]][[:space:]]\']/, ' ') if options[:remove_punctuation]

  # apostrophe not to space, just eat it.
  title.gsub!(/[\']/, '') if options[:remove_punctuation] && ! options[:keep_apostrophes]

  # compress whitespace
  title.strip!
  title.gsub!(/\s+/, ' ')

  title.downcase!
  
  title = nil if title.blank?

  return title
end
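
To show the order of the steps on real strings, here is a reduced plain-Ruby sketch with the options fixed to the values #get_search_title passes in (subtitle removed, semicolon treated as a subtitle marker). `sketch_normalize_title` is a hypothetical name, and the GMD stripping is left out because `strip_gmd` lives in MarcHelper, so this is an approximation of the pipeline rather than a replacement for it.

```ruby
# Hypothetical reduced version of normalize_title with fixed options.
def sketch_normalize_title(title)
  return nil if title.nil?
  t = title.dup
  t.gsub!(/\([^)]*\)/, '')                  # drop parenthesized text anywhere
  t.sub!(/;/, ':')                          # first ';' becomes a subtitle marker
  t.sub!(/:(.*)$/, '')                      # drop the subtitle
  t.gsub!(/&/, ' and ')                     # normalize ampersands
  t.gsub!(/[^[:alnum:][:space:]']/, ' ')    # other punctuation becomes a space
  t.gsub!(/'/, '')                          # apostrophes are simply eaten
  t = t.strip.gsub(/\s+/, ' ').downcase     # compress whitespace, lowercase
  t.empty? ? nil : t
end

p sketch_normalize_title("The Hitchhiker's Guide (Paperback): A Trilogy") # "the hitchhikers guide"
p sketch_normalize_title("War & Peace")                                   # "war and peace"
p sketch_normalize_title("Gone; with the wind")                           # "gone"
p sketch_normalize_title("(2nd ed.)")                                     # nil
```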

#raw_search_title(rft) ⇒ Object

Picks the title out of the OpenURL referent from the best element available, with no normalization.


# File 'app/mixin_logic/metadata_helper.rb', line 95

def raw_search_title(rft)
  # Just make one call to create metadata hash
  metadata = rft.metadata
  title = nil
  if rft.format == 'journal' && metadata['atitle']
    title = metadata['atitle']
  elsif rft.format == 'book'
    title = metadata['btitle'] unless metadata['btitle'].blank?
    title = metadata['title'] if title.blank?
    
  # Well, if we don't know the format and we do have a title use that.  
  # This might happen if we only have an ISBN to start and then enhance.
  # So should services like Amazon also enhance with a format, should
  # we simplify this method to not worry about format so much, or do we
  # keep this as is?
  elsif metadata['btitle']
    title = metadata['btitle']
  elsif metadata['title']
    title = metadata['title']
  elsif metadata['jtitle']
    title = metadata['jtitle']
  end
  return title
end