Class: Stanford::Mods::DateParsing

Inherits:
Object
Defined in:
lib/stanford-mods/date_parsing.rb

Overview

Parsing date strings.

TODO: this should become its own gem and/or be eclipsed by or merged with the timetwister gem.

When this is "gemified":
  - we may want an integer or date sort field as well as lexical
  - we could add methods like my_date.bc?

Constant Summary

BRACKETS_BETWEEN_DIGITS_REXEXP =
Regexp.new('\d[' + Regexp.escape('[]') + ']\d')
CENTURY_WORD_REGEXP =
Regexp.new('(\d{1,2}).*century')
CENTURY_4CHAR_REGEXP =
Regexp.new('(\d{1,2})[u\-]{2}')
BC_REGEX =
Regexp.new('(\d{1,4}).*' + Regexp.escape('B.C.'))
EARLY_NUMERIC =
Regexp.new('^\-?\d{1,3}$')
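
To see what each pattern accepts, here is a minimal standalone sketch. The pattern strings are copied from the summary above; the `PATTERN_SKETCHES` hash and the `matching_pattern_names` helper are hypothetical names used only for illustration and are not part of the gem.

```ruby
# Standalone sketch: pattern strings copied from the constant summary above,
# stored in a hypothetical hash rather than loaded from the gem.
PATTERN_SKETCHES = {
  brackets:      Regexp.new('\d[' + Regexp.escape('[]') + ']\d'),
  century_word:  Regexp.new('(\d{1,2}).*century'),
  century_4char: Regexp.new('(\d{1,2})[u\-]{2}'),
  bc:            Regexp.new('(\d{1,4}).*' + Regexp.escape('B.C.')),
  early_numeric: Regexp.new('^\-?\d{1,3}$')
}.freeze

# Returns the names of all sketch patterns that match the given string.
def matching_pattern_names(date_str)
  PATTERN_SKETCHES.select { |_name, regexp| date_str.match(regexp) }.keys
end

matching_pattern_names('169[5]')       # => [:brackets]
matching_pattern_names('17th century') # => [:century_word]
matching_pattern_names('17uu')         # => [:century_4char]
matching_pattern_names('300 B.C.')     # => [:bc]
matching_pattern_names('-17')          # => [:early_numeric]
matching_pattern_names('1865')         # => []
```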

Constructor Details

#initialize(date_str) ⇒ DateParsing

Returns a new instance of DateParsing.



# File 'lib/stanford-mods/date_parsing.rb', line 52

def initialize(date_str)
  @orig_date_str = date_str
  @orig_date_str.freeze
end

Instance Attribute Details

#orig_date_str ⇒ Object (readonly)

Returns the value of attribute orig_date_str.



# File 'lib/stanford-mods/date_parsing.rb', line 50

def orig_date_str
  @orig_date_str
end

Class Method Details

.facet_string_from_date_str(date_str) ⇒ String?

get single facet value for date, generally an explicit year or “17th century” or “5 B.C.”

returns '845', not 0845

Parameters:

  • date_str (String)

    String containing a date (we hope)

Returns:

  • (String, nil)

    String facet value for year if we could parse one, nil otherwise



# File 'lib/stanford-mods/date_parsing.rb', line 14

def self.facet_string_from_date_str(date_str)
  DateParsing.new(date_str).facet_string_from_date_str
end

.sortable_year_string_from_date_str(date_str) ⇒ String?

get a sortable year as a String if we can parse a year from date_str.

SearchWorks currently uses a string field for pub date sorting; thus so does Spotlight.
The values returned must *lexically* sort in chronological order, so the B.C. dates are tricky.

Parameters:

  • date_str (String)

    String containing a date (we hope)

Returns:

  • (String, nil)

    String sortable year if we could parse one, nil otherwise. Note that these values must sort lexically to produce a chronological sort.



# File 'lib/stanford-mods/date_parsing.rb', line 31

def self.sortable_year_string_from_date_str(date_str)
  DateParsing.new(date_str).sortable_year_string_from_date_str
end

.year_int_from_date_str(date_str) ⇒ Integer?

get year as Integer if we can parse date_str to get a year.

Parameters:

  • date_str (String)

    String containing a date (we hope)

Returns:

  • (Integer, nil)

    Integer year if we could parse one, nil otherwise



# File 'lib/stanford-mods/date_parsing.rb', line 21

def self.year_int_from_date_str(date_str)
  DateParsing.new(date_str).year_int_from_date_str
end

.year_int_valid?(year) ⇒ Boolean

true if the year is between -999 and (current year + 1)

Returns:

  • (Boolean)

    true if the year is between -999 and (current year + 1); false otherwise, including when year is not an Integer



# File 'lib/stanford-mods/date_parsing.rb', line 45

def self.year_int_valid?(year)
  return false unless year.is_a? Integer
  (-1000 < year.to_i) && (year < Date.today.year + 2)
end
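
The bounds enforced by the method body can be checked directly. The sketch below copies the comparison into a standalone method; the name `year_int_valid_sketch?` is hypothetical and nothing here is loaded from the gem.

```ruby
require 'date'

# Standalone copy of the documented comparison (hypothetical method name).
def year_int_valid_sketch?(year)
  return false unless year.is_a? Integer
  (-1000 < year) && (year < Date.today.year + 2)
end

year_int_valid_sketch?(-999)                # => true  (lowest accepted year)
year_int_valid_sketch?(-1000)               # => false (just outside the lower bound)
year_int_valid_sketch?(Date.today.year + 1) # => true  (highest accepted year)
year_int_valid_sketch?(Date.today.year + 2) # => false
year_int_valid_sketch?('1999')              # => false (not an Integer)
```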

.year_str_valid?(year_str) ⇒ Boolean

true if the year is between -999 and (current year + 1)

Parameters:

  • year_str (String)

    String containing a date in format: -yyy, -yy, -y, y, yy, yyy, yyyy

Returns:

  • (Boolean)

    true if the year is between -999 and (current year + 1); false otherwise



# File 'lib/stanford-mods/date_parsing.rb', line 38

def self.year_str_valid?(year_str)
  return false unless year_str && (year_str.match(/^\d{1,4}$/) || year_str.match(/^-\d{1,3}$/))
  (-1000 < year_str.to_i) && (year_str.to_i < Date.today.year + 2)
end
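
The string variant first checks the format and only then the numeric range. A minimal sketch of that two-step check, using a hypothetical standalone method name and sample inputs chosen for illustration:

```ruby
require 'date'

# Standalone copy of the documented check (hypothetical method name).
def year_str_valid_sketch?(year_str)
  # format check: up to four digits, or a minus sign and up to three digits
  return false unless year_str && (year_str.match(/^\d{1,4}$/) || year_str.match(/^-\d{1,3}$/))
  # range check on the integer value
  (-1000 < year_str.to_i) && (year_str.to_i < Date.today.year + 2)
end

year_str_valid_sketch?('0845')         # => true
year_str_valid_sketch?('-300')         # => true
year_str_valid_sketch?('-1000')        # => false (four digits after the minus sign)
year_str_valid_sketch?('9999')         # => false (format is fine, year is too far in the future)
year_str_valid_sketch?('17th century') # => false
year_str_valid_sketch?(nil)            # => false
```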

Instance Method Details

#facet_string_for_bc ⇒ String?

get single facet value for B.C. if we have B.C. pattern

Returns:

  • (String, nil)

    ddd B.C. if ddd B.C. in pattern; nil otherwise



# File 'lib/stanford-mods/date_parsing.rb', line 233

def facet_string_for_bc
  bc_matches = orig_date_str.match(BC_REGEX) if orig_date_str
  bc_matches.to_s if bc_matches
end

#facet_string_for_century ⇒ String?

get single facet value for century (17th century) if we have: yyuu, yy--, yy--? or xxth century pattern

note that these are the only century patterns found in our actual date strings in MODS records

Returns:

  • (String, nil)

    yy(th) Century if orig_date_str matches pattern, nil otherwise; also nil if B.C. in pattern



# File 'lib/stanford-mods/date_parsing.rb', line 198

def facet_string_for_century
  return unless orig_date_str
  return if orig_date_str.match(/B\.C\./)
  century_str_matches = orig_date_str.match(CENTURY_WORD_REGEXP)
  return century_str_matches.to_s if century_str_matches

  century_matches = orig_date_str.match(CENTURY_4CHAR_REGEXP)
  if century_matches
    require 'active_support/core_ext/integer/inflections'
    return "#{($1.to_i + 1).ordinalize} century"
  end
end
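
The two century branches behave differently: a word pattern is returned as matched, while a yyuu or yy-- pattern is converted by adding one to the leading digits. The sketch below restates that logic without ActiveSupport. `ordinal_sketch` is a hypothetical stand-in for `ordinalize`, and all names ending in `_sketch` or `_SKETCH` are assumptions for illustration.

```ruby
CENTURY_WORD_SKETCH  = /(\d{1,2}).*century/
CENTURY_4CHAR_SKETCH = /(\d{1,2})[u\-]{2}/

# Hypothetical stand-in for ActiveSupport's Integer#ordinalize.
def ordinal_sketch(number)
  return "#{number}th" if (11..13).cover?(number % 100)
  suffix = { 1 => 'st', 2 => 'nd', 3 => 'rd' }.fetch(number % 10, 'th')
  "#{number}#{suffix}"
end

# Standalone restatement of the documented logic.
def century_facet_sketch(date_str)
  return unless date_str
  return if date_str.match(/B\.C\./)
  word = date_str.match(CENTURY_WORD_SKETCH)
  return word.to_s if word
  four_char = date_str.match(CENTURY_4CHAR_SKETCH)
  # "16uu" covers the years 1600 to 1699, which is the 17th century
  "#{ordinal_sketch(four_char[1].to_i + 1)} century" if four_char
end

century_facet_sketch('17th century')     # => "17th century"
century_facet_sketch('16uu')             # => "17th century"
century_facet_sketch('20--')             # => "21st century"
century_facet_sketch('2nd century B.C.') # => nil
century_facet_sketch('1865')             # => nil
```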

#facet_string_for_early_numeric ⇒ Object

get single facet value for date String containing yyy, yy, y, -y, -yy, -yyy

negative number strings will be changed to B.C. strings


# File 'lib/stanford-mods/date_parsing.rb', line 266

def facet_string_for_early_numeric
  return unless orig_date_str.match(EARLY_NUMERIC)
  # negative number becomes B.C.
  return orig_date_str[1..-1] + " B.C." if orig_date_str.match(/^\-/)
  # remove leading 0s from early dates
  orig_date_str.to_i.to_s
end
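
A minimal sketch of the conversion described above, written as a standalone method with a hypothetical name so it can be run without the gem:

```ruby
# Standalone restatement of the documented logic (hypothetical method name).
def early_numeric_facet_sketch(date_str)
  return unless date_str.match(/^\-?\d{1,3}$/)
  # negative number becomes B.C.
  return date_str[1..-1] + ' B.C.' if date_str.match(/^\-/)
  # to_i drops leading zeros from early dates
  date_str.to_i.to_s
end

early_numeric_facet_sketch('-5')   # => "5 B.C."
early_numeric_facet_sketch('045')  # => "45"
early_numeric_facet_sketch('1845') # => nil (four digits do not match)
```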

#facet_string_from_date_str ⇒ String?

get single facet value for date, generally an explicit year or “17th century” or “5 B.C.”

Returns:

  • (String, nil)

    String facet value for year if we could parse one, nil otherwise



# File 'lib/stanford-mods/date_parsing.rb', line 61

def facet_string_from_date_str
  return if orig_date_str == '0000-00-00' # shpc collection has these useless dates
  # B.C. first in case there are 4 digits, e.g. 1600 B.C.
  return facet_string_for_bc if orig_date_str.match(BC_REGEX)
  result = sortable_year_for_yyyy_yy_or_decade
  unless result
    # try removing brackets between digits in case we have 169[5] or [18]91
    no_brackets = remove_brackets
    return DateParsing.new(no_brackets).facet_string_from_date_str if no_brackets
  end
  # parsing below this line gives strings that are inappropriate for year_str_valid?
  unless self.class.year_str_valid?(result)
    result = facet_string_for_century
    result ||= facet_string_for_early_numeric
  end
  # remove leading 0s from early dates
  result = result.to_i.to_s if result && result.match(/^\d+$/)
  result
end

#remove_brackets ⇒ Object

removes brackets between digits such as 169[5] or [18]91



# File 'lib/stanford-mods/date_parsing.rb', line 130

def remove_brackets
  orig_date_str.delete('[]') if orig_date_str.match(BRACKETS_BETWEEN_DIGITS_REXEXP)
end
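
A short standalone sketch of the bracket handling. The regular expression is written as a literal equivalent of BRACKETS_BETWEEN_DIGITS_REXEXP, and the method and constant names are hypothetical.

```ruby
# Literal form of the bracket pattern: a digit, one bracket, a digit.
BRACKETS_SKETCH = /\d[\[\]]\d/

# Standalone restatement of the documented logic (hypothetical method name).
def remove_brackets_sketch(date_str)
  date_str.delete('[]') if date_str.match(BRACKETS_SKETCH)
end

remove_brackets_sketch('169[5]') # => "1695"
remove_brackets_sketch('[18]91') # => "1891"
remove_brackets_sketch('[1695]') # => nil (no bracket sits between two digits)
```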

#sortable_year_for_century ⇒ String?

get first year of century (as String) if we have: yyuu, yy--, yy--? or xxth century pattern

note that these are the only century patterns found in our actual date strings in MODS records

Returns:

  • (String, nil)

    yy00 if orig_date_str matches pattern, nil otherwise; also nil if B.C. in pattern



# File 'lib/stanford-mods/date_parsing.rb', line 179

def sortable_year_for_century
  return unless orig_date_str
  return if orig_date_str.match(/B\.C\./)
  century_matches = orig_date_str.match(CENTURY_4CHAR_REGEXP)
  if century_matches
    return $1 + '00' if $1.length == 2
    return '0' + $1 + '00' if $1.length == 1
  end
  century_str_matches = orig_date_str.match(CENTURY_WORD_REGEXP)
  if century_str_matches
    yy = ($1.to_i - 1).to_s
    return yy + '00' if yy.length == 2
    return '0' + yy + '00' if yy.length == 1
  end
end
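
The two patterns map to a first year differently: yyuu keeps its leading digits, while "xxth century" subtracts one first. The sketch below is a condensed restatement that uses `rjust` for the zero padding instead of the explicit length checks; the method name is hypothetical.

```ruby
# Condensed standalone restatement (hypothetical method name).
def century_sort_year_sketch(date_str)
  return unless date_str
  return if date_str.match(/B\.C\./)
  if (four_char = date_str.match(/(\d{1,2})[u\-]{2}/))
    return four_char[1].rjust(2, '0') + '00'
  end
  if (word = date_str.match(/(\d{1,2}).*century/))
    # the 17th century starts in 1600, so subtract one from the ordinal
    return (word[1].to_i - 1).to_s.rjust(2, '0') + '00'
  end
end

century_sort_year_sketch('17uu')         # => "1700"
century_sort_year_sketch('5--')          # => "0500"
century_sort_year_sketch('17th century') # => "1600"
century_sort_year_sketch('9th century')  # => "0800"
century_sort_year_sketch('1865')         # => nil
```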

#sortable_year_for_decade ⇒ String?

get first year of decade (as String) if we have: yyyu, yyy-, yyy? or yyyx pattern

note that these are the only decade patterns found in our actual date strings in MODS records

Returns:

  • (String, nil)

    4 digit year (e.g. 1860, 1950) if orig_date_str matches pattern, nil otherwise



# File 'lib/stanford-mods/date_parsing.rb', line 167

def sortable_year_for_decade
  decade_matches = orig_date_str.match(/\d{3}[u\-?x]/) if orig_date_str
  changed_to_zero = decade_matches.to_s.tr('u\-?x', '0') if decade_matches
  DateParsing.new(changed_to_zero).sortable_year_for_yyyy if changed_to_zero
end
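
A minimal sketch of the decade substitution, assuming the input has no four-digit year (in the class, #sortable_year_for_yyyy is tried before this method). The method name is hypothetical, and the final four-digit lookup is omitted because the substituted match is always three digits plus a zero.

```ruby
# Standalone restatement of the documented substitution (hypothetical name).
def decade_sort_year_sketch(date_str)
  decade = date_str.match(/\d{3}[u\-?x]/) if date_str
  # replace the placeholder character with 0 to get the first year of the decade
  decade.to_s.tr('u\-?x', '0') if decade
end

decade_sort_year_sketch('186u')    # => "1860"
decade_sort_year_sketch('195-')    # => "1950"
decade_sort_year_sketch('186?')    # => "1860"
decade_sort_year_sketch('201x')    # => "2010"
decade_sort_year_sketch('unknown') # => nil
```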

#sortable_year_for_yy ⇒ String?

returns 4 digit year as String if we have a x/x/yy or x-x-yy pattern

note that these are the only 2 digit year patterns found in our actual date strings in MODS records.
We use 20 as the century digits unless the resulting date would be later than today, in which case we go back 100 years.
For example, when run during 2016:
1/1/15  ->  2015
1/1/25  ->  1925

Returns:

  • (String, nil)

    4 digit year (e.g. 2015, 1925) if orig_date_str matches pattern, nil otherwise



# File 'lib/stanford-mods/date_parsing.rb', line 147

def sortable_year_for_yy
  return unless orig_date_str
  slash_matches = orig_date_str.match(/\d{1,2}\/\d{1,2}\/\d{2}/)
  if slash_matches
    date_obj = Date.strptime(orig_date_str, '%m/%d/%y')
  else
    hyphen_matches = orig_date_str.match(/\d{1,2}-\d{1,2}-\d{2}/)
    date_obj = Date.strptime(orig_date_str, '%m-%d-%y') if hyphen_matches
  end
  if date_obj && date_obj > Date.today
    date_obj = Date.new(date_obj.year - 100, date_obj.month, date_obj.mday)
  end
  date_obj.year.to_s if date_obj
rescue ArgumentError
  nil # explicitly want nil if date won't parse
end
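
Because the result depends on the current date, the behaviour is easiest to see with a fixed reference date. The sketch below restates the method with a hypothetical `today` parameter (the original always uses `Date.today`); the method name and the reference date are assumptions for illustration.

```ruby
require 'date'

# Standalone restatement; `today` is a hypothetical parameter added so the
# result does not depend on when the example is run.
def yy_sort_year_sketch(date_str, today = Date.today)
  return unless date_str
  if date_str.match(/\d{1,2}\/\d{1,2}\/\d{2}/)
    date_obj = Date.strptime(date_str, '%m/%d/%y')
  elsif date_str.match(/\d{1,2}-\d{1,2}-\d{2}/)
    date_obj = Date.strptime(date_str, '%m-%d-%y')
  end
  # a date in the future is moved back one century
  if date_obj && date_obj > today
    date_obj = Date.new(date_obj.year - 100, date_obj.month, date_obj.mday)
  end
  date_obj.year.to_s if date_obj
rescue ArgumentError
  nil # the string looked like a date but did not parse
end

yy_sort_year_sketch('1/1/15', Date.new(2016, 6, 1))   # => "2015"
yy_sort_year_sketch('1/1/25', Date.new(2016, 6, 1))   # => "1925"
yy_sort_year_sketch('13/45/15', Date.new(2016, 6, 1)) # => nil
```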

#sortable_year_for_yyyy ⇒ String?

looks for 4 consecutive digits in orig_date_str and returns first occurrence if found

Returns:

  • (String, nil)

    4 digit year (e.g. 1865, 0950) if orig_date_str has yyyy, nil otherwise



# File 'lib/stanford-mods/date_parsing.rb', line 136

def sortable_year_for_yyyy
  matches = orig_date_str.match(/\d{4}/) if orig_date_str
  matches.to_s if matches
end

#sortable_year_for_yyyy_yy_or_decade ⇒ String?

get a sortable year as a String, trying the yyyy, x/x/yy (or x-x-yy) and decade patterns in that order.

Returns:

  • (String, nil)

    String sortable year if we could parse one, nil otherwise. Note that these values must sort lexically to produce a chronological sort.



# File 'lib/stanford-mods/date_parsing.rb', line 121

def sortable_year_for_yyyy_yy_or_decade
  # most date strings have a four digit year
  result = sortable_year_for_yyyy
  result ||= sortable_year_for_yy # 19xx or 20xx
  result ||= sortable_year_for_decade # yyyu, yyy-, yyy? or yyyx
  result
end

#sortable_year_int_for_bc ⇒ Integer?

get Integer sortable value for B.C. if we have B.C. pattern

Returns:

  • (Integer, nil)

    Integer sortable -ddd if B.C. in pattern; nil otherwise



# File 'lib/stanford-mods/date_parsing.rb', line 226

def sortable_year_int_for_bc
  bc_matches = orig_date_str.match(BC_REGEX) if orig_date_str
  "-#{$1}".to_i if bc_matches
end

#sortable_year_int_for_early_numeric ⇒ Integer?

get Integer sortable value from date String containing yyy, yy, y, -y, -yy, -yyy, -yyyy

Returns:

  • (Integer, nil)

    Integer sortable -ddd if orig_date_str matches pattern; nil otherwise



# File 'lib/stanford-mods/date_parsing.rb', line 259

def sortable_year_int_for_early_numeric
  return orig_date_str.to_i if orig_date_str.match(EARLY_NUMERIC)
  orig_date_str.to_i if orig_date_str.match(/^-\d{4}$/)
end

#sortable_year_str_for_bc ⇒ String?

get String sortable value for B.C. if we have B.C. pattern

note that these values must *lexically* sort to create a chronological sort.
We know our data does not contain B.C. dates older than 999 B.C., so we can make them
lexically sort by subtracting 1000 from the year.  So we get:
  -700 for 300 B.C., -750 for 250 B.C., -800 for 200 B.C., -801 for 199 B.C.

Returns:

  • (String, nil)

    String sortable -ddd if B.C. in pattern; nil otherwise



# File 'lib/stanford-mods/date_parsing.rb', line 219

def sortable_year_str_for_bc
  bc_matches = orig_date_str.match(BC_REGEX) if orig_date_str
  ($1.to_i - 1000).to_s if bc_matches
end
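
To illustrate why subtracting 1000 works, the sketch below computes the keys for the four dates mentioned above and sorts them as plain Strings. The constant and method names are hypothetical, and the regular expression is rebuilt from the BC_REGEX definition rather than loaded from the gem.

```ruby
BC_SKETCH = Regexp.new('(\d{1,4}).*' + Regexp.escape('B.C.'))

# Standalone restatement of the documented logic (hypothetical method name).
def bc_sort_str_sketch(date_str)
  bc = date_str.match(BC_SKETCH) if date_str
  (bc[1].to_i - 1000).to_s if bc
end

bc_sort_str_sketch('300 B.C.') # => "-700"
bc_sort_str_sketch('199 B.C.') # => "-801"

# Sorting by the String keys puts the oldest date first.
BC_DATES_SKETCH = ['199 B.C.', '300 B.C.', '200 B.C.', '250 B.C.'].freeze
BC_DATES_SKETCH.sort_by { |date_str| bc_sort_str_sketch(date_str) }
# => ["300 B.C.", "250 B.C.", "200 B.C.", "199 B.C."]
```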

#sortable_year_str_for_early_numeric ⇒ String?

get String sortable value from date String containing yyy, yy, y, -y, -yy, -yyy

note that these values must *lexically* sort to create a chronological sort.
We know our data does not contain negative dates older than -999, so we can make them
lexically sort by subtracting 1000.  So we get:
  -983 for -17, -999 for -1, 0000 for 0, 0001 for 1, 0017 for 17

Returns:

  • (String, nil)

    String sortable -ddd if orig_date_str matches pattern; nil otherwise



# File 'lib/stanford-mods/date_parsing.rb', line 246

def sortable_year_str_for_early_numeric
  return unless orig_date_str.match(EARLY_NUMERIC)
  if orig_date_str.match(/^\-/)
    # negative number becomes x - 1000 for sorting; -005 for -995
    num = orig_date_str[1..-1].to_i - 1000
    return '-' + num.to_s[1..-1].rjust(3, '0')
  else
    return orig_date_str.rjust(4, '0')
  end
end
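
The padding rules can be checked by sorting a mix of negative and positive early years by their String keys. This is a minimal standalone sketch with hypothetical names; the sample years are the ones listed in the description above.

```ruby
# Standalone restatement of the documented logic (hypothetical method name).
def early_numeric_sort_str_sketch(date_str)
  return unless date_str.match(/^\-?\d{1,3}$/)
  if date_str.match(/^\-/)
    # negative number becomes x - 1000 for sorting; -005 for -995
    num = date_str[1..-1].to_i - 1000
    '-' + num.to_s[1..-1].rjust(3, '0')
  else
    date_str.rjust(4, '0')
  end
end

early_numeric_sort_str_sketch('-17') # => "-983"
early_numeric_sort_str_sketch('17')  # => "0017"

# Sorting by the String keys gives chronological order across the sign change.
EARLY_YEARS_SKETCH = ['17', '-1', '0', '-17', '1'].freeze
EARLY_YEARS_SKETCH.sort_by { |year| early_numeric_sort_str_sketch(year) }
# => ["-17", "-1", "0", "1", "17"]
```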

#sortable_year_string_from_date_str ⇒ String?

get a sortable year as a String if we can parse a year from date_str.

SearchWorks currently uses a string field for pub date sorting; thus so does Spotlight.
The values returned must *lexically* sort in chronological order, so the B.C. dates are tricky.

Returns:

  • (String, nil)

    String sortable year if we could parse one, nil otherwise. Note that these values must sort lexically to produce a chronological sort.



# File 'lib/stanford-mods/date_parsing.rb', line 103

def sortable_year_string_from_date_str
  return if orig_date_str == '0000-00-00' # shpc collection has these useless dates
  # B.C. first in case there are 4 digits, e.g. 1600 B.C.
  return sortable_year_str_for_bc if orig_date_str.match(BC_REGEX)
  result = sortable_year_for_yyyy_yy_or_decade
  result ||= sortable_year_for_century
  result ||= sortable_year_str_for_early_numeric
  unless result
    # try removing brackets between digits in case we have 169[5] or [18]91
    no_brackets = remove_brackets
    return DateParsing.new(no_brackets).sortable_year_string_from_date_str if no_brackets
  end
  result if self.class.year_str_valid?(result)
end

#year_int_from_date_str ⇒ Integer?

get Integer year if we can parse date_str to get a year.

Returns:

  • (Integer, nil)

    Integer year if we could parse one, nil otherwise



# File 'lib/stanford-mods/date_parsing.rb', line 83

def year_int_from_date_str
  return if orig_date_str == '0000-00-00' # shpc collection has these useless dates
  # B.C. first in case there are 4 digits, e.g. 1600 B.C.
  return sortable_year_int_for_bc if orig_date_str.match(BC_REGEX)
  result = sortable_year_for_yyyy_yy_or_decade
  result ||= sortable_year_for_century
  result ||= sortable_year_int_for_early_numeric
  unless result
    # try removing brackets between digits in case we have 169[5] or [18]91
    no_brackets = remove_brackets
    return DateParsing.new(no_brackets).year_int_from_date_str if no_brackets
  end
  result.to_i if result && self.class.year_int_valid?(result.to_i)
end

#year_via_ruby_parsing ⇒ String?

NOTE: while Date.parse() works for many dates, #sortable_year_for_yyyy actually works
for nearly all of those cases and a lot more besides.  Trial and error with an extensive
set of test data culled from actual date strings in our MODS records has shown this
method to be unnecessary.

Returns:

  • (String, nil)

    sortable 4 digit year (e.g. 1865, 0950) if orig_date_str is parseable via ruby Date, nil otherwise



# File 'lib/stanford-mods/date_parsing.rb', line 279

def year_via_ruby_parsing
  return unless orig_date_str.match(/\d\d/) # need at least 2 digits
  # need more in string than only 2 digits
  return if orig_date_str.match(/^\d\d$/) || orig_date_str.match(/^\D*\d\d\D*$/)
  return if orig_date_str.match(/\d\s*B\.C\./) # skip B.C. dates (dots escaped so only a literal "B.C." matches)
  date_obj = Date.parse(orig_date_str)
  date_obj.year.to_s
rescue ArgumentError
  nil # explicitly want nil if date won't parse
end