Class: Stanford::Mods::DateParsing
- Inherits: Object
  - Object
    - Stanford::Mods::DateParsing
- Defined in: lib/stanford-mods/date_parsing.rb
Overview
Parses date strings.
TODO: this should become its own gem and/or be eclipsed by or merged with the timetwister gem.
When this is "gemified":
- we may want an integer or date sort field as well as lexical
- we could add methods like my_date.bc?
Constant Summary
- BRACKETS_BETWEEN_DIGITS_REXEXP = Regexp.new('\d[' + Regexp.escape('[]') + ']\d')
- DECADE_4CHAR_REGEXP = Regexp.new('(^|\D)\d{3}[u\-?x]')
- DECADE_S_REGEXP = Regexp.new('\d{3}0\'?s')
- CENTURY_WORD_REGEXP = Regexp.new('(\d{1,2}).*century')
- CENTURY_4CHAR_REGEXP = Regexp.new('(\d{1,2})[u\-]{2}([^u\-]|$)')
- BC_REGEX = Regexp.new('(\d{1,4}).*' + Regexp.escape('B.C.'))
- EARLY_NUMERIC = Regexp.new('^\-?\d{1,3}$')
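To see which kinds of date string each constant is meant to catch, the patterns can be tried against a few samples. This is only a sketch: the regular expressions are copied from the summary above, while the `PATTERNS` hash, its keys, and the `matching_patterns` helper are hypothetical names introduced for illustration and are not part of the class.

```ruby
# Patterns copied from the Constant Summary; the Hash keys are illustrative labels only.
PATTERNS = {
  brackets_between_digits: Regexp.new('\d[' + Regexp.escape('[]') + ']\d'),
  decade_4char: Regexp.new('(^|\D)\d{3}[u\-?x]'),
  decade_s: Regexp.new('\d{3}0\'?s'),
  century_word: Regexp.new('(\d{1,2}).*century'),
  century_4char: Regexp.new('(\d{1,2})[u\-]{2}([^u\-]|$)'),
  bc: Regexp.new('(\d{1,4}).*' + Regexp.escape('B.C.')),
  early_numeric: Regexp.new('^\-?\d{1,3}$')
}.freeze

# Hypothetical helper: labels of every pattern that matches date_str.
def matching_patterns(date_str)
  PATTERNS.select { |_name, regexp| regexp.match?(date_str) }.keys
end

['169[5]', '195u', "1950's", '17th century', '17--', '300 B.C.', '-17', '1950'].each do |sample|
  puts "#{sample.inspect} => #{matching_patterns(sample).inspect}"
end
```

A plain four-digit year such as '1950' matches none of these patterns; the class handles that case separately in #sortable_year_for_yyyy.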
Instance Attribute Summary
- #orig_date_str ⇒ Object (readonly)
  Returns the value of attribute orig_date_str.
Class Method Summary
- .date_str_for_display(date_str) ⇒ String?
  get display value for year, generally an explicit year or “17th century” or “5 B.C.” or “1950s” or ‘845 A.D.’.
- .sortable_year_string_from_date_str(date_str) ⇒ String?
  get a sortable String value for the year if we can parse date_str to get a year.
- .year_int_from_date_str(date_str) ⇒ Integer?
  get year as Integer if we can parse date_str to get a year.
- .year_int_valid?(year) ⇒ Boolean
  true if the year is an Integer between -999 and (current year + 1).
- .year_str_valid?(year_str) ⇒ Boolean
  true if the year is between -999 and (current year + 1).
Instance Method Summary
- #date_str_for_display ⇒ String?
  get display value for year, generally an explicit year or “17th century” or “5 B.C.” or “1950s” or ‘845 A.D.’.
- #display_str_for_bc ⇒ String?
  get display value for B.C. if we have B.C. pattern.
- #display_str_for_century ⇒ String?
  get display value for century (17th century) if we have: yyuu, yy--, yy--? or xxth century pattern. Note that these are the only century patterns found in our actual date strings in MODS records.
- #display_str_for_decade ⇒ String?
  get, e.g. 1950s, if we have: yyyu, yyy-, yyy? or yyyx pattern or yyy0s or yyy0’s.
- #display_str_for_early_numeric ⇒ Object
  get display value for date String containing yyy, yy, y, -y, -yy, -yyy. Negative number strings will be changed to B.C. strings.
- #initialize(date_str) ⇒ DateParsing (constructor)
  A new instance of DateParsing.
- #remove_brackets ⇒ Object
  removes brackets between digits such as 169[5] or [18]91.
- #sortable_year_for_century ⇒ String?
  get first year of century (as String) if we have: yyuu, yy--, yy--? or xxth century pattern. Note that these are the only century patterns found in our actual date strings in MODS records.
- #sortable_year_for_decade ⇒ String?
  get first year of decade (as String) if we have: yyyu, yyy-, yyy? or yyyx pattern. Note that these are the only decade patterns found in our actual date strings in MODS records.
- #sortable_year_for_yy ⇒ String?
  returns 4 digit year as String if we have a x/x/yy or x-x-yy pattern. Note that these are the only 2 digit year patterns found in our actual date strings in MODS records. We use 20 as century digits unless the result is greater than the current year: 1/1/15 -> 2015, 1/1/25 -> 1925.
- #sortable_year_for_yyyy ⇒ String?
  looks for 4 consecutive digits in orig_date_str and returns first occurrence if found.
- #sortable_year_for_yyyy_or_yy ⇒ String?
  get a sortable String value for the year if we can parse date_str to get a year.
- #sortable_year_int_for_bc ⇒ Integer?
  get Integer sortable value for B.C. if we have B.C. pattern.
- #sortable_year_int_for_early_numeric ⇒ Integer?
  get Integer sortable value from date String containing yyy, yy, y, -y, -yy, -yyy, -yyyy.
- #sortable_year_str_for_bc ⇒ String?
  get String sortable value for B.C. if we have B.C. pattern.
- #sortable_year_str_for_early_numeric ⇒ String?
  get String sortable value from date String containing yyy, yy, y, -y, -yy, -yyy. Note that these values must lexically sort to create a chronological sort.
- #sortable_year_string_from_date_str ⇒ String?
  get a sortable String value for the year if we can parse date_str to get a year.
- #year_int_from_date_str ⇒ Integer?
  get Integer year if we can parse date_str to get a year.
- #year_via_ruby_parsing ⇒ String?
  NOTE: while Date.parse() works for many dates, the *sortable_year_for_yyyy actually works for nearly all those cases and a lot more besides.
Constructor Details
#initialize(date_str) ⇒ DateParsing
Returns a new instance of DateParsing.
# File 'lib/stanford-mods/date_parsing.rb', line 47
def initialize(date_str)
  @orig_date_str = date_str
  @orig_date_str.freeze
end
Instance Attribute Details
#orig_date_str ⇒ Object (readonly)
Returns the value of attribute orig_date_str.
# File 'lib/stanford-mods/date_parsing.rb', line 45
def orig_date_str
  @orig_date_str
end
Class Method Details
.date_str_for_display(date_str) ⇒ String?
get display value for year, generally an explicit year or “17th century” or “5 B.C.” or “1950s” or ‘845 A.D.’
# File 'lib/stanford-mods/date_parsing.rb', line 11
def self.date_str_for_display(date_str)
  DateParsing.new(date_str).date_str_for_display
end
.sortable_year_string_from_date_str(date_str) ⇒ String?
get String sortable value year if we can parse date_str to get a year.
SearchWorks currently uses a string field for pub date sorting; thus so does Spotlight.
The values returned must *lexically* sort in chronological order, so the B.C. dates are tricky
# File 'lib/stanford-mods/date_parsing.rb', line 26
def self.sortable_year_string_from_date_str(date_str)
  DateParsing.new(date_str).sortable_year_string_from_date_str
end
.year_int_from_date_str(date_str) ⇒ Integer?
get year as Integer if we can parse date_str to get a year.
# File 'lib/stanford-mods/date_parsing.rb', line 17
def self.year_int_from_date_str(date_str)
  DateParsing.new(date_str).year_int_from_date_str
end
.year_int_valid?(year) ⇒ Boolean
true if the year is an Integer between -999 and (current year + 1)
# File 'lib/stanford-mods/date_parsing.rb', line 40
def self.year_int_valid?(year)
  return false unless year.is_a? Integer
  (-1000 < year.to_i) && (year < Date.today.year + 2)
end
.year_str_valid?(year_str) ⇒ Boolean
true if the year is between -999 and (current year + 1)
# File 'lib/stanford-mods/date_parsing.rb', line 33
def self.year_str_valid?(year_str)
  return false unless year_str && (year_str.match(/^\d{1,4}$/) || year_str.match(/^-\d{1,3}$/))
  (-1000 < year_str.to_i) && (year_str.to_i < Date.today.year + 2)
end
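The boundaries of the two validity checks are easiest to see by probing them. The sketch below restates both checks as stand-alone top-level methods (an assumption made so the example runs without the gem); the logic is taken from the listings above.

```ruby
require 'date'

# Stand-alone restatement of .year_int_valid? for illustration.
def year_int_valid?(year)
  return false unless year.is_a?(Integer)
  (-1000 < year) && (year < Date.today.year + 2)
end

# Stand-alone restatement of .year_str_valid? for illustration.
def year_str_valid?(year_str)
  return false unless year_str && (year_str.match(/^\d{1,4}$/) || year_str.match(/^-\d{1,3}$/))
  (-1000 < year_str.to_i) && (year_str.to_i < Date.today.year + 2)
end

next_year = Date.today.year + 1
puts year_str_valid?('-999')               # lower bound is inclusive
puts year_str_valid?('-1000')              # four digits after the minus sign are rejected
puts year_str_valid?(next_year.to_s)       # next year is still accepted
puts year_str_valid?((next_year + 1).to_s) # two years ahead is not
puts year_int_valid?('1950')               # a String is never a valid Integer year
```

Note that the Integer check rejects anything that is not an Integer outright, so a String such as '1950' has to go through the String check instead.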
Instance Method Details
#date_str_for_display ⇒ String?
get display value for year, generally an explicit year or “17th century” or “5 B.C.” or “1950s” or ‘845 A.D.’
# File 'lib/stanford-mods/date_parsing.rb', line 56
def date_str_for_display
  return if orig_date_str == '0000-00-00' # shpc collection has these useless dates
  # B.C. first in case there are 4 digits, e.g. 1600 B.C.
  return display_str_for_bc if orig_date_str.match(BC_REGEX)
  # decade next in case there are 4 digits, e.g. 1950s
  return display_str_for_decade if orig_date_str.match(DECADE_4CHAR_REGEXP) || orig_date_str.match(DECADE_S_REGEXP)
  result = sortable_year_for_yyyy_or_yy
  unless result
    # try removing brackets between digits in case we have 169[5] or [18]91
    no_brackets = remove_brackets
    return DateParsing.new(no_brackets).date_str_for_display if no_brackets
  end
  # parsing below this line gives string inapprop for year_str_valid?
  unless self.class.year_str_valid?(result)
    result = display_str_for_century
    result ||= display_str_for_early_numeric
  end
  # remove leading 0s from early dates
  result = "#{result.to_i} A.D." if result && result.match(/^0\d+$/)
  result
end
#display_str_for_bc ⇒ String?
get display value for B.C. if we have B.C. pattern
# File 'lib/stanford-mods/date_parsing.rb', line 250
def display_str_for_bc
  bc_matches = orig_date_str.match(BC_REGEX) if orig_date_str
  bc_matches.to_s if bc_matches
end
#display_str_for_century ⇒ String?
get display value for century (17th century) if we have: yyuu, yy--, yy--? or xxth century pattern
note that these are the only century patterns found in our actual date strings in MODS records
# File 'lib/stanford-mods/date_parsing.rb', line 215
def display_str_for_century
  return unless orig_date_str
  return if orig_date_str =~ /B\.C\./
  century_str_matches = orig_date_str.match(CENTURY_WORD_REGEXP)
  return century_str_matches.to_s if century_str_matches

  century_matches = orig_date_str.match(CENTURY_4CHAR_REGEXP)
  if century_matches
    require 'active_support/core_ext/integer/inflections'
    return "#{($1.to_i + 1).ordinalize} century"
  end
end
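The yyuu branch turns the two leading digits into a century number by adding one and appending an ordinal suffix. The sketch below illustrates that arithmetic without ActiveSupport: `ordinal` is a hypothetical stand-in for `Integer#ordinalize`, and `century_display` is an illustrative name, not a method of the class.

```ruby
CENTURY_4CHAR_REGEXP = Regexp.new('(\d{1,2})[u\-]{2}([^u\-]|$)')

# Hypothetical stand-in for ActiveSupport's Integer#ordinalize (positive integers only).
def ordinal(number)
  suffix = if (11..13).cover?(number % 100)
             'th'
           else
             { 1 => 'st', 2 => 'nd', 3 => 'rd' }.fetch(number % 10, 'th')
           end
  "#{number}#{suffix}"
end

# Illustrative helper: "17uu" names the years 1700-1799, i.e. the 18th century.
def century_display(date_str)
  matches = date_str.match(CENTURY_4CHAR_REGEXP)
  "#{ordinal(matches[1].to_i + 1)} century" if matches
end

puts century_display('17uu')  # => 18th century
puts century_display('20--?') # => 21st century
```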
#display_str_for_decade ⇒ String?
get, e.g. 1950s, if we have: yyyu, yyy-, yyy? or yyyx pattern or yyy0s or yyy0’s
note that these are the only decade patterns found in our actual date strings in MODS records
# File 'lib/stanford-mods/date_parsing.rb', line 178
def display_str_for_decade
  decade_matches = orig_date_str.match(DECADE_4CHAR_REGEXP) if orig_date_str
  if decade_matches
    changed_to_zero = decade_matches.to_s.tr('u\-?x', '0') if decade_matches
    zeroth_year = DateParsing.new(changed_to_zero).sortable_year_for_yyyy if changed_to_zero
    return "#{zeroth_year}s" if zeroth_year
  else
    decade_matches = orig_date_str.match(DECADE_S_REGEXP) if orig_date_str
    return decade_matches.to_s.tr("'", '') if decade_matches
  end
end
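As a rough illustration of the decade handling, the placeholder character (u, -, ? or x) is swapped for a zero and an "s" is appended, while an existing yyy0s or yyy0's form only loses its apostrophe. The helper name `decade_display` is hypothetical, and the four-digit lookup is inlined here rather than delegated to #sortable_year_for_yyyy.

```ruby
DECADE_4CHAR_REGEXP = Regexp.new('(^|\D)\d{3}[u\-?x]')
DECADE_S_REGEXP = Regexp.new('\d{3}0\'?s')

# Illustrative helper mirroring the two branches of #display_str_for_decade.
def decade_display(date_str)
  if (matches = date_str.match(DECADE_4CHAR_REGEXP))
    # Replace the placeholder with 0, then pick out the four digits.
    zeroth_year = matches.to_s.tr('u\-?x', '0')[/\d{4}/]
    "#{zeroth_year}s" if zeroth_year
  elsif (matches = date_str.match(DECADE_S_REGEXP))
    matches.to_s.delete("'")
  end
end

puts decade_display('195u')   # => 1950s
puts decade_display("1950's") # => 1950s
```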
#display_str_for_early_numeric ⇒ Object
get display value for date String containing yyy, yy, y, -y, -yy, -yyy.
Negative number strings will be changed to B.C. strings.
Note that there is no year 0. From en.wikipedia.org/wiki/Anno_Domini: “AD counting years from the start of this epoch, and BC denoting years before the start of the era. There is no year zero in this scheme, so the year AD 1 immediately follows the year 1 BC.” See also consul.stanford.edu/display/chimera/MODS+display+rules for etdf
# File 'lib/stanford-mods/date_parsing.rb', line 287
def display_str_for_early_numeric
  return unless orig_date_str.match(EARLY_NUMERIC)
  # return 1 B.C. when the date is 0 since there is no 0 year
  return '1 B.C.' if orig_date_str == '0'
  # negative number becomes B.C.
  return "#{orig_date_str[1..-1].to_i + 1} B.C." if orig_date_str =~ /^\-/
  # remove leading 0s from early dates
  "#{orig_date_str.to_i} A.D."
end
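Because there is no year zero, the numeric string "0" is displayed as 1 B.C. and every negative value is shifted by one. The sketch below restates the listing above as a stand-alone method (the name `early_numeric_display` is hypothetical) to make the off-by-one mapping concrete.

```ruby
EARLY_NUMERIC = Regexp.new('^\-?\d{1,3}$')

# Illustrative stand-alone version of #display_str_for_early_numeric.
def early_numeric_display(date_str)
  return unless date_str.match(EARLY_NUMERIC)
  return '1 B.C.' if date_str == '0'
  # Numeric year -n is shown as (n + 1) B.C., since 0 already stands for 1 B.C.
  return "#{date_str[1..-1].to_i + 1} B.C." if date_str.start_with?('-')
  "#{date_str.to_i} A.D."
end

puts early_numeric_display('0')   # => 1 B.C.
puts early_numeric_display('-17') # => 18 B.C.
puts early_numeric_display('045') # => 45 A.D.
```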
#remove_brackets ⇒ Object
removes brackets between digits such as 169[5] or [18]91
# File 'lib/stanford-mods/date_parsing.rb', line 128
def remove_brackets
  orig_date_str.delete('[]') if orig_date_str.match(BRACKETS_BETWEEN_DIGITS_REXEXP)
end
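Brackets are only stripped when one sits directly between two digits; a fully bracketed year is left alone and the method returns nil. A minimal sketch, with the method restated at top level (taking the string as a parameter is an assumption made for the example):

```ruby
BRACKETS_BETWEEN_DIGITS_REXEXP = Regexp.new('\d[' + Regexp.escape('[]') + ']\d')

# Illustrative stand-alone version of #remove_brackets.
def remove_brackets(date_str)
  date_str.delete('[]') if date_str.match(BRACKETS_BETWEEN_DIGITS_REXEXP)
end

puts remove_brackets('169[5]')         # => 1695
puts remove_brackets('[18]91')         # => 1891
puts remove_brackets('[1891]').inspect # => nil
```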
#sortable_year_for_century ⇒ String?
get first year of century (as String) if we have: yyuu, yy--, yy--? or xxth century pattern
note that these are the only century patterns found in our actual date strings in MODS records
# File 'lib/stanford-mods/date_parsing.rb', line 196
def sortable_year_for_century
  return unless orig_date_str
  return if orig_date_str =~ /B\.C\./
  century_matches = orig_date_str.match(CENTURY_4CHAR_REGEXP)
  if century_matches
    return $1 + '00' if $1.length == 2
    return '0' + $1 + '00' if $1.length == 1
  end
  century_str_matches = orig_date_str.match(CENTURY_WORD_REGEXP)
  if century_str_matches
    yy = ($1.to_i - 1).to_s
    return yy + '00' if yy.length == 2
    return '0' + yy + '00' if yy.length == 1
  end
end
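The two century notations need different arithmetic: "17uu" already carries the leading digits of the year, while "17th century" is one higher than those digits. The sketch below shows this under a simplifying assumption: it pads with `rjust` instead of the two explicit length branches in the listing, which should give the same strings for one- and two-digit inputs. The name `first_year_of_century` is hypothetical.

```ruby
CENTURY_WORD_REGEXP = Regexp.new('(\d{1,2}).*century')
CENTURY_4CHAR_REGEXP = Regexp.new('(\d{1,2})[u\-]{2}([^u\-]|$)')

# Illustrative helper; rjust replaces the explicit length checks of the original.
def first_year_of_century(date_str)
  return if date_str =~ /B\.C\./
  if (matches = date_str.match(CENTURY_4CHAR_REGEXP))
    matches[1].rjust(2, '0') + '00'
  elsif (matches = date_str.match(CENTURY_WORD_REGEXP))
    (matches[1].to_i - 1).to_s.rjust(2, '0') + '00'
  end
end

puts first_year_of_century('17uu')         # => 1700
puts first_year_of_century('17th century') # => 1600
puts first_year_of_century('5th century')  # => 0400
```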
#sortable_year_for_decade ⇒ String?
get first year of decade (as String) if we have: yyyu, yyy-, yyy? or yyyx pattern
note that these are the only decade patterns found in our actual date strings in MODS records
# File 'lib/stanford-mods/date_parsing.rb', line 167
def sortable_year_for_decade
  decade_matches = orig_date_str.match(DECADE_4CHAR_REGEXP) if orig_date_str
  changed_to_zero = decade_matches.to_s.tr('u\-?x', '0') if decade_matches
  DateParsing.new(changed_to_zero).sortable_year_for_yyyy if changed_to_zero
end
#sortable_year_for_yy ⇒ String?
returns 4 digit year as String if we have a x/x/yy or x-x-yy pattern
note that these are the only 2 digit year patterns found in our actual date strings in MODS records
we use 20 as century digits unless it is greater than current year:
1/1/15 -> 2015
1/1/25 -> 1925
# File 'lib/stanford-mods/date_parsing.rb', line 145
def sortable_year_for_yy
  return unless orig_date_str
  slash_matches = orig_date_str.match(/\d{1,2}\/\d{1,2}\/\d{2}/)
  if slash_matches
    date_obj = Date.strptime(orig_date_str, '%m/%d/%y')
  else
    hyphen_matches = orig_date_str.match(/\d{1,2}-\d{1,2}-\d{2}/)
    date_obj = Date.strptime(orig_date_str, '%m-%d-%y') if hyphen_matches
  end
  if date_obj && date_obj > Date.today
    date_obj = Date.new(date_obj.year - 100, date_obj.month, date_obj.mday)
  end
  date_obj.year.to_s if date_obj
rescue ArgumentError
  nil # explicitly want nil if date won't parse
end
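Which century a two-digit year lands in depends on the day the code runs, so the 1/1/25 example above only holds while that date is still in the future. The sketch below makes this explicit by passing the reference date in as a parameter; the `today` argument and the name `four_digit_year` are assumptions for illustration, since the real method always uses Date.today.

```ruby
require 'date'

# Illustrative version of #sortable_year_for_yy with an injectable reference date.
def four_digit_year(date_str, today = Date.today)
  format = if date_str =~ /\d{1,2}\/\d{1,2}\/\d{2}/
             '%m/%d/%y'
           elsif date_str =~ /\d{1,2}-\d{1,2}-\d{2}/
             '%m-%d-%y'
           end
  return unless format
  date_obj = Date.strptime(date_str, format)
  # A date in the future is assumed to belong to the previous century.
  date_obj = Date.new(date_obj.year - 100, date_obj.month, date_obj.mday) if date_obj > today
  date_obj.year.to_s
rescue ArgumentError
  nil # unparseable dates yield nil, as in the original
end

reference = Date.new(2016, 6, 1)
puts four_digit_year('1/1/15', reference) # => 2015
puts four_digit_year('1/1/25', reference) # => 1925
```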
#sortable_year_for_yyyy ⇒ String?
looks for 4 consecutive digits in orig_date_str and returns first occurrence if found
# File 'lib/stanford-mods/date_parsing.rb', line 134
def sortable_year_for_yyyy
  matches = orig_date_str.match(/\d{4}/) if orig_date_str
  matches.to_s if matches
end
#sortable_year_for_yyyy_or_yy ⇒ String?
get String sortable value year if we can parse date_str to get a year.
# File 'lib/stanford-mods/date_parsing.rb', line 120
def sortable_year_for_yyyy_or_yy
  # most date strings have a four digit year
  result = sortable_year_for_yyyy
  result ||= sortable_year_for_yy # 19xx or 20xx
  result
end
#sortable_year_int_for_bc ⇒ Integer?
get Integer sortable value for B.C. if we have B.C. pattern
# File 'lib/stanford-mods/date_parsing.rb', line 243
def sortable_year_int_for_bc
  bc_matches = orig_date_str.match(BC_REGEX) if orig_date_str
  "-#{$1}".to_i if bc_matches
end
#sortable_year_int_for_early_numeric ⇒ Integer?
get Integer sortable value from date String containing yyy, yy, y, -y, -yy, -yyy, -yyyy
# File 'lib/stanford-mods/date_parsing.rb', line 276
def sortable_year_int_for_early_numeric
  return orig_date_str.to_i if orig_date_str.match(EARLY_NUMERIC)
  orig_date_str.to_i if orig_date_str =~ /^-\d{4}$/
end
#sortable_year_str_for_bc ⇒ String?
get String sortable value for B.C. if we have B.C. pattern
note that these values must *lexically* sort to create a chronological sort.
We know our data does not contain B.C. dates older than 999, so we can make them
lexically sort by subtracting 1000. So we get:
-700 for 300 B.C., -750 for 250 B.C., -800 for 200 B.C., -801 for 199 B.C.
# File 'lib/stanford-mods/date_parsing.rb', line 236
def sortable_year_str_for_bc
  bc_matches = orig_date_str.match(BC_REGEX) if orig_date_str
  ($1.to_i - 1000).to_s if bc_matches
end
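The subtract-1000 trick can be checked by sorting the resulting strings lexically and confirming the order is chronological. This is a sketch with the method restated at top level; it relies on the stated assumption that no B.C. year exceeds 999, and would not sort correctly for something like 1600 B.C.

```ruby
BC_REGEX = Regexp.new('(\d{1,4}).*' + Regexp.escape('B.C.'))

# Illustrative stand-alone version of #sortable_year_str_for_bc.
def sortable_year_str_for_bc(date_str)
  matches = date_str.match(BC_REGEX)
  (matches[1].to_i - 1000).to_s if matches
end

dates = ['199 B.C.', '300 B.C.', '200 B.C.', '250 B.C.']
# String comparison of the keys puts the oldest date first.
sorted = dates.sort_by { |date| sortable_year_str_for_bc(date) }
puts sorted.inspect # => ["300 B.C.", "250 B.C.", "200 B.C.", "199 B.C."]
```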
#sortable_year_str_for_early_numeric ⇒ String?
get String sortable value from date String containing yyy, yy, y, -y, -yy, -yyy
note that these values must *lexically* sort to create a chronological sort.
We know our data does not contain negative dates older than -999, so we can make them
lexically sort by subtracting 1000. So we get:
-983 for -17, -999 for -1, 0000 for 0, 0001 for 1, 0017 for 17
# File 'lib/stanford-mods/date_parsing.rb', line 263
def sortable_year_str_for_early_numeric
  return unless orig_date_str.match(EARLY_NUMERIC)
  if orig_date_str =~ /^\-/
    # negative number becomes x - 1000 for sorting; -005 for -995
    num = orig_date_str[1..-1].to_i - 1000
    return '-' + num.to_s[1..-1].rjust(3, '0')
  else
    return orig_date_str.rjust(4, '0')
  end
end
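The same idea extends to mixed positive and negative numeric strings: negatives are remapped and zero-padded to three digits, positives are zero-padded to four, and the minus sign sorts before any digit. A sketch with the method restated at top level (the parameter is an assumption for the example):

```ruby
EARLY_NUMERIC = Regexp.new('^\-?\d{1,3}$')

# Illustrative stand-alone version of #sortable_year_str_for_early_numeric.
def sortable_year_str_for_early_numeric(date_str)
  return unless date_str.match(EARLY_NUMERIC)
  if date_str.start_with?('-')
    num = date_str[1..-1].to_i - 1000          # e.g. 17 - 1000 = -983
    '-' + num.to_s[1..-1].rjust(3, '0')        # drop the sign, pad, re-add the sign
  else
    date_str.rjust(4, '0')
  end
end

years = ['17', '-1', '0', '-17', '1']
sorted = years.sort_by { |year| sortable_year_str_for_early_numeric(year) }
puts sorted.inspect # => ["-17", "-1", "0", "1", "17"]
```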
#sortable_year_string_from_date_str ⇒ String?
get String sortable value year if we can parse date_str to get a year.
SearchWorks currently uses a string field for pub date sorting; thus so does Spotlight.
The values returned must *lexically* sort in chronological order, so the B.C. dates are tricky
# File 'lib/stanford-mods/date_parsing.rb', line 101
def sortable_year_string_from_date_str
  return if orig_date_str == '0000-00-00' # shpc collection has these useless dates
  # B.C. first in case there are 4 digits, e.g. 1600 B.C.
  return sortable_year_str_for_bc if orig_date_str.match(BC_REGEX)
  result = sortable_year_for_yyyy_or_yy
  result ||= sortable_year_for_decade # 19xx or 20xx
  result ||= sortable_year_for_century
  result ||= sortable_year_str_for_early_numeric
  unless result
    # try removing brackets between digits in case we have 169[5] or [18]91
    no_brackets = remove_brackets
    return DateParsing.new(no_brackets).sortable_year_string_from_date_str if no_brackets
  end
  result if self.class.year_str_valid?(result)
end
#year_int_from_date_str ⇒ Integer?
get Integer year if we can parse date_str to get a year.
# File 'lib/stanford-mods/date_parsing.rb', line 80
def year_int_from_date_str
  return if orig_date_str == '0000-00-00' # shpc collection has these useless dates
  # B.C. first in case there are 4 digits, e.g. 1600 B.C.
  return sortable_year_int_for_bc if orig_date_str.match(BC_REGEX)
  result = sortable_year_for_yyyy_or_yy
  result ||= sortable_year_for_decade # 19xx or 20xx
  result ||= sortable_year_for_century
  result ||= sortable_year_int_for_early_numeric
  unless result
    # try removing brackets between digits in case we have 169[5] or [18]91
    no_brackets = remove_brackets
    return DateParsing.new(no_brackets).year_int_from_date_str if no_brackets
  end
  result.to_i if result && self.class.year_int_valid?(result.to_i)
end
#year_via_ruby_parsing ⇒ String?
NOTE: while Date.parse() works for many dates, the *sortable_year_for_yyyy
actually works for nearly all those cases and a lot more besides. Trial and error
with an extensive set of test data culled from actual date strings in our MODS records
has made this method bogus.
# File 'lib/stanford-mods/date_parsing.rb', line 302
def year_via_ruby_parsing
  return unless orig_date_str =~ /\d\d/ # need at least 2 digits
  # need more in string than only 2 digits
  return if orig_date_str.match(/^\d\d$/) || orig_date_str.match(/^\D*\d\d\D*$/)
  return if orig_date_str =~ /\d\s*B.C./ # skip B.C. dates
  date_obj = Date.parse(orig_date_str)
  date_obj.year.to_s
rescue ArgumentError
  nil # explicitly want nil if date won't parse
end