Class: Stanford::Mods::DateParsing
- Inherits:
- Object (hierarchy: Object < Stanford::Mods::DateParsing)
- Defined in:
- lib/stanford-mods/date_parsing.rb
Overview
Parsing date strings.
TODO: this should become its own gem and/or be eclipsed by or merged with the timetwister gem.
When this is "gemified":
- we may want an integer or date sort field as well as lexical
- we could add methods like my_date.bc?
Constant Summary
- BRACKETS_BETWEEN_DIGITS_REXEXP =
Regexp.new('\d[' + Regexp.escape('[]') + ']\d')
- CENTURY_WORD_REGEXP =
Regexp.new('(\d{1,2}).*century')
- CENTURY_4CHAR_REGEXP =
Regexp.new('(\d{1,2})[u\-]{2}')
- BC_REGEX =
Regexp.new('(\d{1,4}).*' + Regexp.escape('B.C.'))
- EARLY_NUMERIC =
Regexp.new('^\-?\d{1,3}$')
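As a rough illustration of what each constant matches, the sketch below copies the five definitions from the summary above and runs them against a few sample strings. The sample strings are made up for this example and are not taken from real MODS records.

```ruby
# Constants copied from the summary above so the snippet runs on its own.
BRACKETS_BETWEEN_DIGITS_REXEXP = Regexp.new('\d[' + Regexp.escape('[]') + ']\d')
CENTURY_WORD_REGEXP = Regexp.new('(\d{1,2}).*century')
CENTURY_4CHAR_REGEXP = Regexp.new('(\d{1,2})[u\-]{2}')
BC_REGEX = Regexp.new('(\d{1,4}).*' + Regexp.escape('B.C.'))
EARLY_NUMERIC = Regexp.new('^\-?\d{1,3}$')

# Hypothetical sample inputs, chosen only to exercise each pattern.
puts '169[5]'.match?(BRACKETS_BETWEEN_DIGITS_REXEXP) # bracket sits between two digits
puts '[1695]'.match?(BRACKETS_BETWEEN_DIGITS_REXEXP) # brackets are not between digits
puts '17th century'.match(CENTURY_WORD_REGEXP)[1]    # captured century number
puts '16--'.match(CENTURY_4CHAR_REGEXP)[1]           # captured leading century digits
puts '300 B.C.'.match(BC_REGEX)[1]                   # captured B.C. year
puts '-17'.match?(EARLY_NUMERIC)                     # one to three digits, optional minus
puts '1845'.match?(EARLY_NUMERIC)                    # four digits are not "early numeric"
```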
Instance Attribute Summary
-
#orig_date_str ⇒ Object
readonly
Returns the value of attribute orig_date_str.
Class Method Summary
-
.facet_string_from_date_str(date_str) ⇒ String?
get single facet value for date, generally an explicit year or “17th century” or “5 B.C.”; returns '845', not 0845.
-
.sortable_year_string_from_date_str(date_str) ⇒ String?
get String sortable value year if we can parse date_str to get a year.
-
.year_int_from_date_str(date_str) ⇒ Integer?
get year as Integer if we can parse date_str to get a year.
-
.year_int_valid?(year) ⇒ Boolean
true if the year is an Integer between -999 and (current year + 1).
-
.year_str_valid?(year_str) ⇒ Boolean
true if the year is between -999 and (current year + 1).
Instance Method Summary
-
#facet_string_for_bc ⇒ String?
get single facet value for B.C.
-
#facet_string_for_century ⇒ String?
get single facet value for century (17th century) if we have: yyuu, yy--, yy--? or xxth century pattern; note that these are the only century patterns found in our actual date strings in MODS records.
-
#facet_string_for_early_numeric ⇒ Object
get single facet value for date String containing yyy, yy, y, -y, -yy, -yyy; negative number strings will be changed to B.C.
-
#facet_string_from_date_str ⇒ String?
get single facet value for date, generally an explicit year or “17th century” or “5 B.C.”.
-
#initialize(date_str) ⇒ DateParsing
constructor
A new instance of DateParsing.
-
#remove_brackets ⇒ Object
removes brackets between digits such as 169[5] or [18]91.
-
#sortable_year_for_century ⇒ String?
get first year of century (as String) if we have: yyuu, yy--, yy--? or xxth century pattern; note that these are the only century patterns found in our actual date strings in MODS records.
-
#sortable_year_for_decade ⇒ String?
get first year of decade (as String) if we have: yyyu, yyy-, yyy? or yyyx pattern; note that these are the only decade patterns found in our actual date strings in MODS records.
-
#sortable_year_for_yy ⇒ String?
returns 4 digit year as String if we have a x/x/yy or x-x-yy pattern; note that these are the only 2 digit year patterns found in our actual date strings in MODS records. We use 20 as the century digits unless the resulting date is later than today, in which case we use 19: 1/1/15 -> 2015, 1/1/25 -> 1925 (for a current date from 2015 through 2024).
-
#sortable_year_for_yyyy ⇒ String?
looks for 4 consecutive digits in orig_date_str and returns first occurrence if found.
-
#sortable_year_for_yyyy_yy_or_decade ⇒ String?
get String sortable value year if we can parse date_str to get a year.
-
#sortable_year_int_for_bc ⇒ Integer?
get Integer sortable value for B.C.
-
#sortable_year_int_for_early_numeric ⇒ Integer?
get Integer sortable value from date String containing yyy, yy, y, -y, -yy, -yyy, -yyyy.
-
#sortable_year_str_for_bc ⇒ String?
get String sortable value for B.C.
-
#sortable_year_str_for_early_numeric ⇒ String?
get String sortable value from date String containing yyy, yy, y, -y, -yy, -yyy; note that these values must lexically sort to create a chronological sort.
-
#sortable_year_string_from_date_str ⇒ String?
get String sortable value year if we can parse date_str to get a year.
-
#year_int_from_date_str ⇒ Integer?
get Integer year if we can parse date_str to get a year.
-
#year_via_ruby_parsing ⇒ String?
NOTE: while Date.parse() works for many dates, the *sortable_year_for_yyyy actually works for nearly all those cases and a lot more besides.
Constructor Details
#initialize(date_str) ⇒ DateParsing
Returns a new instance of DateParsing.
# File 'lib/stanford-mods/date_parsing.rb', line 52
def initialize(date_str)
  @orig_date_str = date_str
  @orig_date_str.freeze
end
Instance Attribute Details
#orig_date_str ⇒ Object (readonly)
Returns the value of attribute orig_date_str.
# File 'lib/stanford-mods/date_parsing.rb', line 50
def orig_date_str
  @orig_date_str
end
Class Method Details
.facet_string_from_date_str(date_str) ⇒ String?
get single facet value for date, generally an explicit year or “17th century” or “5 B.C.”
returns '845', not 0845
# File 'lib/stanford-mods/date_parsing.rb', line 14
def self.facet_string_from_date_str(date_str)
  DateParsing.new(date_str).facet_string_from_date_str
end
.sortable_year_string_from_date_str(date_str) ⇒ String?
get String sortable value year if we can parse date_str to get a year.
SearchWorks currently uses a string field for pub date sorting; thus so does Spotlight.
The values returned must *lexically* sort in chronological order, so the B.C. dates are tricky
# File 'lib/stanford-mods/date_parsing.rb', line 31
def self.sortable_year_string_from_date_str(date_str)
  DateParsing.new(date_str).sortable_year_string_from_date_str
end
.year_int_from_date_str(date_str) ⇒ Integer?
get year as Integer if we can parse date_str to get a year.
# File 'lib/stanford-mods/date_parsing.rb', line 21
def self.year_int_from_date_str(date_str)
  DateParsing.new(date_str).year_int_from_date_str
end
.year_int_valid?(year) ⇒ Boolean
true if the year is an Integer between -999 and (current year + 1)
# File 'lib/stanford-mods/date_parsing.rb', line 45
def self.year_int_valid?(year)
  return false unless year.is_a? Integer
  (-1000 < year.to_i) && (year < Date.today.year + 2)
end
.year_str_valid?(year_str) ⇒ Boolean
true if the year is between -999 and (current year + 1)
# File 'lib/stanford-mods/date_parsing.rb', line 38
def self.year_str_valid?(year_str)
  return false unless year_str && (year_str.match(/^\d{1,4}$/) || year_str.match(/^-\d{1,3}$/))
  (-1000 < year_str.to_i) && (year_str.to_i < Date.today.year + 2)
end
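The two validity checks can be tried without the gem using the standalone sketch below. The method names mirror the class methods above, but the `max_year` parameter is an assumption added here so the upper bound is easy to see, and the patterns use `\A`/`\z` anchors rather than the `^`/`$` of the listing.

```ruby
require 'date'

# Standalone sketch of the string check; not the library's own method.
# max_year is a hypothetical parameter that defaults to (current year + 1).
def year_str_valid?(year_str, max_year = Date.today.year + 1)
  return false unless year_str.is_a?(String)
  return false unless year_str.match?(/\A(\d{1,4}|-\d{1,3})\z/)
  year_str.to_i.between?(-999, max_year)
end

# Standalone sketch of the integer check; non-Integer input is rejected.
def year_int_valid?(year, max_year = Date.today.year + 1)
  return false unless year.is_a?(Integer)
  year.between?(-999, max_year)
end

puts year_str_valid?('0845')  # four digits, within range
puts year_str_valid?('-1000') # too many digits after the minus sign
puts year_int_valid?(-999)    # lowest accepted year
puts year_int_valid?('1999')  # a String is not an Integer
```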
Instance Method Details
#facet_string_for_bc ⇒ String?
get single facet value for B.C. if we have B.C. pattern
# File 'lib/stanford-mods/date_parsing.rb', line 233
def facet_string_for_bc
  bc_matches = orig_date_str.match(BC_REGEX) if orig_date_str
  bc_matches.to_s if bc_matches
end
#facet_string_for_century ⇒ String?
get single facet value for century (17th century) if we have: yyuu, yy--, yy--? or xxth century pattern
note that these are the only century patterns found in our actual date strings in MODS records
# File 'lib/stanford-mods/date_parsing.rb', line 198
def facet_string_for_century
  return unless orig_date_str
  return if orig_date_str.match(/B\.C\./)
  century_str_matches = orig_date_str.match(CENTURY_WORD_REGEXP)
  return century_str_matches.to_s if century_str_matches

  century_matches = orig_date_str.match(CENTURY_4CHAR_REGEXP)
  if century_matches
    require 'active_support/core_ext/integer/inflections'
    return "#{($1.to_i + 1).ordinalize} century"
  end
end
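The century facet logic can be sketched without ActiveSupport as follows. Both `century_facet` and `ordinal` are hypothetical names introduced for this example; `ordinal` is a small stand-in for `ordinalize`, and the method takes the date string as an argument instead of reading `orig_date_str`.

```ruby
# Hypothetical stand-in for ActiveSupport's Integer#ordinalize.
def ordinal(n)
  return "#{n}th" if (11..13).cover?(n % 100)
  suffix = { 1 => 'st', 2 => 'nd', 3 => 'rd' }.fetch(n % 10, 'th')
  "#{n}#{suffix}"
end

# Sketch of the century facet: an "xxth century" phrase is returned as matched,
# while yyuu / yy-- becomes the ordinal of (yy + 1).
def century_facet(date_str)
  return if date_str.nil? || date_str.include?('B.C.')
  word = date_str.match(/(\d{1,2}).*century/)
  return word.to_s if word
  four_char = date_str.match(/(\d{1,2})[u\-]{2}/)
  "#{ordinal(four_char[1].to_i + 1)} century" if four_char
end

puts century_facet('17uu')            # years 17xx fall in the 18th century
puts century_facet('ca. 17th century') # phrase is returned from the first digit on
```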
#facet_string_for_early_numeric ⇒ Object
get single facet value for date String containing yyy, yy, y, -y, -yy, -yyy
negative number strings will be changed to B.C. strings
# File 'lib/stanford-mods/date_parsing.rb', line 266
def facet_string_for_early_numeric
  return unless orig_date_str.match(EARLY_NUMERIC)
  # negative number becomes B.C.
  return orig_date_str[1..-1] + " B.C." if orig_date_str.match(/^\-/)
  # remove leading 0s from early dates
  orig_date_str.to_i.to_s
end
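A standalone sketch of the same conversion, assuming a hypothetical `early_numeric_facet` helper that takes the date string as an argument:

```ruby
# Sketch: strings of one to three digits (optionally negative) become facet values.
def early_numeric_facet(date_str)
  return unless date_str.match?(/\A-?\d{1,3}\z/)
  # negative number becomes B.C.; the digits are kept exactly as written
  return "#{date_str[1..-1]} B.C." if date_str.start_with?('-')
  # round trip through Integer drops leading zeros
  date_str.to_i.to_s
end

puts early_numeric_facet('-5')   # negative input is labelled B.C.
puts early_numeric_facet('045')  # leading zero removed
puts early_numeric_facet('1845').inspect # four digits are left to the other parsers
```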
#facet_string_from_date_str ⇒ String?
get single facet value for date, generally an explicit year or “17th century” or “5 B.C.”
# File 'lib/stanford-mods/date_parsing.rb', line 61
def facet_string_from_date_str
  return if orig_date_str == '0000-00-00' # shpc collection has these useless dates
  # B.C. first in case there are 4 digits, e.g. 1600 B.C.
  return facet_string_for_bc if orig_date_str.match(BC_REGEX)
  result = sortable_year_for_yyyy_yy_or_decade
  unless result
    # try removing brackets between digits in case we have 169[5] or [18]91
    no_brackets = remove_brackets
    return DateParsing.new(no_brackets).facet_string_from_date_str if no_brackets
  end
  # parsing below this line gives string inapprop for year_str_valid?
  unless self.class.year_str_valid?(result)
    result = facet_string_for_century
    result ||= facet_string_for_early_numeric
  end
  # remove leading 0s from early dates
  result = result.to_i.to_s if result && result.match(/^\d+$/)
  result
end
#remove_brackets ⇒ Object
removes brackets between digits such as 169[5] or [18]91
# File 'lib/stanford-mods/date_parsing.rb', line 130
def remove_brackets
  orig_date_str.delete('[]') if orig_date_str.match(BRACKETS_BETWEEN_DIGITS_REXEXP)
end
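To see which inputs trigger bracket removal, here is a minimal sketch that inlines the pattern and takes the string as an argument (the argument form is an assumption made so it runs on its own):

```ruby
# Sketch: brackets are deleted only when one sits directly between two digits;
# otherwise nil is returned, as in the listing above.
def remove_brackets(date_str)
  date_str.delete('[]') if date_str.match?(/\d[\[\]]\d/)
end

puts remove_brackets('169[5]')         # bracket between 9 and 5
puts remove_brackets('[18]91')         # bracket between 8 and 9
puts remove_brackets('[1891]').inspect # no bracket between digits, so nil
```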
#sortable_year_for_century ⇒ String?
get first year of century (as String) if we have: yyuu, yy--, yy--? or xxth century pattern
note that these are the only century patterns found in our actual date strings in MODS records
# File 'lib/stanford-mods/date_parsing.rb', line 179
def sortable_year_for_century
  return unless orig_date_str
  return if orig_date_str.match(/B\.C\./)
  century_matches = orig_date_str.match(CENTURY_4CHAR_REGEXP)
  if century_matches
    return $1 + '00' if $1.length == 2
    return '0' + $1 + '00' if $1.length == 1
  end
  century_str_matches = orig_date_str.match(CENTURY_WORD_REGEXP)
  if century_str_matches
    yy = ($1.to_i - 1).to_s
    return yy + '00' if yy.length == 2
    return '0' + yy + '00' if yy.length == 1
  end
end
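The mapping from a century pattern to its first year can be sketched as below. `century_start_year` is a hypothetical name, and `rjust` is used here in place of the two length checks in the listing; for one and two digit values the result should be the same.

```ruby
# Sketch: yyuu / yy-- maps to yy00, and "xxth century" maps to (xx - 1)00,
# always padded to four characters so the values sort lexically.
def century_start_year(date_str)
  return if date_str.nil? || date_str.include?('B.C.')
  if (m = date_str.match(/(\d{1,2})[u\-]{2}/))
    return m[1].rjust(2, '0') + '00'
  end
  if (m = date_str.match(/(\d{1,2}).*century/))
    return (m[1].to_i - 1).to_s.rjust(2, '0') + '00'
  end
  nil
end

puts century_start_year('17uu')         # first year of the 1700s
puts century_start_year('17th century') # the 17th century starts in 1600
puts century_start_year('9th century')  # zero padded to four characters
```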
#sortable_year_for_decade ⇒ String?
get first year of decade (as String) if we have: yyyu, yyy-, yyy? or yyyx pattern
note that these are the only decade patterns found in our actual date strings in MODS records
# File 'lib/stanford-mods/date_parsing.rb', line 167
def sortable_year_for_decade
  decade_matches = orig_date_str.match(/\d{3}[u\-?x]/) if orig_date_str
  changed_to_zero = decade_matches.to_s.tr('u\-?x', '0') if decade_matches
  DateParsing.new(changed_to_zero).sortable_year_for_yyyy if changed_to_zero
end
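A minimal sketch of the decade substitution, using a hypothetical `decade_start_year` helper. It skips the second `DateParsing` instance used in the listing, since the substituted string is already four digits. In the class this step only runs after the four digit and two digit year checks have found nothing.

```ruby
# Sketch: three digits followed by u, -, ? or x; the placeholder becomes 0.
def decade_start_year(date_str)
  m = date_str && date_str.match(/\d{3}[u\-?x]/)
  m.to_s.tr('u\-?x', '0') if m
end

puts decade_start_year('196u') # first year of the 1960s
puts decade_start_year('186-') # first year of the 1860s
puts decade_start_year('197?') # first year of the 1970s
```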
#sortable_year_for_yy ⇒ String?
returns 4 digit year as String if we have a x/x/yy or x-x-yy pattern
note that these are the only 2 digit year patterns found in our actual date strings in MODS records
we use 20 as the century digits unless the resulting date is later than today, in which case we use 19
(examples for a current date from 2015 through 2024):
1/1/15 -> 2015
1/1/25 -> 1925
# File 'lib/stanford-mods/date_parsing.rb', line 147
def sortable_year_for_yy
  return unless orig_date_str
  slash_matches = orig_date_str.match(/\d{1,2}\/\d{1,2}\/\d{2}/)
  if slash_matches
    date_obj = Date.strptime(orig_date_str, '%m/%d/%y')
  else
    hyphen_matches = orig_date_str.match(/\d{1,2}-\d{1,2}-\d{2}/)
    date_obj = Date.strptime(orig_date_str, '%m-%d-%y') if hyphen_matches
  end
  if date_obj && date_obj > Date.today
    date_obj = Date.new(date_obj.year - 100, date_obj.month, date_obj.mday)
  end
  date_obj.year.to_s if date_obj
rescue ArgumentError
  nil # explicitly want nil if date won't parse
end
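Because the result depends on the current date, the behaviour is easier to see with the date passed in. The sketch below assumes a hypothetical `year_from_two_digit_date` helper with a `today` parameter; neither is part of the class.

```ruby
require 'date'

# Sketch: parse x/x/yy or x-x-yy and move dates that land in the future
# back by 100 years. today is a hypothetical parameter for repeatable results.
def year_from_two_digit_date(date_str, today = Date.today)
  format = case date_str
           when /\d{1,2}\/\d{1,2}\/\d{2}/ then '%m/%d/%y'
           when /\d{1,2}-\d{1,2}-\d{2}/ then '%m-%d-%y'
           end
  return unless format
  date = Date.strptime(date_str, format)
  date = Date.new(date.year - 100, date.month, date.mday) if date > today
  date.year.to_s
rescue ArgumentError
  nil # unparseable dates such as month 13 give nil
end

fixed_today = Date.new(2020, 6, 1) # assumed "current date" for the example
puts year_from_two_digit_date('1/1/15', fixed_today) # already in the past
puts year_from_two_digit_date('1/1/25', fixed_today) # in the future, so moved back a century
```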
#sortable_year_for_yyyy ⇒ String?
looks for 4 consecutive digits in orig_date_str and returns first occurrence if found
# File 'lib/stanford-mods/date_parsing.rb', line 136
def sortable_year_for_yyyy
  matches = orig_date_str.match(/\d{4}/) if orig_date_str
  matches.to_s if matches
end
#sortable_year_for_yyyy_yy_or_decade ⇒ String?
get String sortable value year if we can parse date_str to get a year.
# File 'lib/stanford-mods/date_parsing.rb', line 121
def sortable_year_for_yyyy_yy_or_decade
  # most date strings have a four digit year
  result = sortable_year_for_yyyy
  result ||= sortable_year_for_yy # 19xx or 20xx
  result ||= sortable_year_for_decade # 19xx or 20xx
  result
end
#sortable_year_int_for_bc ⇒ Integer?
get Integer sortable value for B.C. if we have B.C. pattern
# File 'lib/stanford-mods/date_parsing.rb', line 226
def sortable_year_int_for_bc
  bc_matches = orig_date_str.match(BC_REGEX) if orig_date_str
  "-#{$1}".to_i if bc_matches
end
#sortable_year_int_for_early_numeric ⇒ Integer?
get Integer sortable value from date String containing yyy, yy, y, -y, -yy, -yyy, -yyyy
# File 'lib/stanford-mods/date_parsing.rb', line 259
def sortable_year_int_for_early_numeric
  return orig_date_str.to_i if orig_date_str.match(EARLY_NUMERIC)
  orig_date_str.to_i if orig_date_str.match(/^-\d{4}$/)
end
#sortable_year_str_for_bc ⇒ String?
get String sortable value for B.C. if we have B.C. pattern
note that these values must *lexically* sort to create a chronological sort.
We know our data does not contain B.C. dates older than 999, so we can make them
lexically sort by subtracting 1000. So we get:
-700 for 300 B.C., -750 for 250 B.C., -800 for 200 B.C., -801 for 199 B.C.
# File 'lib/stanford-mods/date_parsing.rb', line 219
def sortable_year_str_for_bc
  bc_matches = orig_date_str.match(BC_REGEX) if orig_date_str
  ($1.to_i - 1000).to_s if bc_matches
end
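The subtract-1000 trick can be checked with a plain string sort. The sketch below uses a hypothetical `sortable_bc` helper and the four example dates from the description above.

```ruby
# Sketch: B.C. year minus 1000, returned as a String, so that older dates
# get lexically smaller keys.
def sortable_bc(date_str)
  m = date_str.match(/(\d{1,4}).*B\.C\./)
  (m[1].to_i - 1000).to_s if m
end

dates = ['199 B.C.', '300 B.C.', '200 B.C.', '250 B.C.']
# Sorting by the String key puts the oldest date first.
puts dates.sort_by { |d| sortable_bc(d) }
```

Unlike the early numeric variant, neither this sketch nor the listing pads the result, so keys shorter than three digits (for example "-50" for 950 B.C.) may not sort as intended.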
#sortable_year_str_for_early_numeric ⇒ String?
get String sortable value from date String containing yyy, yy, y, -y, -yy, -yyy
note that these values must *lexically* sort to create a chronological sort.
We know our data does not contain negative dates older than -999, so we can make them
lexically sort by subtracting 1000. So we get:
-983 for -17, -999 for -1, 0000 for 0, 0001 for 1, 0017 for 17
# File 'lib/stanford-mods/date_parsing.rb', line 246
def sortable_year_str_for_early_numeric
  return unless orig_date_str.match(EARLY_NUMERIC)
  if orig_date_str.match(/^\-/)
    # negative number becomes x - 1000 for sorting; -005 for -995
    num = orig_date_str[1..-1].to_i - 1000
    return '-' + num.to_s[1..-1].rjust(3, '0')
  else
    return orig_date_str.rjust(4, '0')
  end
end
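To confirm that these keys sort chronologically as plain strings, the sketch below reimplements the conversion in a hypothetical `sortable_early_numeric` helper and sorts a shuffled list of sample years by the generated key.

```ruby
# Sketch: non-negative years are zero padded to four characters; negative years
# become "-" plus (1000 - magnitude), zero padded to three digits.
def sortable_early_numeric(date_str)
  return unless date_str.match?(/\A-?\d{1,3}\z/)
  return date_str.rjust(4, '0') unless date_str.start_with?('-')
  offset = date_str[1..-1].to_i - 1000 # e.g. 17 - 1000 = -983
  '-' + offset.abs.to_s.rjust(3, '0')
end

years = %w[17 -1 0 -17 1 -995]
# String comparison of the keys yields oldest to newest.
puts years.sort_by { |y| sortable_early_numeric(y) }
```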
#sortable_year_string_from_date_str ⇒ String?
get String sortable value year if we can parse date_str to get a year.
SearchWorks currently uses a string field for pub date sorting; thus so does Spotlight.
The values returned must *lexically* sort in chronological order, so the B.C. dates are tricky
# File 'lib/stanford-mods/date_parsing.rb', line 103
def sortable_year_string_from_date_str
  return if orig_date_str == '0000-00-00' # shpc collection has these useless dates
  # B.C. first in case there are 4 digits, e.g. 1600 B.C.
  return sortable_year_str_for_bc if orig_date_str.match(BC_REGEX)
  result = sortable_year_for_yyyy_yy_or_decade
  result ||= sortable_year_for_century
  result ||= sortable_year_str_for_early_numeric
  unless result
    # try removing brackets between digits in case we have 169[5] or [18]91
    no_brackets = remove_brackets
    return DateParsing.new(no_brackets).sortable_year_string_from_date_str if no_brackets
  end
  result if self.class.year_str_valid?(result)
end
#year_int_from_date_str ⇒ Integer?
get Integer year if we can parse date_str to get a year.
# File 'lib/stanford-mods/date_parsing.rb', line 83
def year_int_from_date_str
  return if orig_date_str == '0000-00-00' # shpc collection has these useless dates
  # B.C. first in case there are 4 digits, e.g. 1600 B.C.
  return sortable_year_int_for_bc if orig_date_str.match(BC_REGEX)
  result = sortable_year_for_yyyy_yy_or_decade
  result ||= sortable_year_for_century
  result ||= sortable_year_int_for_early_numeric
  unless result
    # try removing brackets between digits in case we have 169[5] or [18]91
    no_brackets = remove_brackets
    return DateParsing.new(no_brackets).year_int_from_date_str if no_brackets
  end
  result.to_i if result && self.class.year_int_valid?(result.to_i)
end
#year_via_ruby_parsing ⇒ String?
NOTE: while Date.parse() works for many dates, the *sortable_year_for_yyyy
actually works for nearly all those cases and a lot more besides. Trial and error
with an extensive set of test data culled from actual date strings in our MODS records
has made this method bogus.
# File 'lib/stanford-mods/date_parsing.rb', line 279
def year_via_ruby_parsing
  return unless orig_date_str.match(/\d\d/) # need at least 2 digits
  # need more in string than only 2 digits
  return if orig_date_str.match(/^\d\d$/) || orig_date_str.match(/^\D*\d\d\D*$/)
  return if orig_date_str.match(/\d\s*B.C./) # skip B.C. dates
  date_obj = Date.parse(orig_date_str)
  date_obj.year.to_s
rescue ArgumentError
  nil # explicitly want nil if date won't parse
end