Class: String
- Inherits:
-
Object
- Object
- String
- Defined in:
- lib/string_extras.rb
Constant Summary collapse
- NONBREAKING_SPACE =
"\u00a0"
- ASCII_SPACE =
"\u0020"
- COMPOUND_NAMES =
[ 'Lane Fox', 'Bonham Carter', 'Pitt Rivers', 'Lloyd Webber', 'Sebag Montefiore', 'Holmes à Court', 'Holmes a Court', 'Baron Cohen', 'Strang Steel', 'Service Company', 'Corporation Company', 'Corporation System', 'Incorporations Limited' ]
- NAME_MODIFIERS =
[ 'Al', 'Ap', 'Ben', 'Dell[ae]', 'D[aeiou]', 'De[lrn]', 'D[ao]s', 'El', 'La', 'L[eo]', 'V[ao]n', 'Of', 'San', 'St[\.]?', 'Zur' ]
- APPROXIMATIONS =
Transliterations (like the i18n defaults) see github.com/svenfuchs/i18n/blob/master/lib/i18n/backend/transliterator.rb
{ 'À' => 'A', 'Á' => 'A', 'Â' => 'A', 'Ã' => 'A', 'Ä' => 'A', 'Å' => 'A', 'Æ' => 'AE', 'Ç' => 'C', 'È' => 'E', 'É' => 'E', 'Ê' => 'E', 'Ë' => 'E', 'Ì' => 'I', 'Í' => 'I', 'Î' => 'I', 'Ï' => 'I', 'Ð' => 'D', 'Ñ' => 'N', 'Ò' => 'O', 'Ó' => 'O', 'Ô' => 'O', 'Õ' => 'O', 'Ö' => 'O', '×' => 'x', 'Ø' => 'O', 'Ù' => 'U', 'Ú' => 'U', 'Û' => 'U', 'Ü' => 'U', 'Ý' => 'Y', 'Þ' => 'Th', 'ß' => 'ss', 'à' => 'a', 'á' => 'a', 'â' => 'a', 'ã' => 'a', 'ä' => 'a', 'å' => 'a', 'æ' => 'ae', 'ç' => 'c', 'è' => 'e', 'é' => 'e', 'ê' => 'e', 'ë' => 'e', 'ì' => 'i', 'í' => 'i', 'î' => 'i', 'ï' => 'i', 'ð' => 'd', 'ñ' => 'n', 'ò' => 'o', 'ó' => 'o', 'ô' => 'o', 'õ' => 'o', 'ö' => 'o', 'ø' => 'o', 'ù' => 'u', 'ú' => 'u', 'û' => 'u', 'ü' => 'u', 'ý' => 'y', 'þ' => 'th', 'ÿ' => 'y', 'Ā' => 'A', 'ā' => 'a', 'Ă' => 'A', 'ă' => 'a', 'Ą' => 'A', 'ą' => 'a', 'Ć' => 'C', 'ć' => 'c', 'Ĉ' => 'C', 'ĉ' => 'c', 'Ċ' => 'C', 'ċ' => 'c', 'Č' => 'C', 'č' => 'c', 'Ď' => 'D', 'ď' => 'd', 'Đ' => 'D', 'đ' => 'd', 'Ē' => 'E', 'ē' => 'e', 'Ĕ' => 'E', 'ĕ' => 'e', 'Ė' => 'E', 'ė' => 'e', 'Ę' => 'E', 'ę' => 'e', 'Ě' => 'E', 'ě' => 'e', 'Ĝ' => 'G', 'ĝ' => 'g', 'Ğ' => 'G', 'ğ' => 'g', 'Ġ' => 'G', 'ġ' => 'g', 'Ģ' => 'G', 'ģ' => 'g', 'Ĥ' => 'H', 'ĥ' => 'h', 'Ħ' => 'H', 'ħ' => 'h', 'Ĩ' => 'I', 'ĩ' => 'i', 'Ī' => 'I', 'ī' => 'i', 'Ĭ' => 'I', 'ĭ' => 'i', 'Į' => 'I', 'į' => 'i', 'İ' => 'I', 'ı' => 'i', 'IJ' => 'IJ', 'ij' => 'ij', 'Ĵ' => 'J', 'ĵ' => 'j', 'Ķ' => 'K', 'ķ' => 'k', 'ĸ' => 'k', 'Ĺ' => 'L', 'ĺ' => 'l', 'Ļ' => 'L', 'ļ' => 'l', 'Ľ' => 'L', 'ľ' => 'l', 'Ŀ' => 'L', 'ŀ' => 'l', 'Ł' => 'L', 'ł' => 'l', 'Ń' => 'N', 'ń' => 'n', 'Ņ' => 'N', 'ņ' => 'n', 'Ň' => 'N', 'ň' => 'n', 'ʼn' => "'n", 'Ŋ' => 'NG', 'ŋ' => 'ng', 'Ō' => 'O', 'ō' => 'o', 'Ŏ' => 'O', 'ŏ' => 'o', 'Ő' => 'O', 'ő' => 'o', 'Œ' => 'OE', 'œ' => 'oe', 'Ŕ' => 'R', 'ŕ' => 'r', 'Ŗ' => 'R', 'ŗ' => 'r', 'Ř' => 'R', 'ř' => 'r', 'Ś' => 'S', 'ś' => 's', 'Ŝ' => 'S', 'ŝ' => 's', 'Ş' => 'S', 'ş' => 's', 'Š' => 'S', 'š' => 's', 'Ţ' => 'T', 'ţ' => 't', 'Ť' => 'T', 'ť' => 't', 'Ŧ' => 'T', 'ŧ' => 't', 'Ũ' => 'U', 'ũ' => 'u', 'Ū' => 'U', 'ū' => 'u', 'Ŭ' => 'U', 'ŭ' => 'u', 'Ů' => 'U', 'ů' => 'u', 'Ű' => 'U', 'ű' => 'u', 'Ų' => 'U', 'ų' => 'u', 'Ŵ' => 'W', 'ŵ' => 'w', 'Ŷ' => 'Y', 'ŷ' => 'y', 'Ÿ' => 'Y', 'Ź' => 'Z', 'ź' => 'z', 'Ż' => 'Z', 'ż' => 'z', 'Ž' => 'Z', 'ž' => 'z' }
- BAD_ENCODING =
When strings are mistakenly encoded as single-byte character sets, instead of UTF-8, there are some distinctive character combinations that we can spot and fix Useful table here www.i18nqa.com/debug/utf8-debug.html
{ '€' => '€', '‚' => '‚', 'Æ’' => 'ƒ', '„' => '„', '…' => '…', '†' => '†', '‡' => '‡', 'ˆ' => 'ˆ', '‰' => '‰', 'Å ' => 'Š', '‹' => '‹', 'Å’' => 'Œ', 'Ž' => 'Ž', '‘' => '‘', '’' => '’', '“' => '“', 'â€' => '”', # Note the invisible Ux009D in the key '′' => '′', # Manually added. Some seem to use this instead of Ux2019 '•' => '•', '–' => '–', '—' => '—', 'Ëœ' => '˜', 'â„¢' => '™', 'Å¡' => 'š', '›' => '›', 'Å“' => 'œ', 'ž' => 'ž', 'Ÿ' => 'Ÿ', ' ' => ' ', '¡' => '¡', '¢' => '¢', '£' => '£', '¤' => '¤', 'Â¥' => '¥', '¦' => '¦', '§' => '§', '¨' => '¨', '©' => '©', 'ª' => 'ª', '«' => '«', '¬' => '¬', 'Â' => '', '®' => '®', '¯' => '¯', '°' => '°', '±' => '±', '²' => '²', '³' => '³', '´' => '´', 'µ' => 'µ', '¶' => '¶', '·' => '·', '¸' => '¸', '¹' => '¹', 'º' => 'º', '»' => '»', '¼' => '¼', '½' => '½', '¾' => '¾', '¿' => '¿', 'À' => 'À', 'Ã�' => 'Á', 'Â' => 'Â', 'Ã' => 'Ã', 'Ä' => 'Ä', 'Ã…' => 'Å', 'Æ' => 'Æ', 'Ç' => 'Ç', 'È' => 'È', 'É' => 'É', 'Ê' => 'Ê', 'Ë' => 'Ë', 'ÃŒ' => 'Ì', "\xC3\x8D" => 'Í', 'ÃŽ' => 'Î', "\xC3\x8F" => 'Ï', "\xC3\x90" => 'Ð', 'Ñ' => 'Ñ', 'Ã’' => 'Ò', 'Ó' => 'Ó', 'Ô' => 'Ô', 'Õ' => 'Õ', 'Ö' => 'Ö', '×' => '×', 'Ø' => 'Ø', 'Ù' => 'Ù', 'Ú' => 'Ú', 'Û' => 'Û', 'Ü' => 'Ü', "\xC3\x9D" => 'Ý', 'Þ' => 'Þ', 'ß' => 'ß', 'à ' => 'à', 'á' => 'á', 'â' => 'â', 'ã' => 'ã', 'ä' => 'ä', 'Ã¥' => 'å', 'æ' => 'æ', 'ç' => 'ç', 'è' => 'è', 'é' => 'é', 'ê' => 'ê', 'ë' => 'ë', 'ì' => 'ì', 'Ã' => 'í', 'î' => 'î', 'ï' => 'ï', 'ð' => 'ð', 'ñ' => 'ñ', 'ò' => 'ò', 'ó' => 'ó', 'ô' => 'ô', 'õ' => 'õ', 'ö' => 'ö', '÷' => '÷', 'ø' => 'ø', 'ù' => 'ù', 'ú' => 'ú', 'û' => 'û', 'ü' => 'ü', 'ý' => 'ý', 'þ' => 'þ', 'ÿ' => 'ÿ', "\x00" => '' # Manually added to avoid Bad Argument exception }
- BAD_ENCODING_PATTERNS =
/(#{BAD_ENCODING.keys.join('|')})/
Instance Method Summary collapse
- #ansi_attributes(*args) ⇒ Object
-
#approximate_latin_chars! ⇒ Object
Any characters that resemble latin characters might usefully be transliterated into ones that are easy to type on an anglophone keyboard.
- #downcase_after_apostrophe! ⇒ Object
- #ensure_safe! ⇒ Object
- #ensure_space_after_initials! ⇒ Object
- #fix_apostrophe_modifiers! ⇒ Object
-
#fix_encoding_errors! ⇒ Object
Strings that were wrongly encoded with single-byte encodings sometimes have tell-tale substrings that we can put back into the correct UTF-8 character.
-
#fix_ff! ⇒ Object
Fix ff wierdybonks.
-
#fix_mac! ⇒ Object
Our list of terminal characters that indicate a non-celtic name used to include o but we removed it because of MacMurdo.
-
#fix_name_modifiers! ⇒ Object
Fixes for name modifiers followed by space Also replaces spaces with non-breaking spaces Fixes for name modifiers followed by an apostrophe, e.g.
-
#fix_separators!(separator) ⇒ Object
Make sure separators are not where they shouldn’t be.
-
#invalid_chars_to!(separator) ⇒ Object
Change some characters embedded in words to our separator character e.g.
-
#nbsp_in_compound_name! ⇒ Object
Fix known last names that have spaces (not hyphens!).
- #nbsp_in_name_modifier! ⇒ Object
- #remove_periods_from_initials! ⇒ Object
- #remove_spaces_from_initials! ⇒ Object
-
#safe_unescape! ⇒ Object
Unescape percent-encoded characters This might introduce UTF-8 invalid byte sequence so we take precautions.
-
#space_around_comma! ⇒ Object
Ensure commas have exactly one space after them.
- #strip_or_self! ⇒ Object
-
#strip_unwanted!(filter) ⇒ Object
Strip illegal characters out completely.
- #substitute!(pattern, replacement) ⇒ Object
- #upcase_first_letter! ⇒ Object
-
#upcase_initials! ⇒ Object
Upcase words with no vowels, e.g JPR Williams Except Ng.
-
#whitespace_to!(separator) ⇒ Object
Change any whitespace into our separator character.
Instance Method Details
#ansi_attributes(*args) ⇒ Object
265 266 267 |
# File 'lib/string_extras.rb', line 265 def ansi_attributes(*args) "\e[#{args.join(';')}m#{self}\e[0m" end |
#approximate_latin_chars! ⇒ Object
Any characters that resemble latin characters might usefully be transliterated into ones that are easy to type on an anglophone keyboard.
57 58 59 |
# File 'lib/string_extras.rb', line 57 def approximate_latin_chars! gsub!(/[^\x00-\x7f]/u) { |char| APPROXIMATIONS[char] || char } || self end |
#downcase_after_apostrophe! ⇒ Object
71 72 73 |
# File 'lib/string_extras.rb', line 71 def downcase_after_apostrophe! gsub!(/\'\w\b/) { |c| c.downcase } || self # Lowercase 's end |
#ensure_safe! ⇒ Object
160 161 162 |
# File 'lib/string_extras.rb', line 160 def ensure_safe! encode!('UTF-8', invalid: :replace, undef: :replace, replace: '') end |
#ensure_space_after_initials! ⇒ Object
156 157 158 |
# File 'lib/string_extras.rb', line 156 def ensure_space_after_initials! gsub!(/\b([a-z]\.)(?=[a-z0-9]{2,})/i) { |_| "#{Regexp.last_match[1]} " } || self end |
#fix_apostrophe_modifiers! ⇒ Object
114 115 116 117 118 119 120 |
# File 'lib/string_extras.rb', line 114 def fix_apostrophe_modifiers! %w(Dell D).each do |modifier| gsub!(/(.#{modifier}')(\w)/) { |_| "#{Regexp.last_match[1].rstrip.downcase}#{Regexp.last_match[2]}" } end self # Allows chaining end |
#fix_encoding_errors! ⇒ Object
Strings that were wrongly encoded with single-byte encodings sometimes have tell-tale substrings that we can put back into the correct UTF-8 character
63 64 65 |
# File 'lib/string_extras.rb', line 63 def fix_encoding_errors! gsub!(BAD_ENCODING_PATTERNS) { |substring| BAD_ENCODING[substring] || substring } || self end |
#fix_ff! ⇒ Object
Fix ff wierdybonks
92 93 94 95 96 97 98 |
# File 'lib/string_extras.rb', line 92 def fix_ff! %w( Fforbes Fforde Ffinch Ffrench Ffoulkes ).each { |ff_name| substitute!(ff_name, ff_name.downcase) } self # Allows chaining end |
#fix_mac! ⇒ Object
Our list of terminal characters that indicate a non-celtic name used to include o but we removed it because of MacMurdo.
77 78 79 80 81 82 83 84 85 86 87 88 89 |
# File 'lib/string_extras.rb', line 77 def fix_mac! if self =~ /\bMac[A-Za-z]{2,}[^acizj]\b/ || self =~ /\bMc/ gsub!(/\b(Ma?c)([A-Za-z]+)/) { |_| Regexp.last_match[1] + Regexp.last_match[2].capitalize } # Fix Mac exceptions %w( MacEdo MacEvicius MacHado MacHar MacHin MacHlin MacIas MacIulis MacKie MacKle MacKlin MacKmin MacKmurdo MacQuarie MacLise MacKenzie ).each { |mac_name| substitute!(/\b#{mac_name}/, mac_name.capitalize) } end self # Allows chaining end |
#fix_name_modifiers! ⇒ Object
Fixes for name modifiers followed by space Also replaces spaces with non-breaking spaces Fixes for name modifiers followed by an apostrophe, e.g. d’Artagnan, Commedia dell’Arte
103 104 105 106 107 108 109 110 111 112 |
# File 'lib/string_extras.rb', line 103 def fix_name_modifiers! NAME_MODIFIERS.each do |modifier| gsub!(/((?:[[:space:]]|^)#{modifier})([[:space:]]+|-)/) do |_| "#{Regexp.last_match[1].rstrip.downcase}#{Regexp.last_match[2].tr(ASCII_SPACE, NONBREAKING_SPACE)}" end end fix_apostrophe_modifiers! self # Allows chaining end |
#fix_separators!(separator) ⇒ Object
Make sure separators are not where they shouldn’t be
42 43 44 45 46 47 48 49 50 51 52 |
# File 'lib/string_extras.rb', line 42 def fix_separators!(separator) return self if separator.nil? || separator.empty? r = Regexp.escape(separator) # No more than one of the separator in a row. substitute!(/#{r}{2,}/, separator) # Remove leading/trailing separator. substitute!(/^#{r}|#{r}$/i, '') end |
#invalid_chars_to!(separator) ⇒ Object
Change some characters embedded in words to our separator character e.g. example.com -> example-com
24 25 26 |
# File 'lib/string_extras.rb', line 24 def invalid_chars_to!(separator) substitute!(%r{(?<![[:space:]])[\.\/](?![[:space:]])}, separator) end |
#nbsp_in_compound_name! ⇒ Object
Fix known last names that have spaces (not hyphens!)
130 131 132 133 134 135 136 |
# File 'lib/string_extras.rb', line 130 def nbsp_in_compound_name! COMPOUND_NAMES.each do |compound_name| substitute!(compound_name, compound_name.tr(ASCII_SPACE, NONBREAKING_SPACE)) end self # Allows chaining end |
#nbsp_in_name_modifier! ⇒ Object
138 139 140 141 142 143 144 |
# File 'lib/string_extras.rb', line 138 def nbsp_in_name_modifier! NAME_MODIFIERS.each do |modifier| gsub!(/([[:space:]]#{modifier})([[:space:]])/i) { |_| "#{Regexp.last_match[1]}#{NONBREAKING_SPACE}" } end self # Allows chaining end |
#remove_periods_from_initials! ⇒ Object
146 147 148 |
# File 'lib/string_extras.rb', line 146 def remove_periods_from_initials! gsub!(/\b([a-z])\./i) { |_| Regexp.last_match[1] } || self end |
#remove_spaces_from_initials! ⇒ Object
150 151 152 153 154 |
# File 'lib/string_extras.rb', line 150 def remove_spaces_from_initials! gsub!(/\b([a-z])(\.)* \b(?![a-z0-9'\u00C0-\u00FF]{2,})/i) do |_| "#{Regexp.last_match[1]}#{Regexp.last_match[2]}" end || self end |
#safe_unescape! ⇒ Object
Unescape percent-encoded characters This might introduce UTF-8 invalid byte sequence so we take precautions
31 32 33 34 35 36 37 38 39 |
# File 'lib/string_extras.rb', line 31 def safe_unescape! string = URI.unescape(self) rescue Encoding::CompatibilityError # e.g. "\u2019%80" return self else return self if self == string replace string ensure_safe! end |
#space_around_comma! ⇒ Object
Ensure commas have exactly one space after them
18 19 20 |
# File 'lib/string_extras.rb', line 18 def space_around_comma! substitute!(/[[:space:]]*,[[:space:]]*/, ', ') end |
#strip_or_self! ⇒ Object
8 9 10 |
# File 'lib/string_extras.rb', line 8 def strip_or_self! strip! || self end |
#strip_unwanted!(filter) ⇒ Object
Strip illegal characters out completely
4 5 6 |
# File 'lib/string_extras.rb', line 4 def strip_unwanted!(filter) substitute!(filter, '') end |
#substitute!(pattern, replacement) ⇒ Object
164 165 166 |
# File 'lib/string_extras.rb', line 164 def substitute!(pattern, replacement) gsub!(pattern, replacement) || self end |
#upcase_first_letter! ⇒ Object
67 68 69 |
# File 'lib/string_extras.rb', line 67 def upcase_first_letter! gsub!(/\b\w/) { |first| first.upcase } || self end |
#upcase_initials! ⇒ Object
Upcase words with no vowels, e.g JPR Williams Except Ng
124 125 126 127 |
# File 'lib/string_extras.rb', line 124 def upcase_initials! gsub!(/\b([bcdfghjklmnpqrstvwxz]+)\b/i) { |_| Regexp.last_match[1].upcase } gsub!(/\b(NG)\b/i) { |_| Regexp.last_match[1].capitalize } || self # http://en.wikipedia.org/wiki/Ng end |
#whitespace_to!(separator) ⇒ Object
Change any whitespace into our separator character
13 14 15 |
# File 'lib/string_extras.rb', line 13 def whitespace_to!(separator) substitute!(/[[:space:]]+/, separator) end |