Module: SanitizeUrl
- Defined in: lib/sanitize-url.rb
Overview
Helper methods in this module are module methods so that they won’t pollute the namespace into which the module is mixed in.
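The distinction between module methods and mixed-in methods can be sketched with a toy module. `Greeter`, `Page`, `decorate`, and `greet` are hypothetical names used only for illustration; they are not part of SanitizeUrl.

```ruby
# Hypothetical module that mirrors the pattern described above.
module Greeter
  # Module method: reachable only as Greeter.decorate, never mixed in.
  def self.decorate(text)
    "*#{text}*"
  end

  # Instance method: this is what `include Greeter` adds to a class.
  def greet(name)
    Greeter.decorate("hello #{name}")
  end
end

class Page
  include Greeter
end

page = Page.new
puts page.greet('world')          # the mixed-in method is available
puts page.respond_to?(:decorate)  # the helper did not leak into Page
```

Under the same pattern, a class that includes SanitizeUrl would gain only `sanitize_url`, while the helpers stay reachable through the module itself.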
Constant Summary
- ALPHANUMERIC_CHAR_CODES =
(48..57).to_a + (65..90).to_a + (97..122).to_a
- VALID_OPAQUE_SPECIAL_CHARS =
['!', '*', "'", '(', ')', ';', ':', '@', '&', '=', '+', '$', ',', '/', '?', '%', '#', '[', ']', '-', '_', '.', '~']
- VALID_OPAQUE_SPECIAL_CHAR_CODES =
VALID_OPAQUE_SPECIAL_CHARS.collect { |c| c.ord }
- VALID_OPAQUE_CHAR_CODES =
ALPHANUMERIC_CHAR_CODES + VALID_OPAQUE_SPECIAL_CHAR_CODES
- VALID_SCHEME_SPECIAL_CHARS =
['+', '.', '-']
- VALID_SCHEME_SPECIAL_CHAR_CODES =
VALID_SCHEME_SPECIAL_CHARS.collect { |c| c.ord }
- VALID_SCHEME_CHAR_CODES =
ALPHANUMERIC_CHAR_CODES + VALID_SCHEME_SPECIAL_CHAR_CODES
- HTTP_STYLE_SCHEMES =
Common schemes whose format should be “scheme://” instead of “scheme:”
['http', 'https', 'ftp', 'ftps', 'svn', 'svn+ssh', 'git']
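As a rough illustration of how these code-point tables are used, the sketch below rebuilds the scheme table and tests a few characters against it. The local variable names are made up for the example, and it uses `String#ord`, which returns the integer code point on Ruby 1.9 and later.

```ruby
# Illustrative re-creation of the scheme tables; the local names are made up.
alphanumeric_codes = (48..57).to_a + (65..90).to_a + (97..122).to_a
scheme_special_codes = ['+', '.', '-'].map(&:ord) # ord yields the integer code
valid_scheme_codes = alphanumeric_codes + scheme_special_codes

puts valid_scheme_codes.include?('s'.ord) # letters are allowed
puts valid_scheme_codes.include?('+'.ord) # so is '+', as in svn+ssh
puts valid_scheme_codes.include?('/'.ord) # '/' is not a scheme character

# A whole scheme is acceptable only if every byte is in the table.
puts 'svn+ssh'.each_byte.all? { |code| valid_scheme_codes.include?(code) }
```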
Class Method Summary
- .char_or_url_encoded(code) ⇒ Object
Return either the literal char or the URL-encoded equivalent, depending on our normalization rules.
- .dereference_numerics(str) ⇒ Object
- .url_encode?(code) ⇒ Boolean
Should we URL-encode the byte? Must receive an integer code point.
Instance Method Summary
- #sanitize_url(url, options = {}) ⇒ Object
Class Method Details
.char_or_url_encoded(code) ⇒ Object
Return either the literal char or the URL-encoded equivalent, depending on our normalization rules. Requires an integer code point, which may lie outside the single-byte range; such a code point is encoded as UTF-8 and each resulting byte is percent-encoded.
# File 'lib/sanitize-url.rb', line 90

def self.char_or_url_encoded(code)
  if url_encode?(code)
    # Encode the code point as UTF-8, then percent-encode every byte.
    utf_8_str = [code.to_i].pack('U')
    '%' + utf_8_str.unpack('H2' * utf_8_str.bytesize).join('%').upcase
  else
    code.chr
  end
end
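The multi-byte case can be illustrated in isolation. The following is a minimal sketch, assuming a hypothetical helper named `percent_encode_code_point` that always encodes (it skips the `url_encode?` check the real method performs).

```ruby
# Hypothetical helper: percent-encode every UTF-8 byte of one code point.
def percent_encode_code_point(code)
  utf8 = [code].pack('U') # integer code point -> UTF-8 string
  utf8.bytes.map { |byte| format('%%%02X', byte) }.join
end

puts percent_encode_code_point(0x20)   # space, one byte
puts percent_encode_code_point(0xE9)   # e with acute accent, two bytes
puts percent_encode_code_point(0x20AC) # euro sign, three bytes
```

Each byte is rendered as two uppercase hex digits, so a three-byte character expands to nine output characters.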
.dereference_numerics(str) ⇒ Object
# File 'lib/sanitize-url.rb', line 76

def self.dereference_numerics(str)
  # Decimal code points, e.g. &#106; or &#106 (both stand for "j")
  str = str.gsub(/&#([a-fA-F0-9]+);?/) do
    char_or_url_encoded($1.to_i)
  end

  # Hex code points, e.g. &#x6A; or &#x6A (both stand for "j")
  str.gsub(/&#[xX]([a-fA-F0-9]+);?/) do
    char_or_url_encoded($1.to_i(16))
  end
end
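To see why numeric references are resolved before the scheme is inspected, consider a simplified decoder. `decode_numeric_refs` is a hypothetical stand-in: unlike the method above it always emits the literal character rather than routing through `char_or_url_encoded`, and it accepts only digits in the decimal form.

```ruby
# Hypothetical, simplified decoder for numeric character references.
def decode_numeric_refs(str)
  # Decimal form, trailing semicolon optional.
  str = str.gsub(/&#(\d+);?/) { [$1.to_i].pack('U') }
  # Hex form, introduced by x or X.
  str.gsub(/&#[xX]([0-9a-fA-F]+);?/) { [$1.to_i(16)].pack('U') }
end

puts decode_numeric_refs('&#106;avascript:alert(1)')
puts decode_numeric_refs('&#x6A;avascript:alert(1)')
puts decode_numeric_refs('http://example.com/')
```

Once the references are resolved, an obfuscated scheme appears in plain form and can be compared against the list of allowed schemes.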
.url_encode?(code) ⇒ Boolean
Should we URL-encode the byte? Must receive an integer code point.
# File 'lib/sanitize-url.rb', line 101

def self.url_encode?(code)
  !(
    (code >= 48 and code <= 57) or   # Numbers
    (code >= 65 and code <= 90) or   # Uppercase
    (code >= 97 and code <= 122) or  # Lowercase
    VALID_OPAQUE_CHAR_CODES.include?(code)
  )
end
Instance Method Details
#sanitize_url(url, options = {}) ⇒ Object
# File 'lib/sanitize-url.rb', line 16

def sanitize_url(url, options = {})
  raise(ArgumentError, 'options[:schemes] must be an array') if options.has_key?(:schemes) and !options[:schemes].is_a?(Array)
  options = {
    :replace_evil_with => '',
    :schemes => ['http', 'https', 'ftp', 'ftps', 'mailto', 'svn', 'svn+ssh', 'git']
  }.merge(options)

  url = SanitizeUrl.dereference_numerics(url)

  # Schemes can consist of letters, digits, or any of the following special chars: + . -
  # The scheme must begin with a letter and be terminated by a colon.
  # Everything after the scheme is opaque for our purposes. (See http://www.w3.org/DesignIssues/Axioms.html#opaque)

  # Try to match a URI with a scheme. We check for percent-encoded characters in the scheme.
  url.match(/^(.+?)(:|%3A)(.*)$/)
  dirty_scheme = $1
  if dirty_scheme
    unescaped_opaque = $3
    return options[:replace_evil_with] if unescaped_opaque.nil? or unescaped_opaque.empty? or unescaped_opaque.match(/^\/+$/)
  else
    # Use http as the best guess, and the rest of the URL will be considered opaque
    dirty_scheme = 'http'
    unescaped_opaque = url
  end

  # Remove URL encoding from the scheme
  dirty_scheme.gsub!(/%([a-zA-Z0-9]{2})/) do
    code = $1.to_i(16)
    VALID_SCHEME_CHAR_CODES.include?(code) ? code.chr : ''
  end

  # Clean the scheme by removing invalid characters
  scheme = ''
  dirty_scheme.each_byte do |code|
    scheme << code.chr if VALID_SCHEME_CHAR_CODES.include?(code)
  end

  # URL-encode the opaque portion as necessary. Only encode those bytes that are absolutely not allowed in URLs.
  opaque = ''
  unescaped_opaque.each_byte do |code|
    if SanitizeUrl.url_encode?(code)
      # Zero-pad so that bytes below 0x10 still yield two hex digits
      opaque << '%' << code.to_s(16).rjust(2, '0').upcase
    else
      opaque << code.chr
    end
  end

  if options[:schemes].include?(scheme.downcase)
    if HTTP_STYLE_SCHEMES.include?(scheme.downcase) and !opaque.match(/^\/\//)
      # It's an HTTP-like scheme, but the two slashes are missing. We'll fix that as a courtesy.
      url = scheme + '://' + opaque
    else
      # Either the scheme doesn't need the two slashes, or the opaque portion already has them.
      url = scheme + ':' + opaque
    end
    return url
  else
    return options[:replace_evil_with]
  end
end
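The scheme handling in this method (split at the first colon, fall back to http, check an allow-list, repair missing slashes) can be sketched on its own. `check_scheme` and its two lists are hypothetical and deliberately shorter than the real defaults; the sketch omits entity decoding, scheme cleaning, and percent-encoding of the opaque part.

```ruby
# Hypothetical helper, not part of SanitizeUrl: scheme allow-list check only.
def check_scheme(url, replacement = '')
  allowed_schemes = %w[http https ftp mailto] # assumed allow-list
  slash_schemes = %w[http https ftp]          # assumed "scheme://" schemes

  match = url.match(/\A([^:]+?):(.*)\z/m)
  # With no colon present, treat the whole string as an http URL.
  scheme, rest = match ? [match[1], match[2]] : ['http', url]
  return replacement unless allowed_schemes.include?(scheme.downcase)

  if slash_schemes.include?(scheme.downcase) && !rest.start_with?('//')
    "#{scheme}://#{rest}" # add the two slashes as a courtesy
  else
    "#{scheme}:#{rest}"
  end
end

puts check_scheme('http://example.com/a')
puts check_scheme('example.com/a')
puts check_scheme('javascript:alert(1)', '#blocked')
puts check_scheme('mailto:someone@example.com')
```

Rejecting by allow-list rather than by a list of known-bad schemes means an unfamiliar scheme is replaced by default, which is the same choice the `:schemes` option expresses.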