Module: SanitizeUrl

Defined in:
lib/sanitize-url.rb

Overview

The helper methods are defined as module methods (called as SanitizeUrl.method_name) so that mixing the module into a class adds only sanitize_url to that class's namespace.

Constant Summary

ALPHANUMERIC_CHAR_CODES =
(48..57).to_a + (65..90).to_a + (97..122).to_a
VALID_OPAQUE_SPECIAL_CHARS =
['!', '*', "'", '(', ')', ';', ':', '@', '&', '=', '+', '$', ',', '/', '?', '%', '#', '[', ']', '-', '_', '.', '~']
VALID_OPAQUE_SPECIAL_CHAR_CODES =
VALID_OPAQUE_SPECIAL_CHARS.collect { |c| c[0].is_a?(String) ? c.ord : c[0] }
VALID_OPAQUE_CHAR_CODES =
ALPHANUMERIC_CHAR_CODES + VALID_OPAQUE_SPECIAL_CHAR_CODES
VALID_SCHEME_SPECIAL_CHARS =
['+', '.', '-']
VALID_SCHEME_SPECIAL_CHAR_CODES =
VALID_SCHEME_SPECIAL_CHARS.collect { |c| c[0].is_a?(String) ? c.ord : c[0] }
VALID_SCHEME_CHAR_CODES =
ALPHANUMERIC_CHAR_CODES + VALID_SCHEME_SPECIAL_CHAR_CODES
HTTP_STYLE_SCHEMES =

Common schemes whose format should be “scheme://” instead of “scheme:”

['http', 'https', 'ftp', 'ftps', 'svn', 'svn+ssh', 'git']
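To illustrate what "scheme://" versus "scheme:" means in practice, here is a minimal sketch of the joining rule. The names HTTP_STYLE and join_scheme are hypothetical and exist only for this example; the list is copied from HTTP_STYLE_SCHEMES.

```ruby
# Hypothetical illustration, not part of SanitizeUrl.
# Copy of the HTTP_STYLE_SCHEMES list, under a different name.
HTTP_STYLE = %w[http https ftp ftps svn svn+ssh git].freeze

# Join a scheme and the rest of the URL, adding "//" only for
# HTTP-style schemes that do not already have it.
def join_scheme(scheme, opaque)
  if HTTP_STYLE.include?(scheme.downcase) && !opaque.start_with?('//')
    "#{scheme}://#{opaque}"
  else
    "#{scheme}:#{opaque}"
  end
end

puts join_scheme('http', 'example.com')         # slashes are added
puts join_scheme('http', '//example.com')       # slashes already present
puts join_scheme('mailto', 'user@example.com')  # not an HTTP-style scheme
```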

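Class Method Summary

.char_or_url_encoded(code) ⇒ Object
.dereference_numerics(str) ⇒ Object
.url_encode?(code) ⇒ Boolean

Instance Method Summary

#sanitize_url(url, options = {}) ⇒ Object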

Class Method Details

.char_or_url_encoded(code) ⇒ Object

Return either the literal char or the URL-encoded equivalent, depending on our normalization rules. Requires a decimal code point. Code point can be outside the single-byte range.



# File 'lib/sanitize-url.rb', line 94

def self.char_or_url_encoded(code) #:nodoc:
  if url_encode?(code)
    utf_8_str = ([code.to_i].pack('U'))
    length = utf_8_str.respond_to?(:bytes) ? utf_8_str.bytes.to_a.length : utf_8_str.length
    '%' + utf_8_str.unpack('H2' * length).join('%').upcase
  else
    code.chr
  end
end
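The encoding branch above can be sketched on its own to show how a code point outside the single-byte range becomes several percent-encoded bytes. The helper name percent_encode_code_point is hypothetical, and the sketch assumes a Ruby version where String#bytes is available.

```ruby
# Hypothetical helper, not part of SanitizeUrl: percent-encode a single
# code point by way of its UTF-8 byte sequence.
def percent_encode_code_point(code)
  utf8 = [code].pack('U')  # code point -> UTF-8 encoded string
  utf8.bytes.map { |byte| format('%%%02X', byte) }.join
end

puts percent_encode_code_point(0x22)    # double quote, one byte
puts percent_encode_code_point(0xE9)    # e with acute accent, two bytes
puts percent_encode_code_point(0x20AC)  # euro sign, three bytes
```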

.dereference_numerics(str) ⇒ Object

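Replace decimal and hexadecimal numeric character references (e.g. &#106; or &#x6A;) with either the literal char or its URL-encoded equivalent, as decided by char_or_url_encoded.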



# File 'lib/sanitize-url.rb', line 80

def self.dereference_numerics(str) #:nodoc:
  # Decimal code points, e.g. &#106; or &#0000106 (both are "j")
  str = str.gsub(/&#([0-9]+);?/) do
    char_or_url_encoded($1.to_i)
  end
  # Hex code points, e.g. &#x6A; (also "j")
  str.gsub(/&#[xX]([a-fA-F0-9]+);?/) do
    char_or_url_encoded($1.to_i(16))
  end
end
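The reason for dereferencing first is that a scheme can be hidden behind numeric character references. The following is a rough sketch of that idea only: decode_numeric_references is a hypothetical stand-in that always emits the literal character and skips the URL-encoding decision made by char_or_url_encoded.

```ruby
# Hypothetical stand-in, not the library method: decode numeric character
# references to literal characters, with no URL-encoding step.
def decode_numeric_references(str)
  # Hex references such as &#x6A; (the semicolon is optional)
  str = str.gsub(/&#[xX]([0-9a-fA-F]+);?/) { [$1.to_i(16)].pack('U') }
  # Decimal references such as &#106; or &#0000106
  str.gsub(/&#([0-9]+);?/) { [$1.to_i].pack('U') }
end

puts decode_numeric_references('&#106;avascript:alert(1)')
puts decode_numeric_references('&#x6A;avascript:alert(1)')
```

Both calls reveal the same javascript: scheme, which is what the later scheme check needs to see.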

.url_encode?(code) ⇒ Boolean

Should we URL-encode the byte? Must receive an integer code point.

Returns:

  • (Boolean)


# File 'lib/sanitize-url.rb', line 106

def self.url_encode?(code) #:nodoc:
  !(
    (code >= 48 and code <= 57)  or   # Numbers
    (code >= 65 and code <= 90)  or   # Uppercase
    (code >= 97 and code <= 122) or   # Lowercase
    VALID_OPAQUE_CHAR_CODES.include?(code)
  )
end
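Since VALID_OPAQUE_CHAR_CODES already contains the alphanumeric codes, the same decision can be expressed as a single set lookup. This is a sketch under that assumption; SAFE_CODES and needs_encoding? are hypothetical names, and the special characters are copied from VALID_OPAQUE_SPECIAL_CHARS.

```ruby
require 'set'

# Hypothetical equivalent of the check, not part of SanitizeUrl.
SAFE_SPECIALS = '!*\'();:@&=+$,/?%#[]-_.~'.freeze
SAFE_CODES = (
  ('0'..'9').to_a + ('A'..'Z').to_a + ('a'..'z').to_a + SAFE_SPECIALS.each_char.to_a
).map(&:ord).to_set

# True when the integer code must be percent-encoded.
def needs_encoding?(code)
  !SAFE_CODES.include?(code)
end

puts needs_encoding?('a'.ord)  # letters pass through
puts needs_encoding?('/'.ord)  # so do the listed special characters
puts needs_encoding?(' '.ord)  # a space must be encoded
puts needs_encoding?('"'.ord)  # as must a double quote
```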

Instance Method Details

#sanitize_url(url, options = {}) ⇒ Object

Sanitize the URL. Example usage:

sanitize_url('javascript:alert("XSS")')
sanitize_url('javascript:alert("XSS")', :replace_evil_with => 'Replaced')
sanitize_url('ftp://example.com', :schemes => ['http', 'https'])

Raises:

  • (ArgumentError)


# File 'lib/sanitize-url.rb', line 20

def sanitize_url(url, options = {})
  raise(ArgumentError, 'options[:schemes] must be an array') if options.has_key?(:schemes) and !options[:schemes].is_a?(Array)
  options = {
    :replace_evil_with => '',
    :schemes => ['http', 'https', 'ftp', 'ftps', 'mailto', 'svn', 'svn+ssh', 'git']
  }.merge(options)
  
  url = SanitizeUrl.dereference_numerics(url)
  
  # Schemes can consist of letters, digits, or any of the following special chars: + . -
  # The scheme must begin with a letter and be terminated by a colon.
  # Everything after the scheme is opaque for our purposes. (See http://www.w3.org/DesignIssues/Axioms.html#opaque)
  
  # Try to match a URI with a scheme. We check for percent-encoded characters in the scheme.
  url.match(/^(.+?)(:|%3A)(.*)$/)
  dirty_scheme = $1
  if dirty_scheme
    unescaped_opaque = $3
    return options[:replace_evil_with] if unescaped_opaque.nil? or unescaped_opaque.empty? or unescaped_opaque.match(/^\/+$/)
  else
    # Use http as the best guess, and the rest of the URL will be considered opaque
    dirty_scheme = 'http'
    unescaped_opaque = url
  end
  # Remove URL encoding from the scheme
  dirty_scheme.gsub!(/%([a-zA-Z0-9]{2})/) do
    code = $1.to_i(16)
    VALID_SCHEME_CHAR_CODES.include?(code) ? code.chr : ''
  end
  
  # Clean the scheme by removing invalid characters
  scheme = ''
  dirty_scheme.each_byte do |code|
    scheme << code.chr if VALID_SCHEME_CHAR_CODES.include?(code)
  end
  
  # URL-encode the opaque portion as necessary. Only encode those bytes that are absolutely not allowed in URLs.
  opaque = ''
  unescaped_opaque.each_byte do |code|
    if SanitizeUrl.url_encode?(code)
      # Zero-pad so that bytes below 0x10 still yield two hex digits (e.g. %0A)
      opaque << '%' << code.to_s(16).upcase.rjust(2, '0')
    else
      opaque << code.chr
    end
  end
  
  if options[:schemes].collect { |s| s.to_s }.include?(scheme.downcase)
    if HTTP_STYLE_SCHEMES.include?(scheme.downcase) and !opaque.match(/^\/\//)
      # It's an HTTP-like scheme, but the two slashes are missing. We'll fix that as a courtesy.
      url = scheme + '://' + opaque
    else
      # Either the scheme doesn't need the two slashes, or the opaque portion already has them.
      url = scheme + ':' + opaque
    end
    return url
  else
    return options[:replace_evil_with]
  end
end
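The final allow-list decision can be looked at in isolation. The sketch below is a deliberate simplification: it skips the dereferencing, scheme cleaning and opaque-part encoding done above. DEFAULT_SCHEMES and allowed_scheme? are hypothetical names, with the list copied from the :schemes default.

```ruby
# Hypothetical simplification, not the library method.
DEFAULT_SCHEMES = %w[http https ftp ftps mailto svn svn+ssh git].freeze

# True when the URL's scheme (or the assumed "http" when no scheme is
# present) appears in the allow-list.
def allowed_scheme?(url, schemes = DEFAULT_SCHEMES)
  match = url.match(/\A(.+?)(?::|%3A)/i)
  scheme = match ? match[1] : 'http'  # same fallback as sanitize_url
  schemes.map(&:to_s).include?(scheme.downcase)
end

puts allowed_scheme?('http://example.com')
puts allowed_scheme?('javascript:alert("XSS")')
puts allowed_scheme?('ftp://example.com', %w[http https])
```

An allow-list is used here rather than a deny-list, so an unknown scheme is rejected by default.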