Module: SanitizeUrl

Defined in:
lib/sanitize-url.rb

Overview

Helper methods in this module are module methods so that they won’t pollute the namespace into which the module is mixed in.

Constant Summary

ALPHANUMERIC_CHAR_CODES =
(48..57).to_a + (65..90).to_a + (97..122).to_a
VALID_OPAQUE_SPECIAL_CHARS =
['!', '*', "'", '(', ')', ';', ':', '@', '&', '=', '+', '$', ',', '/', '?', '%', '#', '[', ']', '-', '_', '.', '~']
VALID_OPAQUE_SPECIAL_CHAR_CODES =
VALID_OPAQUE_SPECIAL_CHARS.collect { |c| c[0].is_a?(String) ? c.ord : c[0] }
VALID_OPAQUE_CHAR_CODES =
ALPHANUMERIC_CHAR_CODES + VALID_OPAQUE_SPECIAL_CHAR_CODES
VALID_SCHEME_SPECIAL_CHARS =
['+', '.', '-']
VALID_SCHEME_SPECIAL_CHAR_CODES =
VALID_SCHEME_SPECIAL_CHARS.collect { |c| c[0].is_a?(String) ? c.ord : c[0] }
VALID_SCHEME_CHAR_CODES =
ALPHANUMERIC_CHAR_CODES + VALID_SCHEME_SPECIAL_CHAR_CODES
HTTP_STYLE_SCHEMES =

Common schemes whose format should be “scheme://” instead of “scheme:”

['http', 'https', 'ftp', 'ftps', 'svn', 'svn+ssh', 'git']

Class Method Details

.char_or_url_encoded(code) ⇒ Object

Return either the literal character or its URL-encoded equivalent, depending on our normalization rules. Requires an integer code point, which may lie outside the single-byte range; such code points are encoded as their UTF-8 byte sequence.



# File 'lib/sanitize-url.rb', line 94

def self.char_or_url_encoded(code) #:nodoc:
	if url_encode?(code)
		utf_8_str = ([code.to_i].pack('U'))
		length = utf_8_str.respond_to?(:bytes) ? utf_8_str.bytes.to_a.length : utf_8_str.length
		'%' + utf_8_str.unpack('H2' * length).join('%').upcase
	else
		code.chr
	end
end
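
As an illustration of the encoding branch above, here is a minimal standalone sketch of percent-encoding a single code point as its UTF-8 bytes. The method name `percent_encode_code_point` is hypothetical and not part of this module; unlike `char_or_url_encoded`, it always encodes and never returns the literal character.

```ruby
# Hypothetical helper, not part of SanitizeUrl.
# Packs the code point as UTF-8, then emits each byte as %XX (uppercase hex).
def percent_encode_code_point(code)
  [code.to_i].pack('U').bytes.map { |byte| format('%%%02X', byte) }.join
end

percent_encode_code_point(0x3C)   # "<" is a single byte in UTF-8
percent_encode_code_point(0x20AC) # the euro sign takes three bytes in UTF-8
```

Zero-padding each byte to two hex digits matters for small values such as a line feed (code 10), which must become `%0A` rather than `%A`.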

.dereference_numerics(str) ⇒ Object

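Replace decimal and hexadecimal numeric character references (e.g. &#106; or &#x6A;, with or without the trailing semicolon) with the literal character or its URL-encoded equivalent.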



# File 'lib/sanitize-url.rb', line 80

def self.dereference_numerics(str) #:nodoc:
	# Decimal code points, e.g. &#106; &#106 &#0000106
	str = str.gsub(/&#([0-9]+);?/) do
		char_or_url_encoded($1.to_i)
	end
	# Hex code points, e.g. &#x6A; &#x6A
	str.gsub(/&#[xX]([a-fA-F0-9]+);?/) do
		char_or_url_encoded($1.to_i(16))
	end
end
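
To show why numeric references are resolved before the scheme is inspected, the following standalone sketch decodes both forms in one pass. The name `decode_numeric_refs` is hypothetical; it returns the literal character in every case, whereas the module method above percent-encodes characters that are not allowed in URLs.

```ruby
# Hypothetical helper, not part of SanitizeUrl.
# Handles hex (&#x6A;) and decimal (&#106;) references; the semicolon is optional.
# Note: Array#pack('U') raises RangeError for values beyond the Unicode range.
def decode_numeric_refs(str)
  str.gsub(/&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));?/) do
    code = $1 ? $1.to_i(16) : $2.to_i
    [code].pack('U')
  end
end

decode_numeric_refs('&#106;avascript:alert(1)')
decode_numeric_refs('&#x6A;avascript:alert(1)')
```

Both calls yield a string starting with `javascript:`, which is the disguised scheme that a later scheme check needs to see.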

.url_encode?(code) ⇒ Boolean

Should we URL-encode the byte? Must receive an integer code point.

Returns:

  • (Boolean)


# File 'lib/sanitize-url.rb', line 106

def self.url_encode?(code) #:nodoc:
	!(
		(code >= 48 and code <= 57)  or   # Numbers
		(code >= 65 and code <= 90)  or   # Uppercase
		(code >= 97 and code <= 122) or   # Lowercase
		VALID_OPAQUE_CHAR_CODES.include?(code)
	)
end
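
Because VALID_OPAQUE_CHAR_CODES already contains the alphanumeric codes, the three range checks above are covered by the final `include?` test. A minimal standalone sketch of the same decision using a single set lookup follows; the `SKETCH_` constants and `sketch_url_encode?` are hypothetical names that mirror the constants documented above.

```ruby
require 'set'

# Hypothetical copies of the module's character lists, for illustration only.
SKETCH_SPECIAL_CHARS = ['!', '*', "'", '(', ')', ';', ':', '@', '&', '=', '+', '$',
                        ',', '/', '?', '%', '#', '[', ']', '-', '_', '.', '~']
SKETCH_ALLOWED_CODES = Set.new(
  (48..57).to_a + (65..90).to_a + (97..122).to_a + SKETCH_SPECIAL_CHARS.map(&:ord)
)

# True when the code point is outside the allowed set and must be percent-encoded.
def sketch_url_encode?(code)
  !SKETCH_ALLOWED_CODES.include?(code)
end

sketch_url_encode?(' '.ord) # space is not in the allowed set
sketch_url_encode?('a'.ord) # letters are allowed as-is
```

A Set gives constant-time membership checks, which may be preferable to scanning an Array once per byte of the URL.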

Instance Method Details

#sanitize_url(url, options = {}) ⇒ Object

Sanitize the URL. Example usage:

sanitize_url('javascript:alert("XSS")')
sanitize_url('javascript:alert("XSS")', :replace_evil_with => 'Replaced')
sanitize_url('ftp://example.com', :schemes => ['http', 'https'])

Raises:

  • (ArgumentError)


# File 'lib/sanitize-url.rb', line 20

def sanitize_url(url, options = {})
	raise(ArgumentError, 'options[:schemes] must be an array') if options.has_key?(:schemes) and !options[:schemes].is_a?(Array)
	options = {
		:replace_evil_with => '',
		:schemes => ['http', 'https', 'ftp', 'ftps', 'mailto', 'svn', 'svn+ssh', 'git']
	}.merge(options)
	
	url = SanitizeUrl.dereference_numerics(url)
	
	# Schemes can consist of letters, digits, or any of the following special chars: + . -
	# The scheme must begin with a letter and be terminated by a colon.
	# Everything after the scheme is opaque for our purposes. (See http://www.w3.org/DesignIssues/Axioms.html#opaque)
	
	# Try to match a URI with a scheme. We check for percent-encoded characters in the scheme.
	url.match(/^(.+?)(:|%3A)(.*)$/)
	dirty_scheme = $1
	if dirty_scheme
		unescaped_opaque = $3
		return options[:replace_evil_with] if unescaped_opaque.nil? or unescaped_opaque.empty? or unescaped_opaque.match(/^\/+$/)
	else
		# Use http as the best guess, and the rest of the URL will be considered opaque
		dirty_scheme = 'http'
		unescaped_opaque = url
	end
	# Remove URL encoding from the scheme
	dirty_scheme.gsub!(/%([a-zA-Z0-9]{2})/) do
		code = $1.to_i(16)
		VALID_SCHEME_CHAR_CODES.include?(code) ? code.chr : ''
	end
	
	# Clean the scheme by removing invalid characters
	scheme = ''
	dirty_scheme.each_byte do |code|
		scheme << code.chr if VALID_SCHEME_CHAR_CODES.include?(code)
	end
	
	# URL-encode the opaque portion as necessary. Only encode those bytes that are absolutely not allowed in URLs.
	opaque = ''
	unescaped_opaque.each_byte do |code|
		if SanitizeUrl.url_encode?(code)
			opaque << format('%%%02X', code)
		else
			opaque << code.chr
		end
	end
	
	if options[:schemes].collect { |s| s.to_s }.include?(scheme.downcase)
		if HTTP_STYLE_SCHEMES.include?(scheme.downcase) and !opaque.match(/^\/\//)
			# It's an HTTP-like scheme, but the two slashes are missing. We'll fix that as a courtesy.
			url = scheme + '://' + opaque
		else
			# Either the scheme doesn't need the two slashes, or the opaque portion already has them.
			url = scheme + ':' + opaque
		end
		return url
	else
		return options[:replace_evil_with]
	end
end
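
The scheme rule described in the comments above (a letter, followed by letters, digits, or `+`, `.`, `-`, terminated by a colon) can be sketched as a plain allowlist check. This is a simplified illustration, not the method's actual logic: `scheme_allowed?` is a hypothetical name, and unlike `sanitize_url` it neither decodes percent-encoded or entity-encoded schemes nor falls back to `http` when no scheme is present.

```ruby
# Hypothetical helper, not part of SanitizeUrl.
# Extracts a syntactically valid scheme and compares it, case-insensitively,
# against the allowed list.
def scheme_allowed?(url, allowed = %w[http https])
  scheme = url[/\A([A-Za-z][A-Za-z0-9+.\-]*):/, 1]
  !scheme.nil? && allowed.include?(scheme.downcase)
end

scheme_allowed?('HTTP://example.com')         # scheme matches after downcasing
scheme_allowed?('javascript:alert("XSS")')    # scheme is not in the allowed list
scheme_allowed?('svn+ssh://host/repo', %w[svn+ssh])
```

An allowlist is used rather than a blocklist because the set of dangerous schemes is open-ended, while the set of schemes an application wants to accept is usually small and known.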