Class: Arachni::URI

Inherits:
Object show all
Extended by:
Arachni::UI::Output, Utilities
Includes:
Arachni::UI::Output, Utilities
Defined in:
lib/arachni/uri.rb,
lib/arachni/uri/scope.rb

Overview

The URI class automatically normalizes the URLs it is passed to parse while maintaining compatibility with Ruby’s URI core class by delegating missing methods to it – thus, you can treat it like a Ruby URI and enjoy some extra perks along the way.

It also provides cached (to maintain a low latency) helper class methods to ease common operations such as:

Author:

Defined Under Namespace

Classes: Error, Scope

Constant Summary collapse

CACHE_SIZES =
{
    parse:       600,
    ruby_parse:  600,
    fast_parse:  600,
    normalize:   1000,
    to_absolute: 1000
}
CACHE =
{
    parser:      ::URI::Parser.new,
    ruby_parse:  Support::Cache::RandomReplacement.new( CACHE_SIZES[:ruby_parse] ),
    parse:       Support::Cache::RandomReplacement.new( CACHE_SIZES[:parse] ),
    fast_parse:  Support::Cache::RandomReplacement.new( CACHE_SIZES[:fast_parse] ),
    normalize:   Support::Cache::RandomReplacement.new( CACHE_SIZES[:normalize] ),
    to_absolute: Support::Cache::RandomReplacement.new( CACHE_SIZES[:to_absolute] )
}

Class Method Summary collapse

Instance Method Summary collapse

Methods included from Arachni::UI::Output

debug?, debug_off, debug_on, disable_only_positives, included, mute, muted?, only_positives, only_positives?, print_bad, print_debug, print_debug_backtrace, print_debug_level_1, print_debug_level_2, print_debug_level_3, print_error, print_error_backtrace, print_exception, print_info, print_line, print_ok, print_status, print_verbose, reroute_to_file, reroute_to_file?, reset_output_options, unmute, verbose?, verbose_on

Methods included from Utilities

available_port, caller_name, caller_path, cookie_decode, cookie_encode, cookies_from_document, cookies_from_file, cookies_from_response, exception_jail, exclude_path?, follow_protocol?, form_decode, form_encode, forms_from_document, forms_from_response, generate_token, get_path, hms_to_seconds, html_decode, html_encode, include_path?, links_from_document, links_from_response, normalize_url, page_from_response, page_from_url, parse_set_cookie, path_in_domain?, path_too_deep?, port_available?, rand_port, random_seed, redundant_path?, remove_constants, request_parse_body, seconds_to_hms, skip_page?, skip_path?, skip_resource?, skip_response?, uri_decode, uri_encode, uri_parse, uri_parse_query, uri_parser, uri_rewrite

Constructor Details

#initialize(url) ⇒ URI

Note:

Will discard the fragment component, if there is one.

Normalizes and parses the provided URL.

Parameters:



477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
# File 'lib/arachni/uri.rb', line 477

def initialize( url )
    @parsed_url = case url
                      when String
                          self.class.ruby_parse( url )

                      when ::URI
                          url.dup

                      when Hash
                          ::URI::Generic.build( url )

                      when Arachni::URI
                          self.parsed_url = url.parsed_url.dup

                      else
                          to_string = url.to_s rescue ''
                          msg = 'Argument must either be String, URI or Hash'
                          msg << " -- #{url.class.name} '#{to_string}' passed."
                          fail ArgumentError.new( msg )
                  end

    fail Error, 'Failed to parse URL.' if !@parsed_url

    # We probably got it from the cache, dup it to avoid corrupting the cache
    # entries.
    @parsed_url = @parsed_url.dup
end

Dynamic Method Handling

This class handles dynamic methods through the method_missing method

#method_missing(sym, *args, &block) ⇒ Object

Delegates unimplemented methods to Ruby’s ‘URI::Generic` class for compatibility.



649
650
651
652
653
654
655
# File 'lib/arachni/uri.rb', line 649

def method_missing( sym, *args, &block )
    if @parsed_url.respond_to?( sym )
        @parsed_url.send( sym, *args, &block )
    else
        super
    end
end

Class Method Details

._load(url) ⇒ Object



635
636
637
# File 'lib/arachni/uri.rb', line 635

def self._load( url )
    new url
end

.addressable_parse(url) ⇒ Hash

Note:

The Hash is suitable for passing to ‘::URI::Generic.build` – if however you plan on doing that you’ll be better off just using ruby_parse which does the same thing and caches the results for some extra schnell.

Performs a parse using the ‘URI::Addressable` lib while normalizing the URL (will also discard the fragment).

This method is not cached and solely exists as a fallback used by fast_parse.

Parameters:

Returns:

  • (Hash)

    URL components:

    * `:scheme` -- HTTP or HTTPS
    * `:userinfo` -- `username:password`
    * `:host`
    * `:port`
    * `:path`
    * `:query`
    


337
338
339
340
341
342
343
344
345
346
347
348
# File 'lib/arachni/uri.rb', line 337

def addressable_parse( url )
    u = Addressable::URI.parse( html_decode( url.to_s ) ).normalize
    u.fragment = nil
    h = u.to_hash
    
    h[:path].gsub!( /\/+/, '/' ) if h[:path]
    if h[:user]
        h[:userinfo] = h.delete( :user )
        h[:userinfo] << ":#{h.delete( :password )}" if h[:password]
    end
    h
end

.decode(string) ⇒ String

URL decodes a string.

Parameters:

Returns:



94
95
96
# File 'lib/arachni/uri.rb', line 94

def decode( string )
    Addressable::URI.unencode( string )
end

.encode(string, bad_characters = nil) ⇒ String

URL encodes a string.

Parameters:

  • string (String)
  • bad_characters (String, Regexp) (defaults to: nil)

    Class of characters to encode – if String is passed, it should formatted as a regexp (for ‘Regexp.new`).

Returns:

  • (String)

    Encoded string.



85
86
87
# File 'lib/arachni/uri.rb', line 85

def encode( string, bad_characters = nil )
    Addressable::URI.encode_component( *[string, bad_characters].compact )
end

.fast_parse(url) ⇒ Hash

Note:

This method’s results are cached for performance reasons. If you plan on doing something destructive with its return value duplicate it first because there may be references to it elsewhere.

Note:

The Hash is suitable for passing to ‘::URI::Generic.build` – if however you plan on doing that you’ll be better off just using ruby_parse which does the same thing and caches the results for some extra schnell.

Performs a parse that is less resource intensive than Ruby’s URI lib’s method while normalizing the URL (will also discard the fragment and path parameters).

Parameters:

Returns:

  • (Hash)

    URL components (frozen):

    * `:scheme` -- HTTP or HTTPS
    * `:userinfo` -- `username:password`
    * `:host`
    * `:port`
    * `:path`
    * `:query`
    


170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
# File 'lib/arachni/uri.rb', line 170

def fast_parse( url )
    return if !url || url.empty?
    return if url.downcase.start_with? 'javascript:'
    
    cache = CACHE[__method__]
    
    url = url.to_s.dup
    
    # Remove the fragment if there is one.
    url   = url.split( '#', 2 )[0...-1].join if url.include?( '#' )
    c_url = url.to_s.dup
    
    components = {
        scheme:   nil,
        userinfo: nil,
        host:     nil,
        port:     nil,
        path:     nil,
        query:    nil
    }
    
    valid_schemes = %w(http https)
    
    begin
        if (v = cache[url]) && v == :err
            return
        elsif v
            return v
        end
    
        # We're not smart enough for scheme-less URLs and if we're to go
        # into heuristics then there's no reason to not just use
        # Addressable's parser.
        if url.start_with?( '//' )
            return cache[c_url] = addressable_parse( c_url ).freeze
        end
    
        url = url.recode
        url = html_decode( url )
    
        dupped_url = url.dup
        has_path = true
    
        splits = url.split( ':' )
        if !splits.empty? && valid_schemes.include?( splits.first.downcase )
            splits = url.split( '://', 2 )
            components[:scheme] = splits.shift
            components[:scheme].downcase! if components[:scheme]
    
            if url = splits.shift
                splits = url.to_s.split( '?' ).first.to_s.split( '@', 2 )
    
                if splits.size > 1
                    components[:userinfo] = splits.first
                    url = splits.shift
                end
    
                if !splits.empty?
                    splits = splits.last.split( '/', 2 )
                    url = splits.last
    
                    splits = splits.first.split( ':', 2 )
                    if splits.size == 2
                        host = splits.first

                        if splits.last && !splits.last.empty?
                            components[:port] = Integer( splits.last )
                        end

                        if components[:port] == 80
                            components[:port] = nil
                        end

                        url.gsub!( ':' + components[:port].to_s, '' )
                    else
                        host = splits.last
                    end
    
                    if components[:host] = host
                        url.gsub!( host, '' )
                        components[:host].downcase!
                    end
                else
                    has_path = false
                end
            else
                has_path = false
            end
        end
    
        if has_path
            splits = url.split( '?', 2 )
            if (components[:path] = splits.shift)
                if components[:scheme]
                    components[:path] = '/' + components[:path]
                end

                components[:path].gsub!( /\/+/, '/' )
    
                # Remove path params
                components[:path] = components[:path].split( ';', 2 ).first
    
                if components[:path]
                    components[:path] =
                        encode( decode( components[:path] ),
                                Addressable::URI::CharacterClasses::PATH )
    
                    components[:path] = ::URI.encode( components[:path], ';' )
                end
            end
    
            if c_url.include?( '?' ) &&
                !(query = dupped_url.split( '?', 2 ).last).empty?

                components[:query] = (query.split( '&', -1 ).map do |pair|
                    Addressable::URI.normalize_component( pair,
                        Addressable::URI::CharacterClasses::QUERY.sub( '\\&', '' )
                    )
                end).join( '&' )
            end
        end
    
        components[:path] ||= components[:scheme] ? '/' : nil
    
        components.values.each(&:freeze)
    
        cache[c_url] = components.freeze
    rescue => e
        begin
            print_debug "Failed to fast-parse '#{c_url}', falling back to slow-parse."
            print_debug "Error: #{e}"
            print_debug_backtrace( e )
    
            cache[c_url] = addressable_parse( c_url ).freeze
        rescue => ex
            print_debug "Failed to parse '#{c_url}'."
            print_debug "Error: #{ex}"
            print_debug_backtrace( ex )
    
            cache[c_url] = :err
            nil
        end
    end
end

.normalize(url) ⇒ String

Note:

This method’s results are cached for performance reasons. If you plan on doing something destructive with its return value duplicate it first because there may be references to it elsewhere.

Uses fast_parse to parse and normalize the URL and then converts it to a common String format.

Parameters:

Returns:

  • (String)

    Normalized URL (frozen).



403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
# File 'lib/arachni/uri.rb', line 403

def normalize( url )
    return if !url || url.empty?
    
    cache = CACHE[__method__]
    
    url   = url.to_s.strip.dup
    c_url = url.to_s.strip.dup
    
    begin
        if (v = cache[url]) && v == :err
            return
        elsif v
            return v
        end
    
        components = fast_parse( url )
    
        normalized = ''
        normalized << components[:scheme] + '://' if components[:scheme]
    
        if components[:userinfo]
            normalized << components[:userinfo]
            normalized << '@'
        end
    
        if components[:host]
            normalized << components[:host]
            normalized << ':' + components[:port].to_s if components[:port]
        end
    
        normalized << components[:path] if components[:path]
        normalized << '?' + components[:query] if components[:query]
    
        cache[c_url] = normalized.freeze
    rescue => e
        print_debug "Failed to normalize '#{c_url}'."
        print_debug "Error: #{e}"
        print_debug_backtrace( e )
    
        cache[c_url] = :err
        nil
    end
end

.parse(url) ⇒ Object

Note:

This method’s results are cached for performance reasons. If you plan on doing something destructive with its return value duplicate it first because there may be references to it elsewhere.

Cached version of #initialize, if there’s a chance that the same URL will be needed to be parsed multiple times you should use this method.

See Also:



106
107
108
109
110
111
112
113
114
115
116
# File 'lib/arachni/uri.rb', line 106

def parse( url )
    return url if !url || url.is_a?( Arachni::URI )
    CACHE[__method__][url] ||= begin
        new( url )
    rescue => e
        print_debug "Failed to parse '#{url}'."
        print_debug "Error: #{e}"
        print_debug_backtrace( e )
        nil
    end
end

.parse_query(url) ⇒ Hash

Extracts inputs from a URL query.

Parameters:

Returns:



462
463
464
465
466
467
# File 'lib/arachni/uri.rb', line 462

def parse_query( url )
    parsed = parse( url )
    return {} if !parsed

    parse( url ).query_parameters
end

.parserURI::Parser

Returns cached URI parser.

Returns:

  • (URI::Parser)

    cached URI parser



72
73
74
# File 'lib/arachni/uri.rb', line 72

def parser
    CACHE[__method__]
end

.rewrite(url, rules = Arachni::Options.scope.url_rewrites) ⇒ String

Returns Rewritten URL.

Parameters:

  • url (String)
  • rules (Hash<Regexp => String>) (defaults to: Arachni::Options.scope.url_rewrites)

    Regular expression and substitution pairs.

Returns:



453
454
455
# File 'lib/arachni/uri.rb', line 453

def rewrite( url, rules = Arachni::Options.scope.url_rewrites )
    parse( url ).rewrite( rules ).to_s
end

.ruby_parse(url) ⇒ URI

Note:

This method’s results are cached for performance reasons. If you plan on doing something destructive with its return value duplicate it first because there may be references to it elsewhere.

Normalizes ‘url` and uses Ruby’s core URI lib to parse it.

Parameters:

  • url (String)

    URL to parse

Returns:



128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
# File 'lib/arachni/uri.rb', line 128

def ruby_parse( url )
    return url if url.to_s.empty? || url.is_a?( ::URI )
    return if url.downcase.start_with? 'javascript:'
    
    CACHE[__method__][url] ||= begin
        ::URI::Generic.build( fast_parse( url ) )
    rescue
        begin
            parser.parse( normalize( url ).dup )
        rescue => e
            print_debug "Failed to parse '#{url}'."
            print_debug "Error: #{e}"
            print_debug_backtrace( e )
            nil
        end
    end
end

.to_absolute(relative, reference = Options.instance.url.to_s) ⇒ String

Note:

This method’s results are cached for performance reasons. If you plan on doing something destructive with its return value duplicate it first because there may be references to it elsewhere.

Normalizes and converts a ‘relative` URL to an absolute one by merging in with a `reference` URL.

Pretty much a cached version of #to_absolute.

Parameters:

  • relative (String)
  • reference (String) (defaults to: Options.instance.url.to_s)

    Absolute url to use as a reference.

Returns:

  • (String)

    Absolute URL (frozen).



365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
# File 'lib/arachni/uri.rb', line 365

def to_absolute( relative, reference = Options.instance.url.to_s )
    return reference if !relative || relative.empty?
    key = relative + ' :: ' + reference
    
    cache = CACHE[__method__]
    begin
        if (v = cache[key]) && v == :err
            return
        elsif v
            return v
        end
    
        parsed_ref = parse( reference )

        if relative.start_with?( '//' )
            # Scheme-less URLs are expensive to parse so let's resolve
            # the issue here.
            relative = "#{parsed_ref.scheme}:#{relative}"
        end

        cache[key] = parse( relative ).to_absolute( parsed_ref ).to_s.freeze
    rescue
        cache[key] = :err
        nil
    end
end

Instance Method Details

#==(other) ⇒ Object



510
511
512
# File 'lib/arachni/uri.rb', line 510

def ==( other )
    to_s == other.to_s
end

#_dump(_) ⇒ Object



631
632
633
# File 'lib/arachni/uri.rb', line 631

def _dump( _ )
    to_s
end

#domainString

Returns ‘domain_name.tld`.

Returns:

  • (String)

    ‘domain_name.tld`



565
566
567
568
569
570
571
572
573
# File 'lib/arachni/uri.rb', line 565

def domain
    return host if ip_address?

    s = host.split( '.' )
    return s.first if s.size == 1
    return host if s.size == 2

    s[1..-1].join( '.' )
end

#dupObject



627
628
629
# File 'lib/arachni/uri.rb', line 627

def dup
    self.class.new( to_s )
end

#hashObject



639
640
641
# File 'lib/arachni/uri.rb', line 639

def hash
    to_s.hash
end

#ip_address?Boolean

Returns ‘true` if the URI contains an IP address, `false` otherwise.

Returns:

  • (Boolean)

    ‘true` if the URI contains an IP address, `false` otherwise.



594
595
596
# File 'lib/arachni/uri.rb', line 594

def ip_address?
    !(IPAddr.new( host ) rescue nil).nil?
end

#mailto?Boolean

Returns:

  • (Boolean)


598
599
600
# File 'lib/arachni/uri.rb', line 598

def mailto?
    scheme == 'mailto'
end

#persistent_hashObject



643
644
645
# File 'lib/arachni/uri.rb', line 643

def persistent_hash
    to_s.persistent_hash
end

#query=(q) ⇒ Object



602
603
604
605
606
607
# File 'lib/arachni/uri.rb', line 602

def query=( q )
    q = q.to_s
    q = nil if q.empty?

    @parsed_url.query = q
end

#query_parametersHash

Returns Extracted inputs from a URL query.

Returns:

  • (Hash)

    Extracted inputs from a URL query.



611
612
613
614
615
616
617
618
619
620
# File 'lib/arachni/uri.rb', line 611

def query_parameters
    q = self.query
    return {} if q.to_s.empty?

    q.split( '&' ).inject( {} ) do |h, pair|
        name, value = pair.split( '=', 2 )
        h[::URI.decode( name.to_s )] = ::URI.decode( value.to_s )
        h
    end
end

#resource_extensionString

Returns The extension of the URI resource.

Returns:

  • (String)

    The extension of the URI resource.



542
543
544
545
546
# File 'lib/arachni/uri.rb', line 542

def resource_extension
    resource_name = path.split( '/' ).last.to_s
    return if !resource_name.include?( '.' )
    resource_name.split( '.' ).last
end

#respond_to?(*args) ⇒ Boolean

Returns:

  • (Boolean)


657
658
659
# File 'lib/arachni/uri.rb', line 657

def respond_to?( *args )
    super || @parsed_url.respond_to?( *args )
end

#rewrite(rules = Arachni::Options.scope.url_rewrites) ⇒ URI

Returns Rewritten URL.

Parameters:

  • rules (Hash<Regexp => String>) (defaults to: Arachni::Options.scope.url_rewrites)

    Regular expression and substitution pairs.

Returns:

  • (URI)

    Rewritten URL.



580
581
582
583
584
585
586
587
588
589
590
# File 'lib/arachni/uri.rb', line 580

def rewrite( rules = Arachni::Options.scope.url_rewrites )
    as_string = self.to_s

    rules.each do |args|
        if (rewritten = as_string.gsub( *args )) != as_string
            return Arachni::URI( rewritten )
        end
    end

    self.dup
end

#scopeScope

Returns:



506
507
508
# File 'lib/arachni/uri.rb', line 506

def scope
    @scope ||= Scope.new( self )
end

#to_absolute(reference) ⇒ Arachni::URI

Converts self into an absolute URL using ‘reference` to fill in the missing data.

Parameters:

Returns:



522
523
524
525
526
527
528
529
530
531
532
533
# File 'lib/arachni/uri.rb', line 522

def to_absolute( reference )
    absolute = case reference
                   when Arachni::URI
                       reference.parsed_url
                   when ::URI
                       reference
                   else
                       self.class.new( reference.to_s ).parsed_url
               end.merge( @parsed_url )

    self.class.new( absolute )
end

#to_sString

Returns:



623
624
625
# File 'lib/arachni/uri.rb', line 623

def to_s
    @parsed_url.to_s
end

#up_to_pathString

Returns The URL up to its path component (no resource name, query, fragment, etc).

Returns:

  • (String)

    The URL up to its path component (no resource name, query, fragment, etc).



550
551
552
553
554
555
556
557
558
559
560
561
# File 'lib/arachni/uri.rb', line 550

def up_to_path
    return if !path
    uri_path = path.dup

    uri_path = File.dirname( uri_path ) if !File.extname( path ).empty?

    uri_path << '/' if uri_path[-1] != '/'

    uri_str = "#{scheme}://#{host}"
    uri_str << ':' + port.to_s if port && port != 80
    uri_str << uri_path
end

#without_queryString

Returns The URL up to its resource component (query, fragment, etc).

Returns:

  • (String)

    The URL up to its resource component (query, fragment, etc).



537
538
539
# File 'lib/arachni/uri.rb', line 537

def without_query
    to_s.split( '?', 2 ).first.to_s
end