Class: Arachni::URI

Inherits:
Object
Extended by:
Arachni::UI::Output, Utilities
Includes:
Arachni::UI::Output, Utilities
Defined in:
lib/arachni/uri.rb,
lib/arachni/uri/scope.rb

Overview

The URI class automatically normalizes the URLs it is passed to parse while maintaining compatibility with Ruby’s URI core class by delegating missing methods to it – thus, you can treat it like a Ruby URI and enjoy some extra perks along the way.

It also provides cached helper class methods (cached to keep latency low) for common operations such as parsing, normalization, encoding, decoding and conversion to absolute URLs.

Defined Under Namespace

Classes: Error, Scope

Constant Summary

CACHE_SIZES =
{
    parse:       1000,
    ruby_parse:  1000,
    fast_parse:  1000,
    encode:      1000,
    decode:      1000,
    normalize:   1000,
    to_absolute: 1000
}
CACHE =
{
    parser: ::URI::Parser.new
}

Methods included from Arachni::UI::Output

debug?, debug_off, debug_on, disable_only_positives, included, mute, muted?, only_positives, only_positives?, print_bad, print_debug, print_debug_backtrace, print_debug_level_1, print_debug_level_2, print_debug_level_3, print_error, print_error_backtrace, print_exception, print_info, print_line, print_ok, print_status, print_verbose, reroute_to_file, reroute_to_file?, reset_output_options, unmute, verbose?, verbose_on

Methods included from Utilities

available_port, bytes_to_kilobytes, bytes_to_megabytes, caller_name, caller_path, cookie_decode, cookie_encode, cookies_from_document, cookies_from_file, cookies_from_response, exception_jail, exclude_path?, follow_protocol?, form_decode, form_encode, forms_from_document, forms_from_response, full_and_absolute_url?, generate_token, get_path, hms_to_seconds, html_decode, html_encode, include_path?, links_from_document, links_from_response, normalize_url, page_from_response, page_from_url, parse_set_cookie, path_in_domain?, path_too_deep?, port_available?, rand_port, random_seed, redundant_path?, regexp_array_match, remove_constants, request_parse_body, seconds_to_hms, skip_page?, skip_path?, skip_resource?, skip_response?, uri_decode, uri_encode, uri_parse, uri_parse_query, uri_parser, uri_rewrite

Constructor Details

#initialize(url) ⇒ URI

Note:

Will discard the fragment component, if there is one.

Normalizes and parses the provided URL.

Parameters:

  • url (String, URI, Hash, Arachni::URI)

    URL to normalize and parse.



# File 'lib/arachni/uri.rb', line 494

def initialize( url )
    @parsed_url = case url
                      when String
                          self.class.ruby_parse( url )

                      when ::URI
                          url

                      when Hash
                          ::URI::Generic.build( url )

                      when Arachni::URI
                          self.parsed_url = url.parsed_url

                      else
                          to_string = url.to_s rescue ''
                          msg = 'Argument must either be String, URI or Hash'
                          msg << " -- #{url.class.name} '#{to_string}' passed."
                          fail ArgumentError.new( msg )
                  end

    fail Error, 'Failed to parse URL.' if !@parsed_url

    # We probably got it from the cache, dup it to avoid corrupting the cache
    # entries.
    @parsed_url = @parsed_url.dup
end

Dynamic Method Handling

This class handles dynamic methods through the method_missing method

#method_missing(sym, *args, &block) ⇒ Object

Delegates unimplemented methods to Ruby’s `URI::Generic` class for compatibility.



# File 'lib/arachni/uri.rb', line 675

def method_missing( sym, *args, &block )
    if @parsed_url.respond_to?( sym )
        @parsed_url.send( sym, *args, &block )
    else
        super
    end
end
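
The delegation pattern above can be tried in isolation. The following is a minimal sketch, assuming a hypothetical `WrappedURI` class that wraps Ruby's standard `URI` object; it is not Arachni's class, and `respond_to_missing?` is used here as the idiomatic companion to `method_missing`.

```ruby
require 'uri'

# Hypothetical wrapper, for illustration only (not Arachni::URI).
class WrappedURI
  def initialize(url)
    @parsed_url = ::URI.parse(url)
  end

  # Forward anything we do not implement to the wrapped URI object.
  def method_missing(sym, *args, &block)
    if @parsed_url.respond_to?(sym)
      @parsed_url.send(sym, *args, &block)
    else
      super
    end
  end

  # Keep respond_to? consistent with the delegation above.
  def respond_to_missing?(sym, include_private = false)
    @parsed_url.respond_to?(sym, include_private) || super
  end
end

uri = WrappedURI.new('http://example.com:8080/a/b?c=d')
puts uri.host # example.com
puts uri.port # 8080
```

Calling a method that neither object implements still raises `NoMethodError`, because the `else` branch hands control back to `super`.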

Class Method Details

._load(url) ⇒ Object



# File 'lib/arachni/uri.rb', line 661

def self._load( url )
    new url
end

.addressable_parse(url) ⇒ Hash

Note:

The Hash is suitable for passing to `::URI::Generic.build`; if you plan on doing that, however, you are better off using ruby_parse, which does the same thing and caches the results for extra speed.

Performs a parse using the `Addressable::URI` lib while normalizing the URL (will also discard the fragment).

This method is not cached and solely exists as a fallback used by fast_parse.

Parameters:

  • url (String)

Returns:

  • (Hash)

    URL components:

    * `:scheme` -- HTTP or HTTPS
    * `:userinfo` -- `username:password`
    * `:host`
    * `:port`
    * `:path`
    * `:query`
    


# File 'lib/arachni/uri.rb', line 340

def addressable_parse( url )
    u = Addressable::URI.parse( html_decode( url.to_s ) ).normalize
    u.fragment = nil
    h = u.to_hash

    h[:path].gsub!( /\/+/, '/' ) if h[:path]
    if h[:user]
        h[:userinfo] = h.delete( :user )
        h[:userinfo] << ":#{h.delete( :password )}" if h[:password]
    end
    h
end
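
As a rough illustration of the note above, a components Hash in this shape can be handed to Ruby's standard `::URI::Generic.build`. The values below are made up for the example, and the Hash is written by hand rather than produced by Arachni.

```ruby
require 'uri'

# Hand-written components Hash using the keys listed above
# (example values, not output from addressable_parse).
components = {
  scheme:   'http',
  userinfo: nil,
  host:     'example.com',
  port:     8080,
  path:     '/a/b',
  query:    'c=d'
}

uri = ::URI::Generic.build(components)
puts uri.to_s # http://example.com:8080/a/b?c=d
```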

.decode(string) ⇒ String

URL decodes a string.

Parameters:

  • string (String)

Returns:

  • (String)



# File 'lib/arachni/uri.rb', line 95

def decode( string )
    CACHE[__method__][string] ||= Addressable::URI.unencode( string )
end

.encode(string, good_characters = nil) ⇒ String

URL encodes a string.

Parameters:

  • string (String)
  • good_characters (String, Regexp) (defaults to: nil)

    Class of characters to allow; if a String is passed, it should be formatted as a regexp (for `Regexp.new`).

Returns:

  • (String)

    Encoded string.



# File 'lib/arachni/uri.rb', line 85

def encode( string, good_characters = nil )
    CACHE[__method__][[string, good_characters]] ||=
        Addressable::URI.encode_component( *[string, good_characters].compact )
end
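
Both `decode` and `encode` memoize their results with `||=`. The sketch below shows the same idea in isolation, under two assumptions: a plain Hash (`DECODE_CACHE`, a hypothetical name) stands in for the cache, and the standard library's `URI.decode_www_form_component` is swapped in for Addressable's decoder.

```ruby
require 'uri'

# Plain Hash standing in for the cache (assumption for this sketch).
DECODE_CACHE = {}

# Decode once per distinct input; later calls return the stored object.
def cached_decode(string)
  DECODE_CACHE[string] ||= URI.decode_www_form_component(string).freeze
end

first  = cached_decode('a%20b')
second = cached_decode('a%20b')

puts first                 # a b
puts first.equal?(second)  # true

# Callers share one object, so copy it before changing it.
copy = first.dup
copy << '!'
```

Because every caller receives the same object, a destructive change to it would be visible to all of them, which is why the cached class methods on this page advise duplicating return values first.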

.fast_parse(url) ⇒ Hash

Note:

This method’s results are cached for performance reasons. If you plan on doing something destructive with its return value duplicate it first because there may be references to it elsewhere.

Note:

The Hash is suitable for passing to `::URI::Generic.build`; if you plan on doing that, however, you are better off using ruby_parse, which does the same thing and caches the results for extra speed.

Performs a parse that is less resource intensive than Ruby’s URI lib’s method while normalizing the URL (will also discard the fragment and path parameters).

Parameters:

  • url (String)

Returns:

  • (Hash)

    URL components (frozen):

    * `:scheme` -- HTTP or HTTPS
    * `:userinfo` -- `username:password`
    * `:host`
    * `:port`
    * `:path`
    * `:query`
    


# File 'lib/arachni/uri.rb', line 171

def fast_parse( url )
    return if !url || url.empty?
    return if url.downcase.start_with? 'javascript:'

    cache = CACHE[__method__]

    url = url.to_s.dup

    # Remove the fragment if there is one.
    if url.include?( '#' )
        url = url.split( '#', 2 )[0...-1].join
    end

    c_url = url.dup

    components = {
        scheme:   nil,
        userinfo: nil,
        host:     nil,
        port:     nil,
        path:     nil,
        query:    nil
    }

    valid_schemes = %w(http https)

    begin
        if (v = cache[url]) && v == :err
            return
        elsif v
            return v
        end

        # We're not smart enough for scheme-less URLs and if we're to go
        # into heuristics then there's no reason to not just use
        # Addressable's parser.
        if url.start_with?( '//' )
            return cache[c_url] = addressable_parse( c_url ).freeze
        end

        url = url.recode!
        url = html_decode( url )

        dupped_url = url.dup
        has_path = true

        splits = url.split( ':' )
        if !splits.empty? && valid_schemes.include?( splits.first.downcase )
            splits = url.split( '://', 2 )
            components[:scheme] = splits.shift
            components[:scheme].downcase! if components[:scheme]

            if url = splits.shift
                splits = url.to_s.split( '?' ).first.to_s.split( '@', 2 )

                if splits.size > 1
                    components[:userinfo] = splits.first
                    url = splits.shift
                end

                if !splits.empty?
                    splits = splits.last.split( '/', 2 )
                    url = splits.last

                    splits = splits.first.split( ':', 2 )
                    if splits.size == 2
                        host = splits.first

                        if splits.last && !splits.last.empty?
                            components[:port] = Integer( splits.last )
                        end

                        if components[:port] == 80
                            components[:port] = nil
                        end

                        url.gsub!( ':' + components[:port].to_s, '' )
                    else
                        host = splits.last
                    end

                    if components[:host] = host
                        url.gsub!( host, '' )
                        components[:host].downcase!
                    end
                else
                    has_path = false
                end
            else
                has_path = false
            end
        end

        if has_path
            splits = url.split( '?', 2 )
            if (components[:path] = splits.shift)
                if components[:scheme]
                    components[:path] = '/' + components[:path]
                end

                components[:path].gsub!( /\/+/, '/' )

                # Remove path params
                components[:path] = components[:path].split( ';', 2 ).first

                if components[:path]
                    components[:path] =
                        encode( decode( components[:path] ),
                                Addressable::URI::CharacterClasses::PATH )

                    components[:path] = ::URI.encode( components[:path], ';' )
                end
            end

            if c_url.include?( '?' ) &&
                !(query = dupped_url.split( '?', 2 ).last).empty?

                components[:query] = (query.split( '&', -1 ).map do |pair|
                    encode( decode( pair ),
                            Addressable::URI::CharacterClasses::QUERY.sub( '\\&', '' ) )
                end).join( '&' )
            end
        end

        components[:path] ||= components[:scheme] ? '/' : nil

        components.values.each(&:freeze)

        cache[c_url] = components.freeze
    rescue => e
        begin
            print_debug "Failed to fast-parse '#{c_url}', falling back to slow-parse."
            print_debug "Error: #{e}"
            print_debug_backtrace( e )

            cache[c_url] = addressable_parse( c_url.recode! ).freeze
        rescue => ex
            print_debug "Failed to parse '#{c_url}'."
            print_debug "Error: #{ex}"
            print_debug_backtrace( ex )

            cache[c_url] = :err
            nil
        end
    end
end
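
Two of the normalization steps in the listing above, discarding the fragment and cleaning the path, can be sketched as standalone helpers. `strip_fragment` and `clean_path` are hypothetical names used only for this illustration; they are not part of Arachni.

```ruby
# Drop everything from the first '#' onwards.
def strip_fragment(url)
  url.split('#', 2).first.to_s
end

# Collapse repeated slashes, then drop path parameters (anything after ';').
def clean_path(path)
  path.gsub(%r{/+}, '/').split(';', 2).first.to_s
end

puts strip_fragment('http://example.com/a?b=c#top') # http://example.com/a?b=c
puts clean_path('//dir///page;jsessionid=123')      # /dir/page
```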

.full_and_absolute?(url) ⇒ Bool

Returns `true` if the URL is full and absolute, `false` otherwise.

Parameters:

  • url (String)

    URL to check.

Returns:

  • (Bool)

    `true` if the URL is full and absolute, `false` otherwise.



# File 'lib/arachni/uri.rb', line 477

def full_and_absolute?( url )
    return false if url.to_s.empty?

    parsed = parse( url.to_s )
    return false if !parsed

    parsed.absolute?
end
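
The check ultimately relies on `absolute?`, which this class delegates to Ruby's URI object. A minimal standalone sketch of the same check, using only the standard library (so without Arachni's normalization or caching), might look like this:

```ruby
require 'uri'

# Standalone approximation: an empty or unparsable URL is not absolute.
def full_and_absolute?(url)
  return false if url.to_s.empty?

  ::URI.parse(url.to_s).absolute?
rescue ::URI::InvalidURIError
  false
end

puts full_and_absolute?('http://example.com/a') # true
puts full_and_absolute?('/a/b')                 # false
```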

.normalize(url) ⇒ String

Note:

This method’s results are cached for performance reasons. If you plan on doing something destructive with its return value duplicate it first because there may be references to it elsewhere.

Uses fast_parse to parse and normalize the URL and then converts it to a common String format.

Parameters:

  • url (String)

Returns:

  • (String)

    Normalized URL (frozen).



# File 'lib/arachni/uri.rb', line 406

def normalize( url )
    return if !url || url.empty?

    cache = CACHE[__method__]

    url   = url.to_s.strip
    c_url = url.dup

    begin
        if (v = cache[url]) && v == :err
            return
        elsif v
            return v
        end

        components = fast_parse( url )

        normalized = ''
        normalized << components[:scheme] + '://' if components[:scheme]

        if components[:userinfo]
            normalized << components[:userinfo]
            normalized << '@'
        end

        if components[:host]
            normalized << components[:host]
            normalized << ':' + components[:port].to_s if components[:port]
        end

        normalized << components[:path] if components[:path]
        normalized << '?' + components[:query] if components[:query]

        cache[c_url] = normalized.freeze
    rescue => e
        print_debug "Failed to normalize '#{c_url}'."
        print_debug "Error: #{e}"
        print_debug_backtrace( e )

        cache[c_url] = :err
        nil
    end
end
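
The string assembly at the heart of the listing above can be exercised on its own. The sketch below assumes a hand-written components Hash and a hypothetical `assemble` helper; parsing, caching and error handling are left out.

```ruby
# Join URL components back into a single String, in the same order
# as the listing above: scheme, userinfo, host[:port], path, query.
def assemble(components)
  normalized = String.new
  normalized << "#{components[:scheme]}://" if components[:scheme]

  if components[:userinfo]
    normalized << components[:userinfo] << '@'
  end

  if components[:host]
    normalized << components[:host]
    normalized << ":#{components[:port]}" if components[:port]
  end

  normalized << components[:path] if components[:path]
  normalized << "?#{components[:query]}" if components[:query]
  normalized.freeze
end

url = assemble({ scheme: 'http', host: 'example.com', port: 8080,
                 path: '/a', query: 'b=c' })
puts url # http://example.com:8080/a?b=c
```

The result is frozen, matching the documented return value, so a caller that wants to modify it needs `url.dup` first.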

.parse(url) ⇒ Object

Note:

This method’s results are cached for performance reasons. If you plan on doing something destructive with its return value duplicate it first because there may be references to it elsewhere.

Cached version of #initialize. If there is a chance that the same URL will need to be parsed multiple times, use this method.



# File 'lib/arachni/uri.rb', line 107

def parse( url )
    return url if !url || url.is_a?( Arachni::URI )
    CACHE[__method__][url] ||= begin
        new( url )
    rescue => e
        print_debug "Failed to parse '#{url}'."
        print_debug "Error: #{e}"
        print_debug_backtrace( e )
        nil
    end
end

.parse_query(url) ⇒ Hash

Extracts inputs from a URL query.

Parameters:

  • url (String)

Returns:

  • (Hash)



# File 'lib/arachni/uri.rb', line 465

def parse_query( url )
    parsed = parse( url )
    return {} if !parsed

    parse( url ).query_parameters
end

.parser ⇒ URI::Parser

Returns cached URI parser.

Returns:

  • (URI::Parser)

    cached URI parser



# File 'lib/arachni/uri.rb', line 72

def parser
    CACHE[__method__]
end

.rewrite(url, rules = Arachni::Options.scope.url_rewrites) ⇒ String

Returns the rewritten URL.

Parameters:

  • url (String)
  • rules (Hash<Regexp => String>) (defaults to: Arachni::Options.scope.url_rewrites)

    Regular expression and substitution pairs.

Returns:

  • (String)

    Rewritten URL.



# File 'lib/arachni/uri.rb', line 456

def rewrite( url, rules = Arachni::Options.scope.url_rewrites )
    parse( url ).rewrite( rules ).to_s
end

.ruby_parse(url) ⇒ URI

Note:

This method’s results are cached for performance reasons. If you plan on doing something destructive with its return value duplicate it first because there may be references to it elsewhere.

Normalizes `url` and uses Ruby’s core URI lib to parse it.

Parameters:

  • url (String)

    URL to parse

Returns:

  • (URI)



# File 'lib/arachni/uri.rb', line 129

def ruby_parse( url )
    return url if url.to_s.empty? || url.is_a?( ::URI )
    return if url.downcase.start_with? 'javascript:'

    CACHE[__method__][url] ||= begin
        ::URI::Generic.build( fast_parse( url ) )
    rescue
        begin
            parser.parse( normalize( url ).dup )
        rescue => e
            print_debug "Failed to parse '#{url}'."
            print_debug "Error: #{e}"
            print_debug_backtrace( e )
            nil
        end
    end
end

.to_absolute(relative, reference = Options.instance.url.to_s) ⇒ String

Note:

This method’s results are cached for performance reasons. If you plan on doing something destructive with its return value duplicate it first because there may be references to it elsewhere.

Normalizes and converts a `relative` URL to an absolute one by merging it with a `reference` URL.

Pretty much a cached version of #to_absolute.

Parameters:

  • relative (String)
  • reference (String) (defaults to: Options.instance.url.to_s)

    Absolute url to use as a reference.

Returns:

  • (String)

    Absolute URL (frozen).



# File 'lib/arachni/uri.rb', line 368

def to_absolute( relative, reference = Options.instance.url.to_s )
    return reference if !relative || relative.empty?
    key = relative + ' :: ' + reference

    cache = CACHE[__method__]
    begin
        if (v = cache[key]) && v == :err
            return
        elsif v
            return v
        end

        parsed_ref = parse( reference )

        if relative.start_with?( '//' )
            # Scheme-less URLs are expensive to parse so let's resolve
            # the issue here.
            relative = "#{parsed_ref.scheme}:#{relative}"
        end

        cache[key] = parse( relative ).to_absolute( parsed_ref ).to_s.freeze
    rescue
        cache[key] = :err
        nil
    end
end
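
The resolution logic above, including the shortcut for scheme-less (`//host/...`) URLs, can be approximated with Ruby's standard `URI#merge`. This is a simplified sketch: it skips Arachni's normalization, caching and error handling, and the standalone `to_absolute` function is defined here only for the example.

```ruby
require 'uri'

# Resolve +relative+ against +reference+ using the standard library.
def to_absolute(relative, reference)
  return reference if relative.nil? || relative.empty?

  parsed_ref = ::URI.parse(reference)

  # Scheme-less URLs borrow the scheme of the reference.
  relative = "#{parsed_ref.scheme}:#{relative}" if relative.start_with?('//')

  parsed_ref.merge(relative).to_s
end

base = 'http://example.com/dir/page.html'
puts to_absolute('img/logo.png', base) # http://example.com/dir/img/logo.png
puts to_absolute('/top', base)         # http://example.com/top
puts to_absolute('//cdn.example.com/x.js', 'https://example.com/a/b')
# https://cdn.example.com/x.js
```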

Instance Method Details

#==(other) ⇒ Object



# File 'lib/arachni/uri.rb', line 527

def ==( other )
    to_s == other.to_s
end

#_dump(_) ⇒ Object



# File 'lib/arachni/uri.rb', line 657

def _dump( _ )
    to_s
end

#domain ⇒ String

Returns `domain_name.tld`.

Returns:

  • (String)

    `domain_name.tld`



# File 'lib/arachni/uri.rb', line 590

def domain
    return if !host
    return host if ip_address?

    s = host.split( '.' )
    return s.first if s.size == 1
    return host if s.size == 2

    s[1..-1].join( '.' )
end
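
To see what the listing above returns for different hosts, the same steps can be run against plain strings. `domain_for` is a hypothetical standalone helper written for this sketch; it takes the host directly instead of reading it from a URI object.

```ruby
require 'ipaddr'

# Same steps as the listing above, applied to a host String.
def domain_for(host)
  return if !host
  # IP addresses are returned untouched.
  return host if (IPAddr.new(host) rescue nil)

  s = host.split('.')
  return s.first if s.size == 1
  return host if s.size == 2

  # Three or more labels: drop the leftmost one.
  s[1..-1].join('.')
end

puts domain_for('www.example.com') # example.com
puts domain_for('example.com')     # example.com
puts domain_for('localhost')       # localhost
puts domain_for('127.0.0.1')       # 127.0.0.1
puts domain_for('a.b.example.com') # b.example.com
```

As the last call shows, only the leftmost label is removed, so hosts with several subdomain levels keep all but the first.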

#dup ⇒ Object



# File 'lib/arachni/uri.rb', line 653

def dup
    self.class.new( to_s )
end

#hash ⇒ Object



# File 'lib/arachni/uri.rb', line 665

def hash
    to_s.hash
end

#ip_address? ⇒ Boolean

Returns `true` if the URI contains an IP address, `false` otherwise.

Returns:

  • (Boolean)

    `true` if the URI contains an IP address, `false` otherwise.



# File 'lib/arachni/uri.rb', line 620

def ip_address?
    !(IPAddr.new( host ) rescue nil).nil?
end

#mailto? ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/arachni/uri.rb', line 624

def mailto?
    scheme == 'mailto'
end

#persistent_hash ⇒ Object



# File 'lib/arachni/uri.rb', line 669

def persistent_hash
    to_s.persistent_hash
end

#query=(q) ⇒ Object



# File 'lib/arachni/uri.rb', line 628

def query=( q )
    q = q.to_s
    q = nil if q.empty?

    @parsed_url.query = q
end

#query_parameters ⇒ Hash

Returns the inputs extracted from the URL query.

Returns:

  • (Hash)

    Extracted inputs from a URL query.



# File 'lib/arachni/uri.rb', line 637

def query_parameters
    q = self.query
    return {} if q.to_s.empty?

    q.split( '&' ).inject( {} ) do |h, pair|
        name, value = pair.split( '=', 2 )
        h[::URI.decode( name.to_s )] = ::URI.decode( value.to_s )
        h
    end
end
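
The listing above relies on `::URI.decode`, which is not available in recent Ruby versions. The following sketch reproduces the same split-and-decode steps as a standalone function, substituting `URI.decode_www_form_component` as an assumption; unlike the original call, that method also turns `+` into a space.

```ruby
require 'uri'

# Split a query String into a name => value Hash.
def query_parameters(query)
  return {} if query.to_s.empty?

  query.split('&').each_with_object({}) do |pair, h|
    name, value = pair.split('=', 2)
    h[URI.decode_www_form_component(name.to_s)] =
      URI.decode_www_form_component(value.to_s)
  end
end

params = query_parameters('name=John%20Doe&empty=&flag')
p params # {"name"=>"John Doe", "empty"=>"", "flag"=>""}
```

Parameters without a value, or without an `=` at all, end up with an empty String as their value.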

#resource_extension ⇒ String?

Returns the extension of the URI #resource_name, `nil` if there is none.

Returns:

  • (String, nil)

    The extension of the URI #resource_name, `nil` if there is none.



# File 'lib/arachni/uri.rb', line 566

def resource_extension
    name = resource_name.to_s
    return if !name.include?( '.' )

    name.split( '.' ).last
end

#resource_name ⇒ String

Returns the name of the resource.

Returns:

  • (String)

    Name of the resource.



# File 'lib/arachni/uri.rb', line 560

def resource_name
    path.split( '/' ).last
end
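
A short standalone sketch of how #resource_name and #resource_extension relate, assuming the path is passed in as a plain String rather than read from a URI object:

```ruby
# Last segment of the path.
def resource_name(path)
  path.split('/').last
end

# Text after the final '.' of the resource name, or nil without a '.'.
def resource_extension(path)
  name = resource_name(path).to_s
  return if !name.include?('.')

  name.split('.').last
end

puts resource_name('/dir/archive.tar.gz')      # archive.tar.gz
puts resource_extension('/dir/archive.tar.gz') # gz
p resource_extension('/dir/')                  # nil
```

Note that only the part after the last dot counts as the extension, and that a trailing slash is ignored by `split`, so `/dir/` yields `dir` as the resource name.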

#respond_to?(*args) ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/arachni/uri.rb', line 683

def respond_to?( *args )
    super || @parsed_url.respond_to?( *args )
end

#rewrite(rules = Arachni::Options.scope.url_rewrites) ⇒ URI

Returns the rewritten URL.

Parameters:

  • rules (Hash<Regexp => String>) (defaults to: Arachni::Options.scope.url_rewrites)

    Regular expression and substitution pairs.

Returns:

  • (URI)

    Rewritten URL.



# File 'lib/arachni/uri.rb', line 606

def rewrite( rules = Arachni::Options.scope.url_rewrites )
    as_string = self.to_s

    rules.each do |args|
        if (rewritten = as_string.gsub( *args )) != as_string
            return Arachni::URI( rewritten )
        end
    end

    self.dup
end
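
The rule format (`Hash<Regexp => String>`) and the first-match-wins behaviour are easier to see with concrete values. The sketch below works on plain Strings; the rule and URLs are invented for the example, and the standalone `rewrite` function is not Arachni's method.

```ruby
# Apply the first rule that actually changes the URL; otherwise
# return an unchanged copy.
def rewrite(url, rules)
  rules.each do |pattern, substitution|
    rewritten = url.gsub(pattern, substitution)
    return rewritten if rewritten != url
  end

  url.dup
end

# Example rule: map a "pretty" URL back to its query-string form.
rules = { %r{/articles/(\d+)} => '/articles.php?id=\1' }

puts rewrite('http://example.com/articles/42', rules)
# http://example.com/articles.php?id=42
puts rewrite('http://example.com/about', rules)
# http://example.com/about
```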

#scope ⇒ Scope

Returns:

  • (Scope)



# File 'lib/arachni/uri.rb', line 523

def scope
    @scope ||= Scope.new( self )
end

#to_absolute(reference) ⇒ Arachni::URI

Converts self into an absolute URL using `reference` to fill in the missing data.

Parameters:

  • reference (Arachni::URI, URI, String)

Returns:

  • (Arachni::URI)



# File 'lib/arachni/uri.rb', line 539

def to_absolute( reference )
    absolute = case reference
                   when Arachni::URI
                       reference.parsed_url
                   when ::URI
                       reference
                   else
                       self.class.new( reference.to_s ).parsed_url
               end.merge( @parsed_url )

    self.class.new( absolute )
end

#to_s ⇒ String

Returns:

  • (String)



# File 'lib/arachni/uri.rb', line 649

def to_s
    @parsed_url.to_s
end

#up_to_path ⇒ String

Returns the URL up to its path component (no resource name, query, fragment, etc).

Returns:

  • (String)

    The URL up to its path component (no resource name, query, fragment, etc).



# File 'lib/arachni/uri.rb', line 575

def up_to_path
    return if !path
    uri_path = path.dup

    uri_path = File.dirname( uri_path ) if !File.extname( path ).empty?

    uri_path << '/' if uri_path[-1] != '/'

    uri_str = "#{scheme}://#{host}"
    uri_str << ':' + port.to_s if port && port != 80
    uri_str << uri_path
end
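
The effect of the listing above is easiest to see on sample URLs. This sketch reruns the same steps on top of Ruby's standard `URI.parse`, as a standalone function taking a String. Because the standard parser fills in default ports, the examples stick to `http` URLs; other schemes may behave differently here than in Arachni.

```ruby
require 'uri'

# Same steps as the listing above, using the standard URI parser.
def up_to_path(url)
  uri = ::URI.parse(url)
  return if !uri.path

  uri_path = uri.path.dup
  # A path ending in "name.ext" is trimmed back to its directory.
  uri_path = File.dirname(uri_path) if !File.extname(uri.path).empty?
  uri_path << '/' if uri_path[-1] != '/'

  uri_str = "#{uri.scheme}://#{uri.host}"
  uri_str << ":#{uri.port}" if uri.port && uri.port != 80
  uri_str << uri_path
end

puts up_to_path('http://example.com/dir/page.html?x=1') # http://example.com/dir/
puts up_to_path('http://example.com:8080/dir/sub')      # http://example.com:8080/dir/sub/
```

A last segment without an extension is treated as a directory, which is why the second result keeps `sub` and only gains a trailing slash.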

#without_query ⇒ String

Returns the URL up to and including its resource component (that is, without the query, fragment, etc).

Returns:

  • (String)

    The URL up to and including its resource component (that is, without the query, fragment, etc).



# File 'lib/arachni/uri.rb', line 554

def without_query
    to_s.split( '?', 2 ).first.to_s
end