Class: Arachni::URI

Inherits:
Object
Extended by:
Arachni::UI::Output, Utilities
Includes:
Arachni::UI::Output, Utilities
Defined in:
lib/arachni/uri.rb

Overview

The URI class automatically normalizes the URLs it is passed to parse while maintaining compatibility with Ruby’s URI core classes by delegating missing methods to them. You can therefore treat it like a Ruby URI and enjoy some extra perks along the way.

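It also provides cached helper class methods (cached to keep latency low) for common operations such as parsing, normalization and conversion of relative URLs to absolute ones; see the Class Method Details below.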

Constant Summary

CACHE_SIZES =
{
    parse:       600,
    ruby_parse:  600,
    cheap_parse: 600,
    normalize:   1000,
    to_absolute: 1000
}
CACHE =
{
    parser:      ::URI::Parser.new,
    ruby_parse:  Cache::RandomReplacement.new( CACHE_SIZES[:ruby_parse] ),
    parse:       Cache::RandomReplacement.new( CACHE_SIZES[:parse] ),
    cheap_parse: Cache::RandomReplacement.new( CACHE_SIZES[:cheap_parse] ),
    normalize:   Cache::RandomReplacement.new( CACHE_SIZES[:normalize] ),
    to_absolute: Cache::RandomReplacement.new( CACHE_SIZES[:to_absolute] )
}
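
As an illustration of the caching strategy named in CACHE, here is a minimal sketch of a size-capped cache with random replacement. The class name TinyRandomReplacementCache is hypothetical and this is not the implementation of Cache::RandomReplacement; it only shows the general idea of evicting a randomly chosen entry once the cap is reached.

```ruby
# Hypothetical stand-in for Cache::RandomReplacement (assumption: the real
# class may differ in API and eviction details).
class TinyRandomReplacementCache
  def initialize(max_size)
    @max_size = max_size
    @store    = {}
  end

  def [](key)
    @store[key]
  end

  def []=(key, value)
    # Evict one randomly chosen entry when a new key would exceed the cap.
    if !@store.key?(key) && @store.size >= @max_size
      @store.delete(@store.keys.sample)
    end
    @store[key] = value
  end

  def size
    @store.size
  end
end

cache = TinyRandomReplacementCache.new(2)
cache['a'] = 1
cache['b'] = 2
cache['c'] = 3 # one of 'a' or 'b' is evicted at random
```

Random replacement needs no bookkeeping about access order, which keeps lookups and insertions cheap at the cost of occasionally evicting a hot entry.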

Methods included from Arachni::UI::Output

debug?, debug_off, debug_on, disable_only_positives, flush_buffer, mute, muted?, old_reset_output_options, only_positives, only_positives?, print_bad, print_debug, print_debug_backtrace, print_debug_pp, print_error, print_error_backtrace, print_info, print_line, print_ok, print_status, print_verbose, reroute_to_file, reroute_to_file?, reset_output_options, set_buffer_cap, uncap_buffer, unmute, verbose, verbose?

Methods included from Utilities

cookie_encode, cookies_from_document, cookies_from_file, cookies_from_response, exception_jail, exclude_path?, extract_domain, form_decode, form_encode, form_parse_request_body, forms_from_document, forms_from_response, get_path, hash_keys_to_str, html_decode, html_encode, include_path?, links_from_document, links_from_response, normalize_url, page_from_response, page_from_url, parse_query, parse_set_cookie, parse_url_vars, path_in_domain?, path_too_deep?, remove_constants, seed, skip_path?, uri_decode, uri_encode, uri_parse, uri_parser, url_sanitize

Constructor Details

#initialize(url) ⇒ URI

Normalizes and parses the provided URL.

Will discard the fragment component, if there is one.

Parameters:

  • url (Arachni::URI, String, URI, Hash)

    String URL to parse, URI to convert, or a Hash holding URL components (for ::URI::Generic.build). Also accepts Arachni::URI for convenience.



# File 'lib/arachni/uri.rb', line 447

def initialize( url )
    @arachni_opts = Options.instance

    @parsed_url = case url
                      when String
                          self.class.ruby_parse( url )

                      when ::URI
                          url.dup

                      when Hash
                          ::URI::Generic.build( url )

                      when Arachni::URI
                          # the result of the case expression is assigned to @parsed_url
                          url.parsed_url.dup

                      else
                          to_string = url.to_s rescue ''
                          msg = "Argument must either be String, URI or Hash"
                          msg << " -- #{url.class.name} '#{to_string}' passed."
                          fail TypeError.new( msg )
                  end
    fail 'Failed to parse URL.' if !@parsed_url
end
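
The type dispatch performed by the constructor can be sketched with Ruby's standard library alone. The helper name coerce_to_uri is hypothetical, and the sketch skips Arachni's normalization, caching and fragment removal; it only shows how a String, a URI or a Hash of components can each end up as a URI object.

```ruby
require 'uri'

# Hypothetical helper (not part of Arachni): turns the supported argument
# types into a ::URI object, raising TypeError for anything else.
def coerce_to_uri(url)
  case url
  when String then ::URI.parse(url)
  when ::URI  then url.dup
  when Hash   then ::URI::Generic.build(url)
  else
    raise TypeError, "Argument must be a String, URI or Hash -- #{url.class} passed."
  end
end

from_string = coerce_to_uri('http://example.com/a')
from_hash   = coerce_to_uri(scheme: 'http', host: 'example.com', path: '/a')
from_uri    = coerce_to_uri(from_string)
```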

Dynamic Method Handling

This class handles dynamic methods through the method_missing method.

#method_missing(sym, *args, &block) ⇒ Object (private)

Delegates unimplemented methods to Ruby’s URI::Generic class for compatibility.



# File 'lib/arachni/uri.rb', line 606

def method_missing( sym, *args, &block )
    if @parsed_url.respond_to?( sym )
        @parsed_url.send( sym, *args, &block )
    else
        super
    end
end
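
The delegation pattern above can be tried in isolation with a small wrapper around Ruby's own URI. WrappedURI is a hypothetical class used only for illustration; the respond_to_missing? override is an addition that is not shown in the source above, included here so that respond_to? agrees with what method_missing accepts.

```ruby
require 'uri'

# Hypothetical wrapper (not Arachni::URI) that forwards unknown methods to
# the wrapped ::URI object.
class WrappedURI
  def initialize(url)
    @parsed_url = ::URI.parse(url)
  end

  def to_s
    @parsed_url.to_s
  end

  # Keeps respond_to? consistent with the delegation below.
  def respond_to_missing?(sym, include_private = false)
    @parsed_url.respond_to?(sym, include_private) || super
  end

  private

  def method_missing(sym, *args, &block)
    if @parsed_url.respond_to?(sym)
      @parsed_url.send(sym, *args, &block)
    else
      super
    end
  end
end

wrapped = WrappedURI.new('http://example.com:8080/path?q=1')
wrapped.host  # forwarded to the wrapped ::URI
wrapped.port  # forwarded to the wrapped ::URI
```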

Class Method Details

.addressable_parse(url) ⇒ Hash

Performs a parse using the Addressable::URI library while normalizing the URL (will also discard the fragment).

This method is not cached and solely exists as a fallback used by cheap_parse.

Parameters:

  • url (String)

Returns:

  • (Hash)

    URL components: Hash fields:

    • scheme – HTTP or HTTPS

    • userinfo – username:password

    • host

    • port

    • path

    • query

    The Hash is suitable for passing to ::URI::Generic.build. If you plan on doing that, however, you are better off using ruby_parse, which does the same thing and caches the results for extra speed.



# File 'lib/arachni/uri.rb', line 326

def self.addressable_parse( url )
    u = Addressable::URI.parse( html_decode( url.to_s ) ).normalize
    u.fragment = nil
    h = u.to_hash

    h[:path].gsub!( /\/+/, '/' ) if h[:path]
    if h[:user]
        h[:userinfo] = h.delete( :user )
        h[:userinfo] << ":#{h.delete( :password )}" if h[:password]
    end
    h
end
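
To show what "suitable for passing to ::URI::Generic.build" means in practice, here is a small sketch using only the standard library. The component values are made up for illustration and are not output captured from addressable_parse.

```ruby
require 'uri'

# Illustrative component Hash using the field names documented above
# (the values are assumptions, not real parser output).
components = {
  scheme: 'http',
  host:   'example.com',
  port:   8080,
  path:   '/a',
  query:  'b=1'
}.freeze

# Generic.build only reads the Hash, so a frozen one is fine.
uri = ::URI::Generic.build(components)
```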

.cheap_parse(url) ⇒ Hash

Performs a parse that is less resource intensive than the one performed by Ruby’s URI library, while normalizing the URL (will also discard the fragment).

ATTENTION: This method’s results are cached for performance reasons. If you plan on doing something destructive with its return value, duplicate it first because there may be references to it elsewhere.

Parameters:

  • url (String)

Returns:

  • (Hash)

    URL components (frozen): Hash fields:

    • scheme – HTTP or HTTPS

    • userinfo – username:password

    • host

    • port

    • path

    • query

    The Hash is suitable for passing to ::URI::Generic.build. If you plan on doing that, however, you are better off using ruby_parse, which does the same thing and caches the results for extra speed.



# File 'lib/arachni/uri.rb', line 183

def self.cheap_parse( url )
    return if !url || url.empty?

    cache = CACHE[__method__]

    url   = url.to_s.dup
    c_url = url.to_s.dup

    components = {
        scheme:   nil,
        userinfo: nil,
        host:     nil,
        port:     nil,
        path:     nil,
        query:    nil
    }

    valid_schemes = %w(http https)

    begin
        if (v = cache[url]) && v == :err
            return
        elsif v
            return v
        end

        # we're not smart enough for scheme-less URLs and if we're to go
        # into heuristics then there's no reason to not just use Addressable's parser
        if url.start_with?( '//' )
            return cache[c_url] = addressable_parse( c_url ).freeze
        end

        url = url.encode( 'UTF-8', undef: :replace, invalid: :replace )

        # remove the fragment if there is one
        url = url.split( '#', 2 )[0...-1].join if url.include?( '#' )

        url = html_decode( url )

        dupped_url = url.dup
        has_path = true

        splits = url.split( ':' )
        if !splits.empty? && valid_schemes.include?( splits.first.downcase )
            splits = url.split( '://', 2 )
            components[:scheme] = splits.shift
            components[:scheme].downcase! if components[:scheme]

            if url = splits.shift
                splits = url.split( '?' ).first.split( '@', 2 )

                if splits.size > 1
                    components[:userinfo] = splits.first
                    url = splits.shift
                end

                if !splits.empty?
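                    splits = splits.last.split( '/', 2 )
                    # Anything after the first '/' is the path; a bare
                    # authority ("host" or "host:port") has none.
                    url = splits.size == 2 ? splits.last : ''

                    splits = splits.first.split( ':', 2 )
                    if splits.size == 2
                        host = splits.first
                        components[:port] = Integer( splits.last ) if splits.last && !splits.last.empty?
                        components[:port] = nil if components[:port] == 80
                    else
                        host = splits.last
                    end

                    # The host and port were already split off above, so they
                    # need not be gsub'ed out of the remainder (doing so could
                    # mangle a path that happens to contain them, and dropping
                    # port 80 used to strip every ':' from it).
                    if components[:host] = host
                        components[:host].downcase!
                    end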
                else
                    has_path = false
                end
            else
                has_path = false
            end
        end

        if has_path
            splits = url.split( '?', 2 )
            if components[:path] = splits.shift
                components[:path] = '/' + components[:path] if components[:scheme]
                components[:path].gsub!( /\/+/, '/' )
                components[:path] =
                    encode( decode( components[:path] ),
                            Addressable::URI::CharacterClasses::PATH )
            end

            if c_url.include?( '?' ) && !(query = dupped_url.split( '?', 2 ).last).empty?
                components[:query] =
                    encode( decode( query ),
                            Addressable::URI::CharacterClasses::QUERY )
            end
        end

        components[:path] ||= components[:scheme] ? '/' : nil

        cache[c_url] = components.inject({}) do |h, (k, val)|
            h.merge!( Hash[{ k => val.freeze }] )
        end.freeze
    rescue => e
        begin
            print_error "Failed to fast-parse '#{c_url}', falling back to slow-parse."
            #print_error "Error: #{e}"
            #print_error_backtrace( e )

            cache[c_url] = addressable_parse( c_url ).freeze
        rescue => ex
            print_error "Failed to parse '#{c_url}'."
            #print_error "Error: #{ex}"
            #print_error_backtrace( ex )

            cache[c_url] = :err
            nil
        end
    end
end
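
The caching warning above is easiest to see with a frozen Hash. This sketch uses a hand-written Hash as a stand-in for a cached cheap_parse result (the values are assumptions); it shows that the shared object rejects mutation and that dup yields a private, modifiable copy.

```ruby
# Stand-in for a cached, frozen result (values are illustrative).
components = {
  scheme: 'http',
  host:   'example.com',
  port:   nil,
  path:   '/',
  query:  nil
}.freeze

# dup returns an unfrozen shallow copy; the cached Hash stays untouched.
copy = components.dup
copy[:path] = '/admin'

mutation_rejected = begin
  components[:path] = '/oops'
  false
rescue StandardError
  true
end
```

Note that dup is shallow: the individual values in the copy are still the same (frozen) objects, so replace them rather than modifying them in place.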

.decode(string) ⇒ String

URL decodes a string.

Parameters:

  • string (String)

Returns:

  • (String)



# File 'lib/arachni/uri.rb', line 95

def self.decode( string )
    Addressable::URI.unencode( string )
end

.deep_decode(string) ⇒ String

Iteratively URL decodes a String until there are no more characters to be unescaped.

Parameters:

  • string (String)

Returns:

  • (String)



# File 'lib/arachni/uri.rb', line 107

def self.deep_decode( string )
    string = decode( string ) while string =~ /%[a-fA-F0-9]{2}/
    # return the decoded String; a `while` modifier on its own evaluates to nil
    string
end
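
Iterative decoding matters for multiply-encoded input such as %2520, which needs two passes to become a space. The sketch below is self-contained: percent_decode is a hypothetical stand-in for the real decoder (it is not Addressable's implementation and only handles single-byte ASCII escapes), and the loop mirrors the approach described above.

```ruby
# Hypothetical single-pass decoder (assumption: ASCII escapes only).
def percent_decode(string)
  string.gsub(/%([a-fA-F0-9]{2})/) { Regexp.last_match(1).hex.chr }
end

# Keeps decoding until no percent-escape is left, then returns the result.
def deep_decode(string)
  string = percent_decode(string) while string =~ /%[a-fA-F0-9]{2}/
  string
end

once  = percent_decode('a%252Fb') # one pass leaves an escape behind
fully = deep_decode('a%252Fb')    # repeated passes remove all escapes
```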

.encode(string, bad_characters = nil) ⇒ String

URL encodes a string.

Parameters:

  • string (String)
  • bad_characters (String, Regexp) (defaults to: nil)

    Class of characters to encode – if a String is passed, it should be formatted as a regexp (for Regexp.new).

Returns:

  • (String)



# File 'lib/arachni/uri.rb', line 84

def self.encode( string, bad_characters = nil )
    Addressable::URI.encode_component( *[string, bad_characters].compact )
end

.normalize(url) ⇒ String

Uses cheap_parse to parse and normalize the URL and then converts it to a common String format.

ATTENTION: This method’s results are cached for performance reasons. If you plan on doing something destructive with its return value, duplicate it first because there may be references to it elsewhere.

Parameters:

  • url (String)

Returns:

  • (String)

    normalized URL (frozen)



# File 'lib/arachni/uri.rb', line 393

def self.normalize( url )
    return if !url || url.empty?

    cache = CACHE[__method__]

    url   = url.to_s.strip.dup
    c_url = url.to_s.strip.dup

    begin
        if (v = cache[url]) && v == :err
            return
        elsif v
            return v
        end

        components = cheap_parse( url )

        #ap components
        normalized = ''
        normalized << components[:scheme] + '://' if components[:scheme]

        if components[:userinfo]
            normalized << components[:userinfo]
            normalized << '@'
        end

        if components[:host]
            normalized << components[:host]
            normalized << ':' + components[:port].to_s if components[:port]
        end

        normalized << components[:path] if components[:path]
        normalized << '?' + components[:query] if components[:query]

        cache[c_url] = normalized.freeze
    rescue => e
        print_error "Failed to normalize '#{c_url}'."
        #print_error "Error: #{e}"
        #print_error_backtrace( e )

        cache[c_url] = :err
        nil
    end
end
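
The reassembly step of normalize can be exercised on its own. The function below is a simplified sketch with a hypothetical name (assemble); it takes an already-parsed component Hash, so it leaves out the parsing, caching and error handling shown above.

```ruby
# Hypothetical helper: joins URL components into a frozen String, in the
# same order as the method above (scheme, userinfo, host, port, path, query).
def assemble(components)
  url = +''
  url << "#{components[:scheme]}://" if components[:scheme]
  url << "#{components[:userinfo]}@" if components[:userinfo]

  if components[:host]
    url << components[:host]
    url << ":#{components[:port]}" if components[:port]
  end

  url << components[:path] if components[:path]
  url << "?#{components[:query]}" if components[:query]
  url.freeze
end

full    = assemble(scheme: 'http', host: 'example.com', port: 8080, path: '/a', query: 'b=1')
with_ui = assemble(scheme: 'https', userinfo: 'u:p', host: 'h', path: '/')
```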

.parse(url) ⇒ Object

Cached version of #initialize. Use this method if the same URL is likely to be parsed multiple times.

ATTENTION: This method’s results are cached for performance reasons. If you plan on doing something destructive with its return value, duplicate it first because there may be references to it elsewhere.

See Also:

  • #initialize



# File 'lib/arachni/uri.rb', line 121

def self.parse( url )
    return url if !url || url.is_a?( Arachni::URI )
    CACHE[__method__][url] ||= begin
        new( url )
    rescue => e
        print_error "Failed to parse '#{url}'."
        #print_error "Error: #{e}"
        #print_error_backtrace( e )
        nil
    end
end
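
The `||=` memoization used above means repeated calls hand back the very same object, which is why the documentation asks callers to duplicate before modifying. A minimal sketch, assuming a plain Hash as the cache and Ruby's own URI as the parser (PARSE_CACHE and cached_parse are hypothetical names):

```ruby
require 'uri'

# Hypothetical cache; the real one is a Cache::RandomReplacement instance.
PARSE_CACHE = {}

def cached_parse(url)
  # ||= only stores truthy results, so a nil (failed) parse is retried
  # on the next call instead of being remembered.
  PARSE_CACHE[url] ||= ::URI.parse(url)
end

first  = cached_parse('http://example.com/a')
second = cached_parse('http://example.com/a') # same object as `first`

# Work on a duplicate to leave the cached object untouched.
copy = second.dup
copy.path = '/changed'
```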

.parser ⇒ URI::Parser

Returns cached URI parser.

Returns:

  • (URI::Parser)

    cached URI parser



# File 'lib/arachni/uri.rb', line 70

def self.parser
    CACHE[__method__]
end

.ruby_parse(url) ⇒ URI

Normalizes url and uses Ruby’s core URI lib to parse it.

ATTENTION: This method’s results are cached for performance reasons. If you plan on doing something destructive with its return value, duplicate it first because there may be references to it elsewhere.

Parameters:

  • url (String)

    URL to parse

Returns:

  • (URI)



# File 'lib/arachni/uri.rb', line 144

def self.ruby_parse( url )
    return url if url.to_s.empty? || url.is_a?( ::URI )
    CACHE[__method__][url] ||= begin
        ::URI::Generic.build( cheap_parse( url ) )
    rescue
        begin
            parser.parse( normalize( url ).dup )
        rescue => e
            print_error "Failed to parse '#{url}'."
            #print_error "Error: #{e}"
            #print_error_backtrace( e )
            nil
        end
    end
end

.to_absolute(relative, reference = Options.instance.url.to_s) ⇒ String

Normalizes and converts a relative URL to an absolute one by merging it with a reference URL.

Essentially a cached version of #to_absolute.

ATTENTION: This method’s results are cached for performance reasons. If you plan on doing something destructive with its return value, duplicate it first because there may be references to it elsewhere.

Parameters:

  • relative (String)
  • reference (String) (defaults to: Options.instance.url.to_s)

    absolute url to use as a reference

Returns:

  • (String)

    absolute URL (frozen)



# File 'lib/arachni/uri.rb', line 354

def self.to_absolute( relative, reference = Options.instance.url.to_s )
    return reference if !relative || relative.empty?
    key = relative + ' :: ' + reference

    cache = CACHE[__method__]
    begin
        if (v = cache[key]) && v == :err
            return
        elsif v
            return v
        end

        parsed_ref = parse( reference )

        # scheme-less URLs are expensive to parse so let's resolve the issue here
        relative = "#{parsed_ref.scheme}:#{relative}" if relative.start_with?( '//' )

        cache[key] = parse( relative ).to_absolute( parsed_ref ).to_s.freeze
    rescue# => e
          #ap relative
          #ap e
          #ap e.backtrace
        cache[key] = :err
        nil
    end
end
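
The resolution logic, including the shortcut for scheme-less (`//host/path`) URLs, can be sketched with Ruby's standard URI library. The helper name resolve is hypothetical and the sketch omits Arachni's normalization and caching.

```ruby
require 'uri'

# Hypothetical helper: resolves `relative` against `reference`.
def resolve(relative, reference)
  ref = ::URI.parse(reference)

  # Scheme-less URLs inherit the scheme of the reference.
  relative = "#{ref.scheme}:#{relative}" if relative.start_with?('//')

  ref.merge(relative).to_s
end

parent  = resolve('../c', 'http://example.com/a/b')
sibling = resolve('page.html?x=1', 'http://example.com/dir/index.html')
cdn     = resolve('//cdn.example.com/x.js', 'https://example.com/')
```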

Instance Method Details

#==(other) ⇒ Object



# File 'lib/arachni/uri.rb', line 472

def ==( other )
    to_s == other.to_s
end

#domain ⇒ String

Returns domain_name.tld.

Returns:

  • (String)

    domain_name.tld



# File 'lib/arachni/uri.rb', line 511

def domain
    s = host.split( '.' )
    return s.first if s.size == 1
    return host if s.size == 2

    s[1..-1].join( '.' )
end

#exclude?(patterns) ⇒ Bool

Checks if self should be excluded based on the provided patterns.

Parameters:

  • patterns (Array<Regexp,String>)

Returns:

  • (Bool)

    true if self matches a pattern, false otherwise



# File 'lib/arachni/uri.rb', line 537

def exclude?( patterns )
    fail TypeError.new( 'Array<Regexp,String> expected, got nil instead' ) if patterns.nil?
    ensure_patterns( patterns ).each { |pattern| return true if to_s =~ pattern }
    false
end

#in_domain?(include_subdomain, other) ⇒ Bool

Returns true if self is in the same domain as the other URL, false otherwise.

Parameters:

  • include_subdomain (Bool)

    Match subdomains too? If true will compare full hostnames, otherwise will discard subdomains.

  • other (Arachni::URI, URI, Hash, String)

    URL to compare it to

Returns:

  • (Bool)

    true if self is in the same domain as the other URL, false otherwise



# File 'lib/arachni/uri.rb', line 570

def in_domain?( include_subdomain, other )
    return true if !other

    other = self.class.new( other ) if !other.is_a?( Arachni::URI )
    include_subdomain ? other.host == host : other.domain == domain
end
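
The interplay between #domain and #in_domain? can be sketched on bare host names. Both function names below are hypothetical; domain_of copies the label-splitting logic of #domain, which drops only the leftmost label, so deeper subdomains keep part of their prefix.

```ruby
# Hypothetical copy of the #domain logic, operating on a host String.
def domain_of(host)
  labels = host.split('.')
  return labels.first if labels.size == 1
  return host if labels.size == 2

  labels[1..-1].join('.')
end

# Hypothetical counterpart of #in_domain? for two host Strings.
def same_site?(include_subdomain, host_a, host_b)
  include_subdomain ? host_a == host_b : domain_of(host_a) == domain_of(host_b)
end

domain_of('www.example.com')  # leftmost label removed
domain_of('a.b.example.com')  # still contains the "b" label
```

With include_subdomain set to true the full host names must match, so www.example.com and blog.example.com differ; with false they compare equal because both reduce to example.com.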

#include?(patterns) ⇒ Bool

Checks if self should be included based on the provided patterns.

Parameters:

  • patterns (Array<Regexp,String>)

Returns:

  • (Bool)

    true if self matches a pattern (or patterns are nil or empty), false otherwise



# File 'lib/arachni/uri.rb', line 551

def include?( patterns )
    fail TypeError.new( 'Array<Regexp,String> expected, got nil instead' ) if patterns.nil?

    rules = ensure_patterns( patterns )
    return true if !rules || rules.empty?

    rules.each { |pattern| return true if to_s =~ pattern }
    false
end
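
The difference between the two filters is how they treat an empty rule set: exclusion matches nothing, inclusion matches everything. A self-contained sketch follows; to_regexps is a hypothetical stand-in for the private ensure_patterns helper, whose source is not shown on this page, and the other names are likewise illustrative.

```ruby
# Hypothetical stand-in for ensure_patterns: accepts Regexps and Strings.
def to_regexps(patterns)
  raise TypeError, 'Array<Regexp,String> expected, got nil instead' if patterns.nil?

  Array(patterns).map { |p| p.is_a?(Regexp) ? p : Regexp.new(p.to_s) }
end

# True if the URL matches any exclusion pattern.
def excluded?(url, patterns)
  to_regexps(patterns).any? { |re| url =~ re }
end

# True if there are no inclusion patterns or the URL matches one of them.
def included?(url, patterns)
  rules = to_regexps(patterns)
  rules.empty? || rules.any? { |re| url =~ re }
end

excluded?('http://example.com/logout', ['logout'])
included?('http://example.com/', [])
```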

#to_absolute(reference) ⇒ Arachni::URI

Converts self into an absolute URL using reference to fill in the missing data.

Parameters:

  • reference (Arachni::URI, URI, String)

Returns:

  • (Arachni::URI)



# File 'lib/arachni/uri.rb', line 483

def to_absolute( reference )
    absolute = case reference
                   when Arachni::URI
                       reference.parsed_url
                   when ::URI
                       reference
                   else
                       self.class.new( reference.to_s ).parsed_url
               end.merge( @parsed_url )

    self.class.new( absolute )
end

#to_s ⇒ String

Returns URL.

Returns:

  • (String)

    URL



# File 'lib/arachni/uri.rb', line 578

def to_s
    @parsed_url.to_s
end

#too_deep?(depth) ⇒ Bool

Checks if self exceeds a given directory depth.

Parameters:

  • depth (Integer)

    depth to check for

Returns:

  • (Bool)

    true if self is deeper than depth, false otherwise



# File 'lib/arachni/uri.rb', line 526

def too_deep?( depth )
    depth > 0 && (depth + 1) <= path.count( '/' )
end
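
Because the check counts `/` characters rather than directories, it helps to try a few paths. The standalone function below (a hypothetical variant that takes the path as an argument) uses the same expression as the method above.

```ruby
# Hypothetical standalone variant of #too_deep? taking the path explicitly.
def too_deep?(path, depth)
  depth > 0 && (depth + 1) <= path.count('/')
end

shallow  = too_deep?('/a/file', 2)      # two slashes
deep     = too_deep?('/a/b/c/file', 2)  # four slashes
disabled = too_deep?('/a/b/c/file', 0)  # a depth of 0 never matches
```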

#up_to_pathString

Returns the URL up to its path component (no resource name, query, fragment, etc).

Returns:

  • (String)

    the URL up to its path component (no resource name, query, fragment, etc)
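
A rough standalone equivalent of this method, written against Ruby's standard URI library, makes the behaviour easy to try. The function name and its String argument are assumptions for illustration; like the method's own source, the sketch treats a path whose last segment has a file extension as a resource name and only omits port 80.

```ruby
require 'uri'

# Hypothetical standalone variant of #up_to_path.
def up_to_path(url)
  uri  = ::URI.parse(url)
  path = uri.path.dup

  # Drop the resource name only when it looks like a file (has an extension).
  path = File.dirname(path) unless File.extname(path).empty?
  path << '/' unless path.end_with?('/')

  base = "#{uri.scheme}://#{uri.host}"
  base << ":#{uri.port}" if uri.port && uri.port != 80
  base << path
end

page = up_to_path('http://example.com/dir/page.php?x=1')
dir  = up_to_path('http://example.com:8080/dir/sub')
```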



# File 'lib/arachni/uri.rb', line 498

def up_to_path
    uri_path = path.dup

    uri_path = File.dirname( uri_path ) if !File.extname( path ).empty?

    uri_path << '/' if uri_path[-1] != '/'

    uri_str = scheme + "://" + host
    uri_str << ':' + port.to_s if port && port != 80
    uri_str << uri_path
end