Class: Arachni::URI

Inherits:

Object

Object
Arachni::URI

show all

Extended by:: Arachni::UI::Output, Utilities

Includes:: Arachni::UI::Output, Utilities

Defined in:: lib/arachni/uri.rb

Overview

The URI class automatically normalizes the URLs it is passed to parse while maintaining compatibility with Ruby’s URI core classes by delegating missing method to it – thus, you can treat it like a Ruby URI and enjoy some extra perks along the line.

It also provides cached (to maintain low-latency) helper class methods to ease common operations such as:

normalization
parsing to Arachni::URI (see also URI), ::URI or Hash objects.
conversion to absolute URLs

Author:

Tasos “Zapotek” Laskos <[email protected]>

Constant Summary collapse

CACHE_SIZES =

{
    parse:       600,
    ruby_parse:  600,
    cheap_parse: 600,
    normalize:   1000,
    to_absolute: 1000
}

CACHE =

{
    parser:      ::URI::Parser.new,
    ruby_parse:  Cache::RandomReplacement.new( CACHE_SIZES[:ruby_parse] ),
    parse:       Cache::RandomReplacement.new( CACHE_SIZES[:parse] ),
    cheap_parse: Cache::RandomReplacement.new( CACHE_SIZES[:cheap_parse] ),
    normalize:   Cache::RandomReplacement.new( CACHE_SIZES[:normalize] ),
    to_absolute: Cache::RandomReplacement.new( CACHE_SIZES[:to_absolute] )
}

Class Method Summary collapse

.addressable_parse(url) ⇒ Hash

Performs a parse using the URI::Addressable lib while normalizing the URL (will also discard the fragment).
.cheap_parse(url) ⇒ Hash

Performs a parse that is less resource intensive than Ruby’s URI lib’s method while normalizing the URL (will also discard the fragment).
.decode(string) ⇒ String

URL decodes a string.
.deep_decode(string) ⇒ String

Iteratively URL decodes a String until there are no more characters to be unescaped.
.encode(string, bad_characters = nil) ⇒ String

URL encodes a string.
.normalize(url) ⇒ String

Uses URI.cheap_parse to parse and normalize the URL and then converts it to a common String format.
.parse(url) ⇒ Object

Cached version of #initialize, if there’s a chance that the same URL will be needed to be parsed multiple times you should use this method.
.parser ⇒ URI::Parser

Cached URI parser.
.ruby_parse(url) ⇒ URI

Normalizes url and uses Ruby’s core URI lib to parse it.
.to_absolute(relative, reference = Options.instance.url.to_s) ⇒ String

Normalizes and converts a relative URL to an absolute one by merging in with a reference URL.

Instance Method Summary collapse

#==(other) ⇒ Object
#domain ⇒ String

Domain_name.tld.
#exclude?(patterns) ⇒ Bool

Checks if self should be excluded based on the provided patterns.
#in_domain?(include_subdomain, other) ⇒ Bool

true if self is in the same domain as the other URL, false otherwise.
#include?(patterns) ⇒ Bool

Checks if self should be included based on the provided patterns.
#initialize(url) ⇒ URI constructor

Normalizes and parses the provided URL.
#to_absolute(reference) ⇒ Arachni::URI

Converts self into an absolute URL using reference to fill in the missing data.
#to_s ⇒ String

URL.
#too_deep?(depth) ⇒ Bool

Checks if self exceeds a given directory depth.
#up_to_path ⇒ String

The URL up to its path component (no resource name, query, fragment, etc).

Methods included from Arachni::UI::Output

debug?, debug_off, debug_on, disable_only_positives, flush_buffer, mute, muted?, old_reset_output_options, only_positives, only_positives?, print_bad, print_debug, print_debug_backtrace, print_debug_pp, print_error, print_error_backtrace, print_info, print_line, print_ok, print_status, print_verbose, reroute_to_file, reroute_to_file?, reset_output_options, set_buffer_cap, uncap_buffer, unmute, verbose, verbose?

Methods included from Utilities

cookie_encode, cookies_from_document, cookies_from_file, cookies_from_response, exception_jail, exclude_path?, extract_domain, form_decode, form_encode, form_parse_request_body, forms_from_document, forms_from_response, get_path, hash_keys_to_str, html_decode, html_encode, include_path?, links_from_document, links_from_response, normalize_url, page_from_response, page_from_url, parse_query, parse_set_cookie, parse_url_vars, path_in_domain?, path_too_deep?, remove_constants, seed, skip_path?, uri_decode, uri_encode, uri_parse, uri_parser, url_sanitize

Constructor Details

#initialize(url) ⇒ `URI`

Normalizes and parses the provided URL.

Will discard the fragment component, if there is one.

Parameters:

url (Arachni::URI, String, URI, Hash) —

String URL to parse, URI to convert, or a Hash holding URL components (for ::URI::Generic.build). Also accepts Arachni::URI for convenience.

# File 'lib/arachni/uri.rb', line 447

def initialize( url )
    @arachni_opts = Options.instance

    @parsed_url = case url
                      when String
                          self.class.ruby_parse( url )

                      when ::URI
                          url.dup

                      when Hash
                          ::URI::Generic.build( url )

                      when Arachni::URI
                          self.parsed_url = url.parsed_url.dup

                      else
                          to_string = url.to_s rescue ''
                          msg = "Argument must either be String, URI or Hash"
                          msg << " -- #{url.class.name} '#{to_string}' passed."
                          fail TypeError.new( msg )
                  end
    fail 'Failed to parse URL.' if !@parsed_url
end

Dynamic Method Handling

This class handles dynamic methods through the method_missing method

#method_missing(sym, *args, &block) ⇒ `Object` (private)

Delegates unimplemented methods to Ruby’s URI::Generic class for compatibility.

# File 'lib/arachni/uri.rb', line 606

def method_missing( sym, *args, &block )
    if @parsed_url.respond_to?( sym )
        @parsed_url.send( sym, *args, &block )
    else
        super
    end
end

Class Method Details

.addressable_parse(url) ⇒ `Hash`

Performs a parse using the URI::Addressable lib while normalizing the URL (will also discard the fragment).

This method is not cached and solely exists as a fallback used by cheap_parse.

Parameters:

url (String)

Returns:

(Hash) —
URL components: Hash fields:
- scheme – HTTP or HTTPS
- userinfo – username:password
- host
- port
- path
- query
The Hash is suitable for passing to ::URI::Generic.build – if however you plan on doing that you’ll be better off just using ruby_parse which does the same thing and caches the results for some extra schnell.

# File 'lib/arachni/uri.rb', line 326

def self.addressable_parse( url )
    u = Addressable::URI.parse( html_decode( url.to_s ) ).normalize
    u.fragment = nil
    h = u.to_hash

    h[:path].gsub!( /\/+/, '/' ) if h[:path]
    if h[:user]
        h[:userinfo] = h.delete( :user )
        h[:userinfo] << ":#{h.delete( :password )}" if h[:password]
    end
    h
end

.cheap_parse(url) ⇒ `Hash`

Performs a parse that is less resource intensive than Ruby’s URI lib’s method while normalizing the URL (will also discard the fragment).

ATTENTION: This method’s results are cached for performance reasons. If you plan on doing something destructive with its return value duplicate it first because there may be references to it elsewhere.

Parameters:

url (String)

Returns:

(Hash) —
URL components (frozen): Hash fields:
- scheme – HTTP or HTTPS
- userinfo – username:password
- host
- port
- path
- query
The Hash is suitable for passing to ::URI::Generic.build – if however you plan on doing that you’ll be better off just using ruby_parse which does the same thing and caches the results for some extra schnell.

# File 'lib/arachni/uri.rb', line 183

def self.cheap_parse( url )
    return if !url || url.empty?

    cache = CACHE[__method__]

    url   = url.to_s.dup
    c_url = url.to_s.dup

    components = {
        scheme:   nil,
        userinfo: nil,
        host:     nil,
        port:     nil,
        path:     nil,
        query:    nil
    }

    valid_schemes = %w(http https)

    begin
        if (v = cache[url]) && v == :err
            return
        elsif v
            return v
        end

        # we're not smart enough for scheme-less URLs and if we're to go
        # into heuristics then there's no reason to not just use Addressable's parser
        if url.start_with?( '//' )
            return cache[c_url] = addressable_parse( c_url ).freeze
        end

        url = url.encode( 'UTF-8', undef: :replace, invalid: :replace )

        # remove the fragment if there is one
        url = url.split( '#', 2 )[0...-1].join if url.include?( '#' )

        url = html_decode( url )

        dupped_url = url.dup
        has_path = true

        splits = url.split( ':' )
        if !splits.empty? && valid_schemes.include?( splits.first.downcase )
            splits = url.split( '://', 2 )
            components[:scheme] = splits.shift
            components[:scheme].downcase! if components[:scheme]

            if url = splits.shift
                splits = url.split( '?' ).first.split( '@', 2 )

                if splits.size > 1
                    components[:userinfo] = splits.first
                    url = splits.shift
                end

                if !splits.empty?
                    splits = splits.last.split( '/', 2 )
                    url = splits.last

                    splits = splits.first.split( ':', 2 )
                    if splits.size == 2
                        host = splits.first
                        components[:port] = Integer( splits.last ) if splits.last && !splits.last.empty?
                        components[:port] = nil if components[:port] == 80
                        url.gsub!( ':' + components[:port].to_s, '' )
                    else
                        host = splits.last
                    end

                    if components[:host] = host
                        url.gsub!( host, '' )
                        components[:host].downcase!
                    end
                else
                    has_path = false
                end
            else
                has_path = false
            end
        end

        if has_path
            splits = url.split( '?', 2 )
            if components[:path] = splits.shift
                components[:path] = '/' + components[:path] if components[:scheme]
                components[:path].gsub!( /\/+/, '/' )
                components[:path] =
                    encode( decode( components[:path] ),
                            Addressable::URI::CharacterClasses::PATH )
            end

            if c_url.include?( '?' ) && !(query = dupped_url.split( '?', 2 ).last).empty?
                components[:query] = (query.split( '&', -1 ).map do |pair|
                    encode( decode( pair ), Addressable::URI::CharacterClasses::QUERY.sub( '\\&', '' ) )
                end).join( '&' )
            end
        end

        components[:path] ||= components[:scheme] ? '/' : nil

        cache[c_url] = components.inject({}) do |h, (k, val)|
            h.merge!( Hash[{ k => val.freeze }] )
        end.freeze
    rescue => e
        begin
            print_error "Failed to fast-parse '#{c_url}', falling back to slow-parse."
            #print_error "Error: #{e}"
            #print_error_backtrace( e )

            cache[c_url] = addressable_parse( c_url ).freeze
        rescue => ex
            print_error "Failed to parse '#{c_url}'."
            #print_error "Error: #{ex}"
            #print_error_backtrace( ex )

            cache[c_url] = :err
            nil
        end
    end
end

.decode(string) ⇒ `String`

URL decodes a string.

Parameters:

string (String)

Returns:

(String)



95
96
97

# File 'lib/arachni/uri.rb', line 95

def self.decode( string )
    Addressable::URI.unencode( string )
end

.deep_decode(string) ⇒ `String`

Iteratively URL decodes a String until there are no more characters to be unescaped.

Parameters:

string (String)

Returns:

(String)



107
108
109

# File 'lib/arachni/uri.rb', line 107

def self.deep_decode( string )
    string = decode( string ) while string =~ /%[a-fA-F0-9]{2}/
end

.encode(string, bad_characters = nil) ⇒ `String`

URL encodes a string.

Parameters:

string (String)
bad_characters (String, Regexp) (defaults to: nil) —

Class of characters to encode – if String is passed, it should formatted as a regexp (for Regexp.new).

Returns:

(String) —

encoded string



84
85
86

# File 'lib/arachni/uri.rb', line 84

def self.encode( string, bad_characters = nil )
    Addressable::URI.encode_component( *[string, bad_characters].compact )
end

.normalize(url) ⇒ `String`

Uses cheap_parse to parse and normalize the URL and then converts it to a common String format.

Parameters:

url (String)

Returns:

(String) —

normalized URL (frozen)

# File 'lib/arachni/uri.rb', line 393

def self.normalize( url )
    return if !url || url.empty?

    cache = CACHE[__method__]

    url   = url.to_s.strip.dup
    c_url = url.to_s.strip.dup

    begin
        if (v = cache[url]) && v == :err
            return
        elsif v
            return v
        end

        components = cheap_parse( url )

        #ap components
        normalized = ''
        normalized << components[:scheme] + '://' if components[:scheme]

        if components[:userinfo]
            normalized << components[:userinfo]
            normalized << '@'
        end

        if components[:host]
            normalized << components[:host]
            normalized << ':' + components[:port].to_s if components[:port]
        end

        normalized << components[:path] if components[:path]
        normalized << '?' + components[:query] if components[:query]

        cache[c_url] = normalized.freeze
    rescue => e
        print_error "Failed to normalize '#{c_url}'."
        #print_error "Error: #{e}"
        #print_error_backtrace( e )

        cache[c_url] = :err
        nil
    end
end

.parse(url) ⇒ `Object`

Cached version of #initialize, if there’s a chance that the same URL will be needed to be parsed multiple times you should use this method.

.parser ⇒ `URI::Parser`

Returns cached URI parser.

Returns:

(URI::Parser) —

cached URI parser



70
71
72

# File 'lib/arachni/uri.rb', line 70

def self.parser
    CACHE[__method__]
end

.ruby_parse(url) ⇒ `URI`

Normalizes url and uses Ruby’s core URI lib to parse it.

Parameters:

url (String) —

URL to parse

Returns:

(URI)

# File 'lib/arachni/uri.rb', line 144

def self.ruby_parse( url )
    return url if url.to_s.empty? || url.is_a?( ::URI )
    CACHE[__method__][url] ||= begin
        ::URI::Generic.build( cheap_parse( url ) )
    rescue
        begin
            parser.parse( normalize( url ).dup )
        rescue => e
            print_error "Failed to parse '#{url}'."
            #print_error "Error: #{e}"
            #print_error_backtrace( e )
            nil
        end
    end
end

.to_absolute(relative, reference = Options.instance.url.to_s) ⇒ `String`

Normalizes and converts a relative URL to an absolute one by merging in with a reference URL.

Pretty much a cached version of #to_absolute.

Parameters:

relative (String)
reference (String) (defaults to: Options.instance.url.to_s) —

absolute url to use as a reference

Returns:

(String) —

absolute URL (frozen)

# File 'lib/arachni/uri.rb', line 354

def self.to_absolute( relative, reference = Options.instance.url.to_s )
    return reference if !relative || relative.empty?
    key = relative + ' :: ' + reference

    cache = CACHE[__method__]
    begin
        if (v = cache[key]) && v == :err
            return
        elsif v
            return v
        end

        parsed_ref = parse( reference )

        # scheme-less URLs are expensive to parse so let's resolve the issue here
        relative = "#{parsed_ref.scheme}:#{relative}" if relative.start_with?( '//' )

        cache[key] = parse( relative ).to_absolute( parsed_ref ).to_s.freeze
    rescue# => e
          #ap relative
          #ap e
          #ap e.backtrace
        cache[key] = :err
        nil
    end
end

Instance Method Details

#==(other) ⇒ `Object`



472
473
474

# File 'lib/arachni/uri.rb', line 472

def ==( other )
    to_s == other.to_s
end

#domain ⇒ `String`

Returns domain_name.tld.

Returns:

(String) —

domain_name.tld

# File 'lib/arachni/uri.rb', line 511

def domain
    s = host.split( '.' )
    return s.first if s.size == 1
    return host if s.size == 2

    s[1..-1].join( '.' )
end

#exclude?(patterns) ⇒ `Bool`

Checks if self should be excluded based on the provided patterns.

Parameters:

patterns (Array<Regexp,String>)

Returns:

(Bool) —

true if self matches a pattern, false otherwise

# File 'lib/arachni/uri.rb', line 537

def exclude?( patterns )
    fail TypeError.new( 'Array<Regexp,String> expected, got nil instead' ) if patterns.nil?
    ensure_patterns( patterns ).each { |pattern| return true if to_s =~ pattern }
    false
end

#in_domain?(include_subdomain, other) ⇒ `Bool`

Returns true if self is in the same domain as the other URL, false otherwise.

Parameters:

include_subdomain (Bool) —

Match subdomains too? If true will compare full hostnames, otherwise will discard subdomains.
other (Arachni::URI, URI, Hash, String) —

URL to compare it to

Returns:

(Bool) —

true if self is in the same domain as the other URL, false otherwise

# File 'lib/arachni/uri.rb', line 570

def in_domain?( include_subdomain, other )
    return true if !other

    other = self.class.new( other ) if !other.is_a?( Arachni::URI )
    include_subdomain ? other.host == host : other.domain == domain
end

#include?(patterns) ⇒ `Bool`

Checks if self should be included based on the provided patterns.

Parameters:

patterns (Array<Regexp,String>)

Returns:

(Bool) —

true if self matches a pattern (or patterns are nil or empty), false otherwise

# File 'lib/arachni/uri.rb', line 551

def include?( patterns )
    fail TypeError.new( 'Array<Regexp,String> expected, got nil instead' ) if patterns.nil?

    rules = ensure_patterns( patterns )
    return true if !rules || rules.empty?

    rules.each { |pattern| return true if to_s =~ pattern }
    false
end

#to_absolute(reference) ⇒ `Arachni::URI`

Converts self into an absolute URL using reference to fill in the missing data.

Parameters:

reference (Arachni::URI, URI, String) —

absolute URL

Returns:

(Arachni::URI) —

self as an absolute URL

# File 'lib/arachni/uri.rb', line 483

def to_absolute( reference )
    absolute = case reference
                   when Arachni::URI
                       reference.parsed_url
                   when ::URI
                       reference
                   else
                       self.class.new( reference.to_s ).parsed_url
               end.merge( @parsed_url )

    self.class.new( absolute )
end

#to_s ⇒ `String`

Returns URL.

Returns:

(String) —

URL



578
579
580

# File 'lib/arachni/uri.rb', line 578

def to_s
    @parsed_url.to_s
end

#too_deep?(depth) ⇒ `Bool`

Checks if self exceeds a given directory depth.

Parameters:

depth (Integer) —

depth to check for

Returns:

(Bool) —

true if self is deeper than depth, false otherwise



526
527
528

# File 'lib/arachni/uri.rb', line 526

def too_deep?( depth )
    depth > 0 && (depth + 1) <= path.count( '/' )
end

#up_to_path ⇒ `String`

Returns the URL up to its path component (no resource name, query, fragment, etc).

Returns:

(String) —

the URL up to its path component (no resource name, query, fragment, etc)

# File 'lib/arachni/uri.rb', line 498

def up_to_path
    uri_path = path.dup

    uri_path = File.dirname( uri_path ) if !File.extname( path ).empty?

    uri_path << '/' if uri_path[-1] != '/'

    uri_str = scheme + "://" + host
    uri_str << ':' + port.to_s if port && port != 80
    uri_str << uri_path
end

Class: Arachni::URI

Overview

Constant Summary collapse

Class Method Summary collapse

Instance Method Summary collapse

Methods included from Arachni::UI::Output

Methods included from Utilities

Constructor Details

#initialize(url) ⇒ URI

Dynamic Method Handling

#method_missing(sym, *args, &block) ⇒ Object (private)

Class Method Details

.addressable_parse(url) ⇒ Hash

.cheap_parse(url) ⇒ Hash

.decode(string) ⇒ String

.deep_decode(string) ⇒ String

.encode(string, bad_characters = nil) ⇒ String

.normalize(url) ⇒ String

.parse(url) ⇒ Object

.parser ⇒ URI::Parser

.ruby_parse(url) ⇒ URI

.to_absolute(relative, reference = Options.instance.url.to_s) ⇒ String

Instance Method Details

#==(other) ⇒ Object

#domain ⇒ String

#exclude?(patterns) ⇒ Bool

#in_domain?(include_subdomain, other) ⇒ Bool

#include?(patterns) ⇒ Bool

#to_absolute(reference) ⇒ Arachni::URI

#to_s ⇒ String

#too_deep?(depth) ⇒ Bool

#up_to_path ⇒ String

#initialize(url) ⇒ `URI`

#method_missing(sym, *args, &block) ⇒ `Object` (private)

.addressable_parse(url) ⇒ `Hash`

.cheap_parse(url) ⇒ `Hash`

.decode(string) ⇒ `String`

.deep_decode(string) ⇒ `String`

.encode(string, bad_characters = nil) ⇒ `String`

.normalize(url) ⇒ `String`

.parse(url) ⇒ `Object`

.parser ⇒ `URI::Parser`

.ruby_parse(url) ⇒ `URI`

.to_absolute(relative, reference = Options.instance.url.to_s) ⇒ `String`

#==(other) ⇒ `Object`

#domain ⇒ `String`

#exclude?(patterns) ⇒ `Bool`

#in_domain?(include_subdomain, other) ⇒ `Bool`

#include?(patterns) ⇒ `Bool`

#to_absolute(reference) ⇒ `Arachni::URI`

#to_s ⇒ `String`

#too_deep?(depth) ⇒ `Bool`

#up_to_path ⇒ `String`