Class: Anemone::Page

Inherits: Object

Includes: Arachni::UI::Output

Defined in: lib/anemone/page.rb

Instance Attribute Summary

#body ⇒ Object readonly

The raw HTTP response body of the page.
#code ⇒ Object

Integer response code of the page.
#data ⇒ Object

OpenStruct for user-stored data.
#depth ⇒ Object

Depth of this page from the root of the crawl.
#error ⇒ Object readonly

Exception object, if one was raised during HTTP#fetch_page.
#headers ⇒ Object readonly

Headers of the HTTP response.
#redirect_to ⇒ Object readonly

URL of the page this one redirected to, if any.
#referer ⇒ Object

URL of the page that brought us to this page.
#response_time ⇒ Object

Response time of the request for this page in milliseconds.
#url ⇒ Object readonly

The URL of the page.
#visited ⇒ Object

Boolean indicating whether or not this page has been visited in PageStore#shortest_paths!.

Class Method Summary

.from_hash(hash) ⇒ Object

Instance Method Summary

#base ⇒ Object
#content_type ⇒ Object

The content-type returned by the HTTP request for this page.
#cookies ⇒ Object

Array of cookies received with this page as WEBrick::Cookie objects.
#dir(url) ⇒ Object
#discard_doc! ⇒ Object

Delete the Nokogiri document and response body to conserve memory.
#doc ⇒ Object

Nokogiri document for the HTML body.
#extract_domain(url) ⇒ String

Extracts the domain from a URI object.
#fetched? ⇒ Boolean

Was the page successfully fetched? true if a response code was recorded for the page, false otherwise.
#html? ⇒ Boolean

Returns true if the page is an HTML document, returns false otherwise.
#in_domain?(uri) ⇒ Boolean

Returns true if uri is in the same domain as the page, returns false otherwise.
#initialize(url, params = {}) ⇒ Page constructor

Create a new page.
#links ⇒ Array<URI>

Array of distinct links to follow.
#marshal_dump ⇒ Object
#marshal_load(ary) ⇒ Object
#not_found? ⇒ Boolean

Returns true if the page was not found (returned 404 code), returns false otherwise.
#redirect? ⇒ Boolean

Returns true if the page is an HTTP redirect, returns false otherwise.
#run_modules ⇒ Array

Runs all Spider (path extraction) modules and returns an array of paths.
#to_absolute(link) ⇒ Object

Converts relative URL link into an absolute URL based on the location of the page.
#to_hash ⇒ Object

Methods included from Arachni::UI::Output

#buffer, #debug!, #debug?, #flush_buffer, #mute!, #muted?, #only_positives!, #only_positives?, #print_debug, #print_debug_backtrace, #print_debug_pp, #print_error, #print_error_backtrace, #print_info, #print_line, #print_ok, #print_status, #print_verbose, #reroute_to_file, #reroute_to_file?, #unmute!, #verbose!, #verbose?

Constructor Details

#initialize(url, params = {}) ⇒ `Page`

Create a new page

# File 'lib/anemone/page.rb', line 88

def initialize(url, params = {})
  @url = url
  @data = OpenStruct.new

  @code = params[:code]
  @headers = params[:headers] || {}
  @headers['content-type'] ||= ['']
  @aliases = Array(params[:aka]).compact
  @referer = params[:referer]
  @depth = params[:depth] || 0
  @redirect_to = to_absolute(params[:redirect_to])
  @response_time = params[:response_time]
  @body = params[:body]
  @error = params[:error]

  @fetched = !params[:code].nil?
end
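
The defaults applied by the constructor can be illustrated with a small stand-in class. This is only a sketch: `PageSketch` is a hypothetical name, not part of Anemone, and it copies just the header default, the depth default, and the fetched flag from the listing above.

```ruby
require 'uri'

# Hypothetical stand-in for Anemone::Page; mirrors only a few constructor defaults.
class PageSketch
  attr_reader :url, :code, :headers, :depth

  def initialize(url, params = {})
    @url     = url
    @code    = params[:code]
    @headers = params[:headers] || {}
    # Header values are arrays, so a missing content-type becomes [''].
    @headers['content-type'] ||= ['']
    @depth   = params[:depth] || 0
    # A page counts as fetched only when a response code was supplied.
    @fetched = !params[:code].nil?
  end

  def fetched?
    @fetched
  end

  def content_type
    headers['content-type'].first
  end
end

ok     = PageSketch.new(URI('http://example.com/'), code: 200,
                        headers: { 'content-type' => ['text/html'] })
failed = PageSketch.new(URI('http://example.com/missing'))

puts ok.fetched?               # response code present
puts failed.fetched?           # no response code supplied
puts failed.content_type.inspect
```

Because the flag depends only on `params[:code]`, a page built without a response code reports `fetched?` as false even when no `:error` was passed.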

Instance Attribute Details

#body ⇒ `Object` (readonly)

The raw HTTP response body of the page




# File 'lib/anemone/page.rb', line 63

def body
  @body
end

#code ⇒ `Object`

Integer response code of the page




# File 'lib/anemone/page.rb', line 74

def code
  @code
end

#data ⇒ `Object`

OpenStruct for user-stored data




# File 'lib/anemone/page.rb', line 72

def data
  @data
end

#depth ⇒ `Object`

Depth of this page from the root of the crawl. This is not necessarily the shortest path; use PageStore#shortest_paths! to find that value.




# File 'lib/anemone/page.rb', line 79

def depth
  @depth
end

#error ⇒ `Object` (readonly)

Exception object, if one was raised during HTTP#fetch_page




# File 'lib/anemone/page.rb', line 69

def error
  @error
end

#headers ⇒ `Object` (readonly)

Headers of the HTTP response




# File 'lib/anemone/page.rb', line 65

def headers
  @headers
end

#redirect_to ⇒ `Object` (readonly)

URL of the page this one redirected to, if any




# File 'lib/anemone/page.rb', line 67

def redirect_to
  @redirect_to
end

#referer ⇒ `Object`

URL of the page that brought us to this page




# File 'lib/anemone/page.rb', line 81

def referer
  @referer
end

#response_time ⇒ `Object`

Response time of the request for this page in milliseconds




# File 'lib/anemone/page.rb', line 83

def response_time
  @response_time
end

#url ⇒ `Object` (readonly)

The URL of the page




# File 'lib/anemone/page.rb', line 61

def url
  @url
end

#visited ⇒ `Object`

Boolean indicating whether or not this page has been visited in PageStore#shortest_paths!




# File 'lib/anemone/page.rb', line 76

def visited
  @visited
end

Class Method Details

.from_hash(hash) ⇒ `Object`

# File 'lib/anemone/page.rb', line 317

def self.from_hash(hash)
  page = self.new(URI(hash['url']))
  {'@headers' => Marshal.load(hash['headers']),
   '@data' => Marshal.load(hash['data']),
   '@body' => hash['body'],
   '@links' => hash['links'].map { |link| URI(link) },
   '@code' => hash['code'].to_i,
   '@visited' => hash['visited'],
   '@depth' => hash['depth'].to_i,
   '@referer' => hash['referer'],
   '@redirect_to' => URI(hash['redirect_to']),
   '@response_time' => hash['response_time'].to_i,
   '@fetched' => hash['fetched']
  }.each do |var, value|
    page.instance_variable_set(var, value)
  end
  page
end

Instance Method Details

#base ⇒ `Object`

# File 'lib/anemone/page.rb', line 252

def base
  begin
    tmp = doc.search( '//base[@href]' )
    return tmp[0]['href'].dup
  rescue
    return
  end
end

#content_type ⇒ `Object`

The content-type returned by the HTTP request for this page




# File 'lib/anemone/page.rb', line 200

def content_type
  headers['content-type'].first
end

#cookies ⇒ `Object`

Array of cookies received with this page as WEBrick::Cookie objects.




# File 'lib/anemone/page.rb', line 193

def cookies
  WEBrick::Cookie.parse_set_cookies(@headers['Set-Cookie']) rescue []
end

#dir(url) ⇒ `Object`




# File 'lib/anemone/page.rb', line 132

def dir( url )
    URI( File.dirname( URI( url.to_s ).path ) + '/' )
end
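
As a rough illustration of what `#dir` returns, the sketch below evaluates the same expression outside the class. `dir_of` is a hypothetical free-function name used only here.

```ruby
require 'uri'

# Same expression as #dir, written as a free function for illustration.
def dir_of(url)
  URI(File.dirname(URI(url.to_s).path) + '/')
end

result = dir_of('http://example.com/a/b/c.html')
puts result               # path of the parent directory
puts result.host.inspect  # the scheme and host are not carried over
```

The result is a path-only URI, so a caller that needs an absolute URL would still have to resolve it against the page URL, for example with `#to_absolute`.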

#discard_doc! ⇒ `Object`

Delete the Nokogiri document and response body to conserve memory

# File 'lib/anemone/page.rb', line 177

def discard_doc!
  links # force parsing of page links before we trash the document
  @doc = @body = nil
end

#doc ⇒ `Object`

Nokogiri document for the HTML body

# File 'lib/anemone/page.rb', line 166

def doc
  type = Arachni::HTTP.content_type( @headers )
  return if type.is_a?( String) && !type.substring?( 'text' )

  return @doc if @doc
  @doc = Nokogiri::HTML( @body ) if @body rescue nil
end

#extract_domain(url) ⇒ `String`

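Extracts the domain (the last two dot-separated labels of the host) from a URI object. As the source below shows, the return value is not always a String: the method returns false when the URI has no host and true when the host consists of a single label.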

Parameters:

url (URI)

Returns:

(String)

# File 'lib/anemone/page.rb', line 282

def extract_domain( url )

    if !url.host then return false end

    splits = url.host.split( /\./ )

    if splits.length == 1 then return true end

    splits[-2] + "." + splits[-1]
end
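
To make the return values concrete, the following sketch copies the logic into a free function (`extract_domain_sketch` is a hypothetical name) and runs it on a few hosts.

```ruby
require 'uri'

# Copy of the #extract_domain logic as a free function (illustration only).
def extract_domain_sketch(url)
  return false unless url.host

  splits = url.host.split('.')
  return true if splits.length == 1

  "#{splits[-2]}.#{splits[-1]}"
end

puts extract_domain_sketch(URI('http://www.example.com/')).inspect
puts extract_domain_sketch(URI('http://localhost/')).inspect
puts extract_domain_sketch(URI('/relative/path')).inspect
puts extract_domain_sketch(URI('http://www.example.co.uk/')).inspect
```

Keeping only the last two labels means a host under a multi-label public suffix such as `co.uk` is reduced to the suffix itself, so unrelated sites under that suffix would compare as the same domain.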

#fetched? ⇒ `Boolean`

Was the page successfully fetched? true if a response code was recorded for the page (the constructor sets the flag from params[:code]), false otherwise.

Returns:

(Boolean)




# File 'lib/anemone/page.rb', line 186

def fetched?
  @fetched
end

#html? ⇒ `Boolean`

Returns true if the page is an HTML document, returns false otherwise.

Returns:

(Boolean)




# File 'lib/anemone/page.rb', line 208

def html?
  !!(content_type =~ %r{^(text/html|application/xhtml+xml)\b})
end
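
One detail of the pattern above is worth checking: inside a regular expression an unescaped `+` is a quantifier, not a literal plus sign. The sketch below compares the pattern as listed with an escaped variant. The escaped pattern and the helper name `html_type?` are assumptions made for illustration, not part of the library.

```ruby
# Pattern as it appears in the listing: '+' quantifies the preceding 'l'.
unescaped = %r{^(text/html|application/xhtml+xml)\b}
# Variant with the plus sign escaped so that it matches literally.
escaped   = %r{^(text/html|application/xhtml\+xml)\b}

# Hypothetical helper applying the same test as #html? to a given pattern.
def html_type?(content_type, pattern)
  !!(content_type =~ pattern)
end

['text/html; charset=utf-8', 'application/xhtml+xml', 'image/png'].each do |type|
  puts "#{type}: #{html_type?(type, unescaped)} / #{html_type?(type, escaped)}"
end
```

With the pattern as listed, a content type of `application/xhtml+xml` is not recognised as HTML; only the escaped variant matches it.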

#in_domain?(uri) ⇒ `Boolean`

Returns true if uri is in the same domain as the page, returns false otherwise.

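When Arachni::Options.instance.follow_subdomains is set, the comparison uses the domain returned by #extract_domain instead of the full host, which enables optional subdomain crawling.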

Returns:

(Boolean)

# File 'lib/anemone/page.rb', line 267

def in_domain?( uri )
    if( Arachni::Options.instance.follow_subdomains )
        return extract_domain( uri ) ==  extract_domain( @url )
    end

    uri.host == @url.host
end
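
The effect of the subdomain option can be sketched without Arachni loaded. In the example below the `follow_subdomains:` keyword stands in for `Arachni::Options.instance.follow_subdomains`, and `registered_domain` is a simplified, hypothetical replacement for `#extract_domain` that assumes the host is present.

```ruby
require 'uri'

# Last two host labels (simplified: assumes the URI has a host).
def registered_domain(uri)
  uri.host.split('.').last(2).join('.')
end

# follow_subdomains stands in for Arachni::Options.instance.follow_subdomains.
def in_domain_sketch?(page_url, uri, follow_subdomains:)
  return registered_domain(uri) == registered_domain(page_url) if follow_subdomains

  uri.host == page_url.host
end

page = URI('http://www.example.com/')
blog = URI('http://blog.example.com/post')

puts in_domain_sketch?(page, blog, follow_subdomains: false)
puts in_domain_sketch?(page, blog, follow_subdomains: true)
```

With the option off, only an exact host match counts as in-domain; with it on, any host sharing the last two labels does.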

#links ⇒ `Array<URI>`

Array of distinct links to follow

Returns:

(Array<URI>)

# File 'lib/anemone/page.rb', line 141

def links
  return @links unless @links.nil?
  @links = []
  return @links if !doc

  run_modules( ).each {
      |path|
      next if path.nil? or path.empty?
      abs = to_absolute( URI( path ) ) rescue next

      if in_domain?( abs )
          @links << abs
          # force dir listing
          # ap to_absolute( get_path( abs.to_s ).to_s ).to_s
          # @links << to_absolute( dir( abs.to_s ).to_s ) rescue next
      end
  }

  @links.uniq!
  return @links
end
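
The filtering steps in `#links` (skip blank paths, resolve to absolute, keep in-domain URLs, de-duplicate) can be sketched in isolation. Here `paths` stands in for the output of `#run_modules`, `page_url` for `@url`, and the domain check is reduced to a plain host comparison; `collect_links` is a hypothetical name.

```ruby
require 'uri'

# paths stands in for the output of #run_modules; page_url for @url.
def collect_links(page_url, paths)
  links = []
  paths.each do |path|
    next if path.nil? || path.empty?

    abs = begin
      page_url.merge(URI(path))
    rescue URI::Error
      next # skip anything that cannot be parsed or merged
    end

    links << abs if abs.host == page_url.host
  end
  links.uniq
end

page_url = URI('http://example.com/index.html')
paths    = ['/about', nil, '', 'about', 'http://other.example.org/', 'bad path']
links    = collect_links(page_url, paths)
puts links.map(&:to_s).inspect
```

Both `/about` and `about` resolve to the same absolute URL here, which is why the final de-duplication step matters.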

#marshal_dump ⇒ `Object`




# File 'lib/anemone/page.rb', line 294

def marshal_dump
  [@url, @headers, @data, @body, @links, @code, @visited, @depth, @referer, @redirect_to, @response_time, @fetched]
end

#marshal_load(ary) ⇒ `Object`




# File 'lib/anemone/page.rb', line 298

def marshal_load(ary)
  @url, @headers, @data, @body, @links, @code, @visited, @depth, @referer, @redirect_to, @response_time, @fetched = ary
end

#not_found? ⇒ `Boolean`

Returns true if the page was not found (returned 404 code), returns false otherwise.

Returns:

(Boolean)




# File 'lib/anemone/page.rb', line 224

def not_found?
  404 == @code
end

#redirect? ⇒ `Boolean`

Returns true if the page is an HTTP redirect (response code in the range 300..307), returns false otherwise.

Returns:

(Boolean)




# File 'lib/anemone/page.rb', line 216

def redirect?
  (300..307).include?(@code)
end
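
A quick way to see which status codes the range covers is to run the same test over a handful of values. `redirect_code?` is a hypothetical free-function copy of the check.

```ruby
# Same range test as #redirect?, as a free function for illustration.
def redirect_code?(code)
  (300..307).include?(code)
end

[200, 301, 302, 307, 308, 404].each do |code|
  puts "#{code}: #{redirect_code?(code)}"
end
```

Note that 308 (Permanent Redirect) falls outside the range, so a page answering with that code would not be reported as a redirect by this check.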

#run_modules ⇒ `Array`

Runs all Spider (path extraction) modules and returns an array of paths

Returns:

(Array) — paths

# File 'lib/anemone/page.rb', line 111

def run_modules
    opts = Arachni::Options.instance
    require opts.dir['lib'] + 'component_manager'

    lib = opts.dir['root'] + 'path_extractors/'


    begin
        @@manager ||= ::Arachni::ComponentManager.new( lib, Extractors )

        return @@manager.available.map {
            |name|
            @@manager[name].new.run( doc )
        }.flatten.uniq

    rescue ::Exception => e
        print_error( e.to_s )
        print_debug_backtrace( e )
    end
end

#to_absolute(link) ⇒ `Object`

Converts relative URL link into an absolute URL based on the location of the page

# File 'lib/anemone/page.rb', line 232

def to_absolute(link)
  return nil if link.nil?

  # remove anchor
  link = URI.encode(link.to_s.gsub(/#[a-zA-Z0-9_-]*$/,''))

  if url = base
    base_url = URI(url)
  else
    base_url = @url.dup
  end

  relative = URI(link)
  absolute = base_url.merge(relative)

  absolute.path = '/' if absolute.path.empty?

  return absolute
end
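
The resolution step can be tried with the standard `uri` library alone. The sketch below is a simplified variant under stated assumptions: it skips the `<base href>` lookup and the `URI.encode` call (that method was removed in Ruby 3.0), and `to_absolute_sketch` is a hypothetical name.

```ruby
require 'uri'

# Simplified variant of #to_absolute: no <base> lookup and no URI.encode step.
def to_absolute_sketch(page_url, link)
  return nil if link.nil?

  # Strip a trailing fragment made of letters, digits, '_' or '-'.
  link = link.to_s.gsub(/#[a-zA-Z0-9_-]*$/, '')

  absolute = page_url.merge(URI(link))
  absolute.path = '/' if absolute.path.empty?
  absolute
end

page_url = URI('http://example.com/a/b.html')

puts to_absolute_sketch(page_url, 'page.html#top')
puts to_absolute_sketch(page_url, '../c.html')
puts to_absolute_sketch(URI('http://example.com'), '')
```

The fragment pattern only removes anchors made of the listed characters, so a fragment containing other characters (for example a `/`) would be left in place.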

#to_hash ⇒ `Object`

# File 'lib/anemone/page.rb', line 302

def to_hash
  {'url' => @url.to_s,
   'headers' => Marshal.dump(@headers),
   'data' => Marshal.dump(@data),
   'body' => @body,
   'links' => links.map(&:to_s),
   'code' => @code,
   'visited' => @visited,
   'depth' => @depth,
   'referer' => @referer.to_s,
   'redirect_to' => @redirect_to.to_s,
   'response_time' => @response_time,
   'fetched' => @fetched}
end
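
`#to_hash` and `.from_hash` are written as a pair: headers and data are stored as `Marshal` dumps and URLs as strings. The sketch below shows that round trip for a few fields only, using hypothetical helper names (`page_to_hash`, `page_from_hash`) rather than the real class.

```ruby
require 'uri'

# Hypothetical helper mirroring the shape of #to_hash for a few fields.
def page_to_hash(url, headers, links, code)
  { 'url'     => url.to_s,
    'headers' => Marshal.dump(headers), # binary string
    'links'   => links.map(&:to_s),
    'code'    => code }
end

# Hypothetical helper mirroring the shape of .from_hash for the same fields.
def page_from_hash(hash)
  { url:     URI(hash['url']),
    headers: Marshal.load(hash['headers']),
    links:   hash['links'].map { |link| URI(link) },
    code:    hash['code'].to_i }
end

headers  = { 'content-type' => ['text/html'] }
links    = [URI('http://example.com/about')]
stored   = page_to_hash(URI('http://example.com/'), headers, links, 200)
restored = page_from_hash(stored)

puts restored[:headers].inspect
puts restored[:links].map(&:to_s).inspect
```

Two points follow from the listings above. `Marshal.load` should only be applied to hashes from a trusted store, since it can instantiate arbitrary objects. Also, a nil `@redirect_to` is written out as an empty string by `#to_hash` and comes back from `.from_hash` as an empty URI rather than nil.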
