Class: Robotstxt::Parser

Inherits:
Object
Defined in:
lib/robotstxt/parser.rb


Constructor Details

#initialize(robot_id = nil) ⇒ Parser

Initializes a new Robotstxt::Parser instance with the robot_id option.

client = Robotstxt::Parser.new('my_robot_id')



# File 'lib/robotstxt/parser.rb', line 29

def initialize(robot_id = nil)
  # Default to the wildcard user-agent unless a robot_id is given
  @robot_id = '*'
  @rules = []
  @sitemaps = []
  @robot_id = robot_id.downcase unless robot_id.nil?
end
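
A minimal end-to-end sketch (using the same example host as the snippets below): create a parser, fetch the site's robots.txt with #get, then query #allowed? and #sitemaps.

client = Robotstxt::Parser.new('my_robot_id')
if client.get('http://www.simonerinzivillo.it')
  puts client.allowed?('http://www.simonerinzivillo.it/no-dir/')
  client.sitemaps.each { |url| puts url }
end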

Instance Attribute Details

#body ⇒ Object (readonly)

Returns the value of attribute body.



# File 'lib/robotstxt/parser.rb', line 23

def body
  @body
end

#found ⇒ Object (readonly)

Returns the value of attribute found.



# File 'lib/robotstxt/parser.rb', line 23

def found
  @found
end

#robot_id ⇒ Object

Returns the value of attribute robot_id.



# File 'lib/robotstxt/parser.rb', line 22

def robot_id
  @robot_id
end

#rules ⇒ Object (readonly)

Returns the value of attribute rules.



# File 'lib/robotstxt/parser.rb', line 23

def rules
  @rules
end

#sitemaps ⇒ Object (readonly)

Analyzes the robots.txt file and returns an Array containing the XML Sitemap URLs.

client = Robotstxt::Parser.new('my_robot_id')
if client.get('http://www.simonerinzivillo.it')
  client.sitemaps.each { |url|
    puts url
  }
end


# File 'lib/robotstxt/parser.rb', line 125

def sitemaps
  @sitemaps
end

Instance Method Details

#allowed?(var) ⇒ Boolean

Checks whether the URL is allowed to be crawled by the current robot_id.

client = Robotstxt::Parser.new('my_robot_id')
if client.get('http://www.simonerinzivillo.it')
  client.allowed?('http://www.simonerinzivillo.it/no-dir/')
end

This method returns true if the robots.txt file does not block access to the URL.

Returns:

  • (Boolean)


# File 'lib/robotstxt/parser.rb', line 94

def allowed?(var)
  is_allow = true
  url = URI.parse(var)
  querystring = url.query.nil? ? '' : '?' + url.query
  url_path = url.path + querystring

  @rules.each { |ua|
    # Apply the rules declared for this robot_id or for every user-agent ('*')
    if @robot_id == ua[0] || ua[0] == '*'
      ua[1].each { |d|
        # The URL is disallowed if its path matches a disallowed pattern
        is_allow = false if url_path.match('^' + d) || d == '/'
      }
    end
  }
  is_allow
end
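
As a usage sketch, the parsed rules can be queried repeatedly once #get has succeeded; the paths below are purely illustrative.

client = Robotstxt::Parser.new('my_robot_id')
if client.get('http://www.simonerinzivillo.it')
  ['/', '/no-dir/', '/page.html?id=1'].each do |path|
    puts "#{path}: #{client.allowed?('http://www.simonerinzivillo.it' + path)}"
  end
end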

#found? ⇒ Boolean

This method returns true if the robots.txt file was found and parsed successfully.

Returns:

  • (Boolean)


# File 'lib/robotstxt/parser.rb', line 131

def found?
  !!@found
end
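
A small sketch of the intended use: call #found? after #get to check whether a robots.txt body is available before reading #body or #rules.

client = Robotstxt::Parser.new('my_robot_id')
client.get('http://www.simonerinzivillo.it')
puts client.body if client.found?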

#get(hostname) ⇒ Object

Requests and parses the robots.txt file for the given hostname.

client = Robotstxt::Parser.new('my_robot_id')
client.get('http://www.simonerinzivillo.it')

This method returns true if the file was fetched and parsed successfully.



# File 'lib/robotstxt/parser.rb', line 47

def get(hostname)
  # Allow one retry on transient network errors
  @ehttp = true
  url = URI.parse(hostname)

  begin
    http = Net::HTTP.new(url.host, url.port)
    if url.scheme == 'https'
      http.verify_mode = OpenSSL::SSL::VERIFY_NONE
      http.use_ssl = true
    end

    response = http.request(Net::HTTP::Get.new('/robots.txt'))

    case response
    when Net::HTTPSuccess
      @found = true
      @body = response.body
      parse()
    else
      @found = false
    end

    return @found

  rescue Timeout::Error, Errno::EINVAL, Errno::ECONNRESET => e
    if @ehttp
      @ehttp = false
      retry
    else
      return nil
    end
  end
end
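
Because the request is retried once and nil is returned on persistent network errors, callers may want to distinguish nil (network failure) from false (no robots.txt found); a minimal sketch:

client = Robotstxt::Parser.new('my_robot_id')
case client.get('http://www.simonerinzivillo.it')
when nil   then puts 'network error while fetching robots.txt'
when false then puts 'robots.txt not found'
else            puts 'robots.txt fetched and parsed'
end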