Module: Robotstxt
- Defined in:
- lib/robotstxt.rb,
lib/robotstxt/common.rb,
lib/robotstxt/getter.rb,
lib/robotstxt/parser.rb
Overview
Provides a flexible interface to help authors of web-crawlers respect the robots.txt exclusion standard.
Defined Under Namespace
Modules: CommonMethods
Classes: Getter, Parser
Constant Summary
- NAME = 'Robotstxt'
- GEM = 'robotstxt'
- AUTHORS = ['Conrad Irwin <[email protected]>', 'Simone Rinzivillo <[email protected]>']
- VERSION = '1.0'
Class Method Summary
- .get(source, robot_id, options = {}) ⇒ Object
Obtains and parses a robots.txt file from the host identified by source; source can be a URI, a string representing a URI, or a Net::HTTP connection associated with a host.
- .get_allowed?(uri, robot_id) ⇒ Boolean
Fetches the robots.txt file from the host identified by uri (which can be a URI object or a string) and checks whether your robot may access that uri.
- .parse(robotstxt, robot_id) ⇒ Object
Parses the contents of a robots.txt file for the given robot_id.
- .ultimate_scrubber(str) ⇒ Object
Class Method Details
.get(source, robot_id, options = {}) ⇒ Object
Obtains and parses a robots.txt file from the host identified by source; source can be a URI, a string representing a URI, or a Net::HTTP connection associated with a host.
The second parameter should be the user-agent header for your robot.
There are currently three options (a hedged sketch of passing them follows this list):
- :num_redirects (default 5): the maximum number of HTTP 3xx responses whose Location: header the get() method will accept and follow before giving up.
- :http_timeout (default 10): the number of seconds to wait for each request before giving up.
- :url_charset (default "utf8"): the character encoding you will use to encode URLs.
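For instance, a minimal sketch passing these options explicitly (the values shown are simply the documented defaults; the host and robot name are placeholders):

require 'robotstxt'

# Fetch and parse robots.txt, spelling out the documented defaults.
parser = Robotstxt.get("example.com/", "SuperRobot",
                       :num_redirects => 5,
                       :http_timeout  => 10,
                       :url_charset   => "utf8")
parser.allowed? "/index.html"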
As indicated by robotstxt.org, this library treats HTTP Unauthorized (401) and HTTP Forbidden (403) responses as though the robots.txt file denied access to the entire site; all other HTTP responses or errors are treated as though the site allowed all access.
The return value is a Robotstxt::Parser, which you can then interact with by calling .allowed? or .sitemaps, e.g.

Robotstxt.get("example.com/", "SuperRobot").allowed? "/index.html"

Net::HTTP.start("example.com") do |http|
  if Robotstxt.get(http, "SuperRobot").allowed? "/index.html"
    http.get("/index.html")
  end
end
# File 'lib/robotstxt.rb', line 61

def self.get(source, robot_id, options = {})
  self.parse(Getter.new.obtain(source, robot_id, options), robot_id)
end
.get_allowed?(uri, robot_id) ⇒ Boolean
Gets a robots.txt file from the host identified by uri (which can be a URI object or a string), parses it for the given robot_id (which should be your user-agent), and returns true iff your robot can access said uri.
Robotstxt.get_allowed? "www.example.com/good", "SuperRobot"
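A hedged usage sketch (hypothetical host, paths, and robot name). Note that, as the source below shows, each call to get_allowed? fetches robots.txt anew, so when checking many paths on one host it is cheaper to call Robotstxt.get once and reuse the parser:

require 'robotstxt'

if Robotstxt.get_allowed?("www.example.com/good", "SuperRobot")
  # safe to fetch the page
end

# Checking several paths: fetch and parse robots.txt only once.
parser = Robotstxt.get("www.example.com/", "SuperRobot")
["/good", "/better", "/private"].each do |path|
  puts "#{path}: #{parser.allowed?(path)}"
end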
# File 'lib/robotstxt.rb', line 86

def self.get_allowed?(uri, robot_id)
  self.get(uri, robot_id).allowed? uri
end
.parse(robotstxt, robot_id) ⇒ Object
Parses the contents of a robots.txt file for the given robot_id
Returns a Robotstxt::Parser object with methods .allowed? and .sitemaps, e.g.
Robotstxt.parse("User-agent: *\nDisallow: /a", "SuperRobot").allowed? "/b"
# File 'lib/robotstxt.rb', line 72

def self.parse(robotstxt, robot_id)
  Parser.new(robot_id, robotstxt)
end
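A short sketch parsing an inline robots.txt body (the content is illustrative, and the exact return value of .sitemaps is an assumption based on the directives given):

require 'robotstxt'

robotstxt = "User-agent: *\nDisallow: /private\nSitemap: http://example.com/sitemap.xml"
parser = Robotstxt.parse(robotstxt, "SuperRobot")
parser.allowed? "/public"   # => true
parser.allowed? "/private"  # => false
parser.sitemaps             # assumed: ["http://example.com/sitemap.xml"]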
.ultimate_scrubber(str) ⇒ Object
# File 'lib/robotstxt.rb', line 90

def self.ultimate_scrubber(str)
  str.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => '')
end
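Judging from the source above, this helper re-encodes a string to UTF-8 and drops any bytes that are invalid in, or unrepresentable as, UTF-8. A minimal sketch, assuming a binary string containing a stray 0xFF byte:

# The 0xFF byte has no UTF-8 mapping, so it is replaced with ''.
raw = "User-agent: *\xFF".force_encoding("ASCII-8BIT")
Robotstxt.ultimate_scrubber(raw)  # => "User-agent: *"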