Module: Robotstxt
- Defined in:
- lib/robotstxt.rb,
lib/robotstxt/common.rb,
lib/robotstxt/getter.rb,
lib/robotstxt/parser.rb
Overview
Provides a flexible interface to help authors of web-crawlers respect the robots.txt exclusion standard.
Defined Under Namespace
Modules: CommonMethods
Classes: Getter, Parser
Constant Summary
- NAME = 'Robotstxt'
- GEM = 'robotstxt'
- AUTHORS = ['Conrad Irwin <[email protected]>', 'Simone Rinzivillo <[email protected]>']
- VERSION = '1.0'
Class Method Summary
- .get(source, robot_id, options = {}) ⇒ Object
Obtains and parses a robots.txt file from the host identified by source; source can be a URI, a string representing a URI, or a Net::HTTP connection associated with a host.
- .get_allowed?(uri, robot_id) ⇒ Boolean
Fetches the robots.txt file from the host identified by uri (which can be a URI object or a string) and checks whether your robot may access that uri.
- .parse(robotstxt, robot_id) ⇒ Object
Parses the contents of a robots.txt file for the given robot_id.
- .ultimate_scrubber(str) ⇒ Object
Class Method Details
.get(source, robot_id, options = {}) ⇒ Object
Obtains and parses a robots.txt file from the host identified by source; source can be a URI, a string representing a URI, or a Net::HTTP connection associated with a host.
The second parameter should be the user-agent header for your robot.
There are currently three options (a hedged sketch of passing them follows this list):
- :num_redirects (default 5): the maximum number of HTTP 3xx responses whose Location: header the get() method will accept and follow before giving up.
- :http_timeout (default 10): the number of seconds to wait for each request before giving up.
- :url_charset (default "utf8"): the character encoding you will use to encode URLs.
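For instance, a minimal sketch passing these options explicitly (the values shown are simply the documented defaults; the host and robot name are placeholders):

require 'robotstxt'

# Fetch and parse robots.txt, spelling out the documented defaults.
parser = Robotstxt.get("example.com/", "SuperRobot",
                       :num_redirects => 5,
                       :http_timeout  => 10,
                       :url_charset   => "utf8")
parser.allowed? "/index.html"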
As indicated by robotstxt.org, this library treats HTTP Unauthorized (401) and HTTP Forbidden (403) responses as though the robots.txt file denied access to the entire site; all other HTTP responses or errors are treated as though the site allowed all access.
The return value is a Robotstxt::Parser, which you can then interact with by calling .allowed? or .sitemaps, e.g.

Robotstxt.get("example.com/", "SuperRobot").allowed? "/index.html"

Net::HTTP.start("example.com") do |http|
  if Robotstxt.get(http, "SuperRobot").allowed? "/index.html"
    http.get("/index.html")
  end
end
# File 'lib/robotstxt.rb', line 61

def self.get(source, robot_id, options = {})
  self.parse(Getter.new.obtain(source, robot_id, options), robot_id)
end
.get_allowed?(uri, robot_id) ⇒ Boolean
Gets a robots.txt file from the host identified by uri (which can be a URI object or a string), parses it for the given robot_id (which should be your user-agent), and returns true iff your robot can access said uri.
Robotstxt.get_allowed? "www.example.com/good", "SuperRobot"
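A hedged usage sketch (hypothetical host, paths, and robot name). Note that, as the source below shows, each call to get_allowed? fetches robots.txt anew, so when checking many paths on one host it is cheaper to call Robotstxt.get once and reuse the parser:

require 'robotstxt'

if Robotstxt.get_allowed?("www.example.com/good", "SuperRobot")
  # safe to fetch the page
end

# Checking several paths: fetch and parse robots.txt only once.
parser = Robotstxt.get("www.example.com/", "SuperRobot")
["/good", "/better", "/private"].each do |path|
  puts "#{path}: #{parser.allowed?(path)}"
end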
# File 'lib/robotstxt.rb', line 86

def self.get_allowed?(uri, robot_id)
  self.get(uri, robot_id).allowed? uri
end
.parse(robotstxt, robot_id) ⇒ Object
Parses the contents of a robots.txt file for the given robot_id
Returns a Robotstxt::Parser object with methods .allowed? and .sitemaps, e.g.
Robotstxt.parse("User-agent: *\nDisallow: /a", "SuperRobot").allowed? "/b"
# File 'lib/robotstxt.rb', line 72

def self.parse(robotstxt, robot_id)
  Parser.new(robot_id, robotstxt)
end
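A short sketch parsing an inline robots.txt body (the content is illustrative, and the exact return value of .sitemaps is an assumption based on the directives given):

require 'robotstxt'

robotstxt = "User-agent: *\nDisallow: /private\nSitemap: http://example.com/sitemap.xml"
parser = Robotstxt.parse(robotstxt, "SuperRobot")
parser.allowed? "/public"   # => true
parser.allowed? "/private"  # => false
parser.sitemaps             # assumed: ["http://example.com/sitemap.xml"]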
.ultimate_scrubber(str) ⇒ Object
# File 'lib/robotstxt.rb', line 90

def self.ultimate_scrubber(str)
  str.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => '')
end
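Judging from the source above, this helper re-encodes a string to UTF-8 and drops any bytes that are invalid in, or unrepresentable as, UTF-8. A minimal sketch, assuming a binary string containing a stray 0xFF byte:

# The 0xFF byte has no UTF-8 mapping, so it is replaced with ''.
raw = "User-agent: *\xFF".force_encoding("ASCII-8BIT")
Robotstxt.ultimate_scrubber(raw)  # => "User-agent: *"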