Class: Robotstxt::Parser

Inherits: Object
Includes: CommonMethods
Defined in: lib/robotstxt/parser.rb
Overview
Parses robots.txt files for the perusal of a single user-agent.
The behaviour implemented is guided by the following sources, though as there is no widely accepted standard, it may differ from other implementations. If you consider its behaviour to be in error, please contact the author.
- the original, now imprecise and outdated, version
- www.robotstxt.org/norobots-rfc.txt (a much more precise, though still outdated, version)
- www.google.com/support/webmasters/bin/answer.py?hl=en&answer=156449&from=35237 (a few hints at modern protocol extensions)
This parser only considers lines starting with one of the following (matched case-insensitively):
Useragent: User-agent: Allow: Disallow: Sitemap:
The file is divided into sections, each of which contains one or more User-agent: lines, followed by one or more Allow: or Disallow: rules.
The first section containing a User-agent: line that matches the robot’s user-agent is the only section relevant to that robot. Sections are checked in the order in which they appear in the file.
(The * character is taken to mean “any number of any characters” during matching of user-agents.)
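As an illustration, this user-agent matching and first-matching-section selection could be sketched as follows (the data structures and helper names here are assumptions for illustration, not the gem's internals):

```ruby
# Illustrative sketch only: select the first section whose User-agent:
# line matches the robot's user-agent, with "*" meaning "any number of
# any characters" and matching done case-insensitively.
def agent_matches?(ua_pattern, robot_id)
  parts = ua_pattern.split("*", -1).map { |p| Regexp.escape(p) }
  Regexp.new(parts.join(".*"), Regexp::IGNORECASE).match?(robot_id)
end

# Hypothetical parsed sections, in file order.
sections = [
  { agents: ["Googlebot"], rules: [[:disallow, "/"]] },
  { agents: ["*"],         rules: [[:allow, "/"]] },
]

# Only the first matching section is consulted for this robot;
# here "Googlebot" does not match "MyCrawler/1.0", so the "*" section wins.
chosen = sections.find { |s| s[:agents].any? { |a| agent_matches?(a, "MyCrawler/1.0") } }
```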
Within that section, the first Allow: or Disallow: rule whose pattern matches the URL’s path is taken as authoritative. If no rule in the section matches, access is allowed.
(This order of matching follows the RFC; Google matches all Allows and then all Disallows, while Bing matches the most specific rule, and there are doubtless other interpretations.)
When matching URLs, all % encodings are normalised (except for /?=&, which have meaning) and “*” matches any number of any characters.
If a pattern ends with a $, then the pattern must match the entire path, or the entire path with query string.
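The wildcard and anchor semantics above can be sketched as a translation into a Regexp (an illustrative reimplementation, not the gem's actual code):

```ruby
# Illustrative sketch only: turn a robots.txt path pattern into a Regexp,
# with "*" matching any number of any characters and a trailing "$"
# anchoring the pattern to the end of the path (or path plus query string).
def pattern_to_regexp(pattern)
  anchored = pattern.end_with?("$")
  pattern = pattern.chomp("$")
  # Escape regex metacharacters, then restore "*" as ".*".
  body = pattern.split("*", -1).map { |part| Regexp.escape(part) }.join(".*")
  Regexp.new("\\A" + body + (anchored ? "\\z" : ""))
end

pattern_to_regexp("/private*").match?("/private/data")  # true
pattern_to_regexp("/*.php$").match?("/index.php")       # true
pattern_to_regexp("/*.php$").match?("/index.php?x=1")   # false: "$" must match the very end
```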
Instance Attribute Summary
-
#sitemaps ⇒ Object
readonly
Gets every Sitemap mentioned in the body of the robots.txt file.
Instance Method Summary
-
#allowed?(uri) ⇒ Boolean
Given a URI object, or a string representing one, determine whether this robots.txt would allow access to the path.
-
#initialize(user_agent, body) ⇒ Parser
constructor
Create a new parser for this user_agent and this robots.txt contents.
Constructor Details
#initialize(user_agent, body) ⇒ Parser
Create a new parser for this user_agent and this robots.txt contents.
This assumes that the robots.txt is ready to parse; in particular, that it has been decoded as necessary, including the removal of byte-order marks and the like.
Not passing a body is deprecated, but retained for compatibility with clients written for version 0.5.4.
# File 'lib/robotstxt/parser.rb', line 56

def initialize(user_agent, body)
  @robot_id = user_agent
  @found = true
  parse(body) # set @body, @rules and @sitemaps
end
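Since the constructor expects a decoded, ready-to-parse body, a caller might, for example, strip a leading UTF-8 byte-order mark first (strip_bom is a hypothetical helper, not part of the gem):

```ruby
# Hypothetical helper: remove a leading UTF-8 byte-order mark, if any,
# so the first "User-agent:" line is recognised by the parser.
def strip_bom(body)
  body.delete_prefix("\u{FEFF}")
end

strip_bom("\u{FEFF}User-agent: *")  # => "User-agent: *"
```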
Instance Attribute Details
#sitemaps ⇒ Object (readonly)
Gets every Sitemap mentioned in the body of the robots.txt file.
# File 'lib/robotstxt/parser.rb', line 46

def sitemaps
  @sitemaps
end
Instance Method Details
#allowed?(uri) ⇒ Boolean
Given a URI object, or a string representing one, determine whether this robots.txt would allow access to the path.
# File 'lib/robotstxt/parser.rb', line 64

def allowed?(uri)
  uri = objectify_uri(uri)
  path = (uri.path || "/") + (uri.query ? '?' + uri.query : '')
  path_allowed?(@robot_id, path)
end