Class: Robotstxt::Parser

Inherits:
Object
Includes:
CommonMethods
Defined in:
lib/robotstxt/parser.rb

Overview

Parses robots.txt files for the perusal of a single user-agent.

The behaviour implemented is guided by the following sources, though as there is no widely accepted standard, it may differ from other implementations. If you consider its behaviour to be in error, please contact the author.

www.robotstxt.org/orig.html
  - the original specification, now imprecise and outdated
www.robotstxt.org/norobots-rfc.txt
  - a much more precise, though still outdated, version
www.google.com/support/webmasters/bin/answer.py?hl=en&answer=156449&from=35237
  - a few hints at modern protocol extensions

This parser only considers lines that start with one of the following directives (matched case-insensitively):

Useragent: User-agent: Allow: Disallow: Sitemap:

The file is divided into sections, each of which contains one or more User-agent: lines, followed by one or more Allow: or Disallow: rules.

The first section containing a User-agent: line that matches the robot's user-agent is the only section relevant to that robot. Sections are checked in the order in which they appear in the file.

(During user-agent matching, the * character is taken to mean “any number of any characters”.)
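For instance, in the hypothetical file below, a robot identifying itself as “FooBot/1.2” matches the first section and uses only its rules, while any other robot falls through to the User-agent: * section:

User-agent: FooBot*
Disallow: /

User-agent: *
Disallow: /private/
Allow: /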

Within that section, the first Allow: or Disallow: rule whose pattern matches the requested path is taken as authoritative. If no rule in the section matches, access is allowed.

(This first-match ordering follows the RFC; by contrast, Google matches all Allows and then all Disallows, while Bing matches the most specific rule, and there are doubtless other interpretations.)

When matching URLs, all %-encodings are normalised (except for the characters /?=&, which carry meaning) and “*” matches any number of any characters.

If a pattern ends with a $, then the pattern must match the entire path, or the entire path with query string.
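As an illustrative sketch of how these matching rules combine (the bot name, rule set and expected results are invented, following the first-match behaviour described above):

require 'robotstxt'

body = <<~ROBOTS
  User-agent: *
  Allow: /public/*.html$
  Disallow: /public/
  Disallow: /*?download=
ROBOTS

parser = Robotstxt::Parser.new("ExampleBot/1.0", body)

parser.allowed?("/public/page.html")  # true:  the Allow pattern matches first and ends at $
parser.allowed?("/public/data.json")  # false: only Disallow: /public/ matches
parser.allowed?("/file?download=1")   # false: * spans "file", then "?download=" matches literally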

Instance Attribute Summary

Instance Method Summary

Constructor Details

#initialize(user_agent, body) ⇒ Parser

Create a new parser for this user_agent and this robots.txt contents.

This assumes that the robots.txt is ready to parse; in particular, that it has been decoded as necessary, including removal of byte-order marks and the like.

Not passing a body is deprecated, but retained for compatibility with clients written for version 0.5.4.



# File 'lib/robotstxt/parser.rb', line 56

def initialize(user_agent, body)
  @robot_id = user_agent
  @found = true
  parse(body) # set @body, @rules and @sitemaps
end
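A minimal sketch of constructing a parser, assuming the robots.txt has already been fetched (the file path and user-agent string are purely illustrative; the "bom|utf-8" encoding asks Ruby to strip a UTF-8 byte-order mark if present):

body = File.read("robots.txt", encoding: "bom|utf-8")
parser = Robotstxt::Parser.new("MyCrawler/1.0", body)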

Instance Attribute Details

#sitemaps ⇒ Object (readonly)

Gets every Sitemap mentioned in the body of the robots.txt file.



# File 'lib/robotstxt/parser.rb', line 46

def sitemaps
  @sitemaps
end
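A sketch of typical use; the sitemap URL is invented, and the return value is assumed to be the list of URLs as they appear in the file:

body = "User-agent: *\nDisallow:\nSitemap: http://example.com/sitemap.xml\n"
parser = Robotstxt::Parser.new("ExampleBot/1.0", body)
parser.sitemaps  # expected: ["http://example.com/sitemap.xml"]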

Instance Method Details

#allowed?(uri) ⇒ Boolean

Given a URI object, or a string representing one, determine whether this robots.txt would allow access to the path.

Returns:

  • (Boolean)


# File 'lib/robotstxt/parser.rb', line 64

def allowed?(uri)
  uri = objectify_uri(uri)
  path = (uri.path || "/") + (uri.query ? '?' + uri.query : '')
  path_allowed?(@robot_id, path)
end
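Both a URI object and a plain string should be accepted, as in this sketch (URLs are invented for illustration):

require 'uri'

parser = Robotstxt::Parser.new("ExampleBot/1.0", "User-agent: *\nDisallow: /private\n")
parser.allowed?(URI("http://example.com/private/index.html"))  # false: matches Disallow: /private
parser.allowed?("/about")                                      # true:  no rule matches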