Class: RubyCrawl::RobotsParser

Inherits:
Object
Defined in:
lib/rubycrawl/robots_parser.rb

Overview

Fetches and parses robots.txt for a given site. Supports User-agent: *, Disallow, Allow, and Crawl-delay directives. Fails open — any fetch/parse error allows all URLs.
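For reference, a robots.txt using only the directives this parser recognizes might look like the following (paths are illustrative):

```
User-agent: *
Disallow: /private/
Allow: /private/report
Crawl-delay: 2
```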

Class Method Summary

Instance Method Summary

Constructor Details

#initialize(content) ⇒ RobotsParser

Returns a new instance of RobotsParser.



# File 'lib/rubycrawl/robots_parser.rb', line 26

def initialize(content)
  @rules = parse(content.to_s)
end

Class Method Details

.fetch(base_url) ⇒ Object

Fetch robots.txt from base_url and return a parser instance. Returns a permissive (allow-all) instance on any network error.



# File 'lib/rubycrawl/robots_parser.rb', line 13

def self.fetch(base_url)
  uri = URI.join(base_url, '/robots.txt')
  response = Net::HTTP.start(uri.host, uri.port,
                             use_ssl:      uri.scheme == 'https',
                             open_timeout: 5,
                             read_timeout: 5) do |http|
    http.get(uri.request_uri)
  end
  new(response.is_a?(Net::HTTPOK) ? response.body : '')
rescue StandardError
  new('') # network error or invalid URL → allow everything
end
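Note that `URI.join` resolves the absolute path `/robots.txt` against the root of the host, so any page URL on the site works as base_url, not just the bare origin. A small standalone illustration (the URLs are examples):

```ruby
require 'uri'

# URI.join resolves the absolute-path reference '/robots.txt' against the
# root of base_url's host, discarding any existing path, so deep page URLs
# and bare origins both yield the same robots.txt location.
base_urls   = ['https://example.com', 'https://example.com/blog/post/42']
robots_urls = base_urls.map { |base| URI.join(base, '/robots.txt').to_s }
```

Both entries in `robots_urls` resolve to `https://example.com/robots.txt`.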

Instance Method Details

#allowed?(url) ⇒ Boolean

Returns true if the given URL is allowed to be crawled.

Returns:

  • (Boolean)


# File 'lib/rubycrawl/robots_parser.rb', line 31

def allowed?(url)
  path = URI.parse(url).path
  path = '/' if path.nil? || path.empty?

  # Allow rules take precedence over Disallow when both match.
  return true if @rules[:allow].any? { |rule| path_matches?(path, rule) }
  return false if @rules[:disallow].any? { |rule| path_matches?(path, rule) }

  true
rescue URI::InvalidURIError
  true
end
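The private `parse` and `path_matches?` helpers are not shown in this documentation, so the sketch below is a standalone illustration of the documented matching semantics (a `User-agent: *` group, simple prefix matching assumed for `path_matches?`, Allow taking precedence over Disallow, and fail-open on unparseable URLs), not the gem's actual implementation:

```ruby
require 'uri'

# Illustrative robots.txt content exercising all four supported directives.
SAMPLE_ROBOTS = <<~TXT
  User-agent: *
  Disallow: /private/
  Allow: /private/public-report
  Crawl-delay: 2
TXT

# Sketch of a parser for the User-agent: * group. The gem's real private
# parse helper may differ; this only mirrors the documented behavior.
def parse_rules(content)
  rules  = { allow: [], disallow: [], crawl_delay: nil }
  active = false
  content.each_line do |line|
    directive, _, value = line.strip.partition(':')
    value = value.strip
    case directive.downcase
    when 'user-agent'  then active = (value == '*')
    when 'disallow'    then rules[:disallow] << value if active && !value.empty?
    when 'allow'       then rules[:allow]    << value if active && !value.empty?
    when 'crawl-delay' then rules[:crawl_delay] = value.to_f if active
    end
  end
  rules
end

# Mirrors #allowed?: Allow wins over Disallow, prefix matching assumed,
# and an unparseable URL fails open to true.
def allowed?(url, rules)
  path = URI.parse(url).path
  path = '/' if path.nil? || path.empty?
  return true  if rules[:allow].any?    { |r| path.start_with?(r) }
  return false if rules[:disallow].any? { |r| path.start_with?(r) }
  true
rescue URI::InvalidURIError
  true
end
```

With `rules = parse_rules(SAMPLE_ROBOTS)`, a URL under `/private/` is blocked, but `/private/public-report` is allowed because the matching Allow rule takes precedence.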

#crawl_delay ⇒ Object

Returns the Crawl-delay value in seconds, or nil if not specified.



# File 'lib/rubycrawl/robots_parser.rb', line 45

def crawl_delay
  @rules[:crawl_delay]
end
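Since the return value may be nil, callers typically fall back to their own default pause between requests. A hypothetical helper (`pause_between_requests` is illustrative, not part of the gem's API):

```ruby
# Hypothetical politeness helper: honor Crawl-delay when robots.txt
# specifies one, otherwise fall back to a default pause in seconds.
def pause_between_requests(crawl_delay, default: 1.0)
  crawl_delay || default # nil means robots.txt set no delay
end
```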