Class: RubyCrawl::RobotsParser

Inherits:
Object
Defined in:
lib/rubycrawl/robots_parser.rb

Overview

Fetches and parses robots.txt for a given site. Supports User-agent: *, Disallow, Allow, and Crawl-delay directives. Fails open — any fetch/parse error allows all URLs.
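For reference, a robots.txt using only the directives this parser recognizes might look like the following (paths are illustrative):

```
User-agent: *
Disallow: /private/
Allow: /private/report
Crawl-delay: 2
```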

Class Method Summary

Instance Method Summary

Constructor Details

#initialize(content) ⇒ RobotsParser

Returns a new instance of RobotsParser.



# File 'lib/rubycrawl/robots_parser.rb', line 26

def initialize(content)
  @rules = parse(content.to_s)
end

Class Method Details

.fetch(base_url) ⇒ Object

Fetch robots.txt from base_url and return a parser instance. Returns a permissive (allow-all) instance on any network error.



# File 'lib/rubycrawl/robots_parser.rb', line 13

def self.fetch(base_url)
  uri = URI.join(base_url, '/robots.txt')
  response = Net::HTTP.start(uri.host, uri.port,
                             use_ssl:      uri.scheme == 'https',
                             open_timeout: 5,
                             read_timeout: 5) do |http|
    http.get(uri.request_uri)
  end
  new(response.is_a?(Net::HTTPOK) ? response.body : '')
rescue StandardError
  new('') # network error or invalid URL → allow everything
end
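Note that `URI.join` resolves the absolute path `/robots.txt` against the root of the host, so any page URL on the site works as base_url, not just the bare origin. A small standalone illustration (the URLs are examples):

```ruby
require 'uri'

# URI.join resolves the absolute-path reference '/robots.txt' against the
# root of base_url's host, discarding any existing path, so deep page URLs
# and bare origins both yield the same robots.txt location.
base_urls   = ['https://example.com', 'https://example.com/blog/post/42']
robots_urls = base_urls.map { |base| URI.join(base, '/robots.txt').to_s }
```

Both entries in `robots_urls` resolve to `https://example.com/robots.txt`.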

Instance Method Details

#allowed?(url) ⇒ Boolean

Returns true if the given URL is allowed to be crawled.

Returns:

  • (Boolean)


# File 'lib/rubycrawl/robots_parser.rb', line 31

def allowed?(url)
  path = URI.parse(url).path
  path = '/' if path.nil? || path.empty?

  # Allow rules take precedence over Disallow when both match.
  return true if @rules[:allow].any? { |rule| path_matches?(path, rule) }
  return false if @rules[:disallow].any? { |rule| path_matches?(path, rule) }

  true
rescue URI::InvalidURIError
  true
end
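The private `parse` and `path_matches?` helpers are not shown in this documentation, so the sketch below is a standalone illustration of the documented matching semantics (a `User-agent: *` group, simple prefix matching assumed for `path_matches?`, Allow taking precedence over Disallow, and fail-open on unparseable URLs), not the gem's actual implementation:

```ruby
require 'uri'

# Illustrative robots.txt content exercising all four supported directives.
SAMPLE_ROBOTS = <<~TXT
  User-agent: *
  Disallow: /private/
  Allow: /private/public-report
  Crawl-delay: 2
TXT

# Sketch of a parser for the User-agent: * group. The gem's real private
# parse helper may differ; this only mirrors the documented behavior.
def parse_rules(content)
  rules  = { allow: [], disallow: [], crawl_delay: nil }
  active = false
  content.each_line do |line|
    directive, _, value = line.strip.partition(':')
    value = value.strip
    case directive.downcase
    when 'user-agent'  then active = (value == '*')
    when 'disallow'    then rules[:disallow] << value if active && !value.empty?
    when 'allow'       then rules[:allow]    << value if active && !value.empty?
    when 'crawl-delay' then rules[:crawl_delay] = value.to_f if active
    end
  end
  rules
end

# Mirrors #allowed?: Allow wins over Disallow, prefix matching assumed,
# and an unparseable URL fails open to true.
def allowed?(url, rules)
  path = URI.parse(url).path
  path = '/' if path.nil? || path.empty?
  return true  if rules[:allow].any?    { |r| path.start_with?(r) }
  return false if rules[:disallow].any? { |r| path.start_with?(r) }
  true
rescue URI::InvalidURIError
  true
end
```

With `rules = parse_rules(SAMPLE_ROBOTS)`, a URL under `/private/` is blocked, but `/private/public-report` is allowed because the matching Allow rule takes precedence.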

#crawl_delay ⇒ Object

Returns the Crawl-delay value in seconds, or nil if not specified.



# File 'lib/rubycrawl/robots_parser.rb', line 45

def crawl_delay
  @rules[:crawl_delay]
end
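Since the return value may be nil, callers typically fall back to their own default pause between requests. A hypothetical helper (`pause_between_requests` is illustrative, not part of the gem's API):

```ruby
# Hypothetical politeness helper: honor Crawl-delay when robots.txt
# specifies one, otherwise fall back to a default pause in seconds.
def pause_between_requests(crawl_delay, default: 1.0)
  crawl_delay || default # nil means robots.txt set no delay
end
```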