Class: RubyCrawl::RobotsParser

Inherits: Object
Defined in: lib/rubycrawl/robots_parser.rb
Overview
Fetches and parses robots.txt for a given site. Supports User-agent: *, Disallow, Allow, and Crawl-delay directives. Fails open — any fetch/parse error allows all URLs.
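To make the described behavior concrete, here is a minimal, self-contained sketch of a parser with the same surface: it honors `User-agent: *`, `Disallow`, `Allow`, and `Crawl-delay`. The class name `TinyRobots`, the prefix-based matching, and the internal rule storage are illustrative assumptions, not the gem's actual implementation; unlike the gem's `#allowed?`, this sketch takes a path rather than a full URL.

```ruby
# A minimal sketch of robots.txt handling as described in the Overview.
# Assumptions: prefix matching for rules, and only the "User-agent: *"
# group is honored. Not the gem's real implementation.
class TinyRobots
  attr_reader :crawl_delay

  def initialize(content)
    @allow = []
    @disallow = []
    @crawl_delay = nil
    active = false
    content.to_s.each_line do |line|
      field, _, value = line.sub(/#.*/, '').strip.partition(':')
      value = value.strip
      case field.downcase
      when 'user-agent'  then active = (value == '*')
      when 'allow'       then @allow << value if active && !value.empty?
      when 'disallow'    then @disallow << value if active && !value.empty?
      when 'crawl-delay' then @crawl_delay = value.to_f if active
      end
    end
  end

  # Allow rules take precedence over Disallow, mirroring #allowed? below.
  def allowed?(path)
    return true if @allow.any? { |r| path.start_with?(r) }
    return false if @disallow.any? { |r| path.start_with?(r) }
    true
  end
end

robots = TinyRobots.new(<<~TXT)
  User-agent: *
  Disallow: /private/
  Allow: /private/public-report
  Crawl-delay: 2
TXT

robots.allowed?('/private/secret')        # => false
robots.allowed?('/private/public-report') # => true
robots.crawl_delay                        # => 2.0
```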
Class Method Summary

- .fetch(base_url) ⇒ Object
  Fetch robots.txt from base_url and return a parser instance.
Instance Method Summary

- #allowed?(url) ⇒ Boolean
  Returns true if the given URL is allowed to be crawled.
- #crawl_delay ⇒ Object
  Returns the Crawl-delay value in seconds, or nil if not specified.
- #initialize(content) ⇒ RobotsParser (constructor)
  A new instance of RobotsParser.
Constructor Details
#initialize(content) ⇒ RobotsParser
Returns a new instance of RobotsParser.
# File 'lib/rubycrawl/robots_parser.rb', line 26

def initialize(content)
  @rules = parse(content.to_s)
end
Class Method Details
.fetch(base_url) ⇒ Object
Fetch robots.txt from base_url and return a parser instance. Returns a permissive (allow-all) instance on any network error.
# File 'lib/rubycrawl/robots_parser.rb', line 13

def self.fetch(base_url)
  uri = URI.join(base_url, '/robots.txt')
  response = Net::HTTP.start(uri.host, uri.port,
                             use_ssl: uri.scheme == 'https',
                             open_timeout: 5, read_timeout: 5) do |http|
    http.get(uri.request_uri)
  end
  new(response.is_a?(Net::HTTPOK) ? response.body : '')
rescue StandardError
  new('') # network error or invalid URL → allow everything
end
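The fail-open contract can be demonstrated without the gem: any error raised during the fetch, whether an invalid URL, a DNS failure, or a timeout, is rescued and yields empty rules. The helper name `fetch_robots_body` is a hypothetical stand-in for the body-fetching half of `.fetch`.

```ruby
require 'uri'
require 'net/http'

# Illustrative stand-in for the fetch step: returns the robots.txt body,
# or an empty string (allow everything) on any error. Timeouts match the
# 5-second values shown in .fetch above.
def fetch_robots_body(base_url)
  uri = URI.join(base_url, '/robots.txt')
  res = Net::HTTP.start(uri.host, uri.port,
                        use_ssl: uri.scheme == 'https',
                        open_timeout: 5, read_timeout: 5) do |http|
    http.get(uri.request_uri)
  end
  res.is_a?(Net::HTTPOK) ? res.body : ''
rescue StandardError
  '' # fail open: empty rules allow everything
end

fetch_robots_body('not a url') # => "" (the URI error is rescued)
```

Rescuing `StandardError` rather than specific network exceptions is what makes the crawler robust to the long tail of failure modes (TLS errors, redirects to garbage, connection resets) at the cost of also swallowing programming errors inside the block.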
Instance Method Details
#allowed?(url) ⇒ Boolean
Returns true if the given URL is allowed to be crawled.
# File 'lib/rubycrawl/robots_parser.rb', line 31

def allowed?(url)
  path = URI.parse(url).path
  path = '/' if path.nil? || path.empty?
  # Allow rules take precedence over Disallow when both match.
  return true if @rules[:allow].any? { |rule| path_matches?(path, rule) }
  return false if @rules[:disallow].any? { |rule| path_matches?(path, rule) }
  true
rescue URI::InvalidURIError
  true
end
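A subtle step in #allowed? is path normalization: a URL with no path component (e.g. `https://example.com`) parses to an empty path, which must be treated as the root `/` before rule matching. A small sketch of just that step, with the hypothetical helper name `request_path` (the gem's method instead returns `true` outright for unparseable URLs, consistent with its fail-open design):

```ruby
require 'uri'

# Normalize a URL to the path used for rule matching, as in #allowed?.
# Falling back to '/' for unparseable URLs is an illustrative choice here.
def request_path(url)
  path = URI.parse(url).path
  path.nil? || path.empty? ? '/' : path
rescue URI::InvalidURIError
  '/'
end

request_path('https://example.com')     # => "/"
request_path('https://example.com/a/b') # => "/a/b"
```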
#crawl_delay ⇒ Object
Returns the Crawl-delay value in seconds, or nil if not specified.
# File 'lib/rubycrawl/robots_parser.rb', line 45

def crawl_delay
  @rules[:crawl_delay]
end
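A typical caller combines #allowed? and #crawl_delay into a polite fetch loop: skip disallowed URLs, and sleep between requests when a delay is specified. The `StubParser` and `crawl` helper below are illustrative, not part of the gem.

```ruby
# Hedged usage sketch: a crawl loop honoring both directives.
# StubParser stands in for a RobotsParser instance.
class StubParser
  def allowed?(url)
    !url.include?('/admin')
  end

  def crawl_delay
    nil # site specified no Crawl-delay
  end
end

def crawl(urls, robots)
  fetched = []
  urls.each do |url|
    next unless robots.allowed?(url)
    fetched << url                 # stand-in for an HTTP fetch
    sleep(robots.crawl_delay || 0) # honor Crawl-delay between requests
  end
  fetched
end

crawl(['https://s.test/', 'https://s.test/admin'], StubParser.new)
# => ["https://s.test/"]
```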