Class: SiteMapper::Robots::ParsedRobots
- Inherits: Object
- Defined in: lib/site_mapper/robots.rb
Overview
Parses robots.txt
Instance Method Summary
- #allowed?(uri, user_agent) ⇒ Boolean
  True if uri is allowed to be crawled.
- #crawl_delay(user_agent) ⇒ Integer
  Crawl delay for user_agent.
- #initialize(body, user_agent) ⇒ ParsedRobots (constructor)
  Initializes ParsedRobots.
- #other_values ⇒ Hash
  Returns key/value pairs with unknown meaning.
- #parse(body) ⇒ Object
  Parses the robots.txt body.
- #sitemaps ⇒ Array
  Returns sitemaps defined in robots.txt.
Constructor Details
#initialize(body, user_agent) ⇒ ParsedRobots
Initializes ParsedRobots.
# File 'lib/site_mapper/robots.rb', line 10

def initialize(body, user_agent)
  @other     = {}
  @disallows = {}
  @allows    = {}
  @delays    = {}
  @sitemaps  = []
  parse(body)
end
Instance Method Details
#allowed?(uri, user_agent) ⇒ Boolean
Returns true if uri is allowed to be crawled.
# File 'lib/site_mapper/robots.rb', line 59

def allowed?(uri, user_agent)
  return true unless @parsed
  allowed = true
  path = uri.request_uri
  user_agent.downcase!
  @disallows.each do |key, value|
    if user_agent =~ key
      value.each do |rule|
        if path =~ rule
          allowed = false
        end
      end
    end
  end
  @allows.each do |key, value|
    unless allowed
      if user_agent =~ key
        value.each do |rule|
          if path =~ rule
            allowed = true
          end
        end
      end
    end
  end
  allowed
end
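The precedence implemented above is: a path is disallowed when it matches any Disallow rule for a matching user agent, unless an Allow rule re-permits it. A minimal self-contained sketch of that logic, using hand-built rule regexes in place of what the class's private `to_regex` helper (not shown here) would produce:

```ruby
# Rules keyed by a user-agent regex, as in @disallows / @allows.
# The rule regexes are hypothetical stand-ins for to_regex output.
disallows = { /.*/ => [%r{^/private/.*}] }
allows    = { /.*/ => [%r{^/private/public/.*}] }

def allowed?(path, user_agent, disallows, allows)
  allowed = true
  # Any matching Disallow rule blocks the path...
  disallows.each do |agent, rules|
    next unless user_agent =~ agent
    allowed = false if rules.any? { |rule| path =~ rule }
  end
  return allowed if allowed
  # ...unless an Allow rule for the same agent re-permits it.
  allows.each do |agent, rules|
    next unless user_agent =~ agent
    allowed = true if rules.any? { |rule| path =~ rule }
  end
  allowed
end

puts allowed?('/private/secret', 'mybot', disallows, allows)      # false
puts allowed?('/private/public/page', 'mybot', disallows, allows) # true
puts allowed?('/index.html', 'mybot', disallows, allows)          # true
```

Note that the real method takes a URI object and matches on `uri.request_uri`; the sketch takes the path string directly.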
#crawl_delay(user_agent) ⇒ Integer
Returns crawl delay for user_agent.
# File 'lib/site_mapper/robots.rb', line 92

def crawl_delay(user_agent)
  agent = user_agent.dup
  agent = to_regex(agent.downcase) if user_agent.is_a?(String)
  @delays[agent]
end
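The delays hash is keyed by user-agent regexes, so a String agent must first be converted to the same regex form before the lookup; identical Ruby regexes hash to the same key. A small sketch, where `to_regex` is an assumed stand-in for the class's private helper:

```ruby
# Assumed approximation of the private to_regex helper:
# escape the value, then let '*' match anything.
def to_regex(value)
  Regexp.new("^#{Regexp.escape(value).gsub(Regexp.escape('*'), '.*')}")
end

# Delays keyed by agent regex, as @delays would be after parsing.
delays = { to_regex('*') => 10 }

agent = '*'
agent = to_regex(agent.downcase) if agent.is_a?(String)
puts delays[agent] # 10
```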
#other_values ⇒ Hash
Returns key/value pairs with unknown meaning.
# File 'lib/site_mapper/robots.rb', line 100

def other_values
  @other
end
#parse(body) ⇒ Object
Parses the robots.txt body.
# File 'lib/site_mapper/robots.rb', line 21

def parse(body)
  agent = /.*/
  body = body || "User-agent: *\nAllow: /\n"
  body = body.downcase
  body.each_line.each do |line|
    next if line =~ /^\s*(#.*|$)/
    arr = line.split(':')
    key = arr.shift
    value = arr.join(':').strip
    value.strip!
    case key
    when 'user-agent'
      agent = to_regex(value)
    when 'allow'
      @allows[agent] ||= []
      @allows[agent] << to_regex(value)
    when 'disallow'
      @disallows[agent] ||= []
      @disallows[agent] << to_regex(value)
    when 'crawl-delay'
      @delays[agent] = value.to_i
    when 'sitemap'
      @sitemaps << value
    else
      @other[key] ||= []
      @other[key] << value
    end
  end
  @parsed = true
end
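The parse loop above can be exercised standalone: downcase the body, skip comments and blank lines, split each line on the first colon, and dispatch on the directive name. The sketch below reproduces that loop over a sample robots.txt; `to_regex` is an assumed approximation of the class's private helper, and the sample body is invented for illustration.

```ruby
# Assumed approximation of the private to_regex helper.
def to_regex(value)
  Regexp.new("^#{Regexp.escape(value).gsub(Regexp.escape('*'), '.*')}")
end

disallows = {}
delays    = {}
sitemaps  = []
agent     = /.*/

body = <<~ROBOTS
  User-agent: *
  Disallow: /private/*
  Crawl-delay: 10
  Sitemap: http://example.com/sitemap.xml
ROBOTS

body.downcase.each_line do |line|
  next if line =~ /^\s*(#.*|$)/   # skip comments and blank lines
  arr = line.split(':')
  key = arr.shift
  value = arr.join(':').strip     # re-join so URLs keep their colons
  case key
  when 'user-agent'  then agent = to_regex(value)
  when 'disallow'    then (disallows[agent] ||= []) << to_regex(value)
  when 'crawl-delay' then delays[agent] = value.to_i
  when 'sitemap'     then sitemaps << value
  end
end

puts disallows.values.flatten.any? { |r| '/private/data' =~ r } # true
puts delays.values.first                                        # 10
puts sitemaps.first # http://example.com/sitemap.xml
```

Note that because the whole body is downcased before parsing, sitemap URLs with uppercase characters come out lowercased.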
#sitemaps ⇒ Array
Returns sitemaps defined in robots.txt.
# File 'lib/site_mapper/robots.rb', line 105

def sitemaps
  @sitemaps
end