Spider, a Web spidering library for Ruby. It handles robots.txt, fetching pages, collecting links, and the crawl loop so that you can just handle the data.

Usage

require 'spider'

Spider.start_at('http://mike-burns.com/') do |s|
  # Limit the pages to just this domain.
  s.add_url_check do |a_url|
    a_url =~ %r{^http://mike-burns\.com.*}
  end

  # Handle 404s.
  s.on 404 do |a_url, resp, prior_url|
    puts "URL not found: #{a_url}"
  end

  # Handle 2xx.
  s.on :success do |a_url, resp, prior_url|
    puts "body: #{resp.body}"
  end

  # Handle everything.
  s.on :every do |a_url, resp, prior_url|
    puts "URL returned anything: #{a_url} with this code #{resp.code}"
  end
end
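
The handlers compose, so the same hooks can build, say, a simple broken-link
report. This sketch uses only the API shown above; `broken' and `seen' are
ordinary Ruby variables introduced here for illustration:

require 'spider'

broken = []
seen = 0

Spider.start_at('http://mike-burns.com/') do |s|
  s.add_url_check {|a_url| a_url =~ %r{^http://mike-burns\.com.*} }

  # Remember each 404, along with the page that linked to it.
  s.on 404 do |a_url, resp, prior_url|
    broken << [a_url, prior_url]
  end

  # Count every response, whatever its status code.
  s.on :every do |a_url, resp, prior_url|
    seen += 1
  end
end

puts "Crawled #{seen} URLs, found #{broken.length} broken links."
broken.each {|a_url, from| puts "#{a_url} (linked from #{from})" }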

Requirements

This library uses `robot_rules' (included), `open-uri', and `uri'. Any modern Ruby should work; if yours doesn't, let me know so I can update this with your version number.
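
Spider consults the bundled `robot_rules' for you, so you normally never call
it directly. If you want to test a URL against a site's robots.txt by hand, a
minimal sketch follows, assuming the RobotRules interface from James Edward
Gray II's posting cited below (a user-agent name passed to new, then parse and
allowed?); the user-agent string here is an arbitrary example:

require 'open-uri'
require 'robot_rules'

rules = RobotRules.new('RubySpider/1.0')  # example user-agent name

robots_url = 'http://mike-burns.com/robots.txt'
rules.parse(robots_url, URI.parse(robots_url).open.read)

puts rules.allowed?('http://mike-burns.com/')  # true if crawling is permitted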

Author

Mike Burns <mike@mike-burns.com>, mike-burns.com

With help from Matt Horan and John Nagro. `robot_rules' comes from James Edward Gray II, via blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/177589.