MechWarrior

MechWarrior is a site crawler powered by Mechanize and Celluloid. It generates a JSON file listing every page it finds, the links on each page, and the assets those pages rely on. It can optionally also generate an XML sitemap compliant with the Sitemaps 0.9 protocol.

Version

0.0.1

Tech

MechWarrior relies on several excellent RubyGems:

  • Mechanize - a Ruby library that makes automated web interaction easy
  • Celluloid - an Actor-based concurrent object framework for Ruby
  • XML-Sitemap - provides easy XML sitemap generation for Ruby/Rails/Merb/Sinatra applications
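For orientation, a Sitemaps 0.9 document is just a urlset element in the sitemaps.org namespace containing one url/loc entry per page. The sketch below builds one with plain Ruby strings so the format is visible. It is an illustration only: build_sitemap and XML_ESCAPES are hypothetical names, and this is not how MechWarrior or the XML-Sitemap gem produce their output.

```ruby
# Characters that must be escaped inside XML text nodes.
XML_ESCAPES = {
  "&" => "&amp;", "<" => "&lt;", ">" => "&gt;",
  '"' => "&quot;", "'" => "&apos;"
}.freeze

# Hypothetical helper (not part of MechWarrior): turn a list of absolute
# URLs into a minimal Sitemaps 0.9 document, dropping duplicates.
def build_sitemap(urls)
  entries = urls.uniq.map do |url|
    escaped = url.gsub(/[&<>"']/, XML_ESCAPES)
    "  <url>\n    <loc>#{escaped}</loc>\n  </url>"
  end

  [
    %(<?xml version="1.0" encoding="UTF-8"?>),
    %(<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">),
    *entries,
    "</urlset>"
  ].join("\n")
end

puts build_sitemap(["http://example.com/", "http://example.com/search?q=a&page=2"])
```

The protocol also allows optional lastmod, changefreq, and priority elements per url entry, which are omitted here for brevity.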

Installation

gem install mech_warrior-0.0.1.gem

Crawling a site

bin/spider

Run the script, then enter a host name, followed by any additional options that should override the defaults in lib/mech_warrior.rb.
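Overriding defaults in this way is commonly done by merging a hash of user-supplied options over a hash of defaults. The sketch below shows that pattern under loud assumptions: DEFAULT_OPTIONS, crawl_options, generate_sitemap, and output_file are hypothetical names invented for illustration, and only allowed_hosts is a name that appears in this README. Check lib/mech_warrior.rb for the real option names and defaults.

```ruby
# Hypothetical defaults; the real ones live in lib/mech_warrior.rb.
DEFAULT_OPTIONS = {
  allowed_hosts: [],          # name taken from this README's Todo section
  generate_sitemap: false,    # hypothetical option name
  output_file: "crawl.json"   # hypothetical option name
}.freeze

# Hypothetical helper: user-supplied values win over the defaults, and
# the host being crawled is always included in allowed_hosts.
def crawl_options(host, overrides = {})
  options = DEFAULT_OPTIONS.merge(overrides)
  options.merge(allowed_hosts: ([host] + options[:allowed_hosts]).uniq)
end

p crawl_options("example.com", generate_sitemap: true)
```

Because Hash#merge returns a new hash, the frozen defaults are never modified, so repeated crawls start from the same baseline.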

Todo

Some of the functionality, including XML sitemap generation, is untested. Support for multiple hosts in a single spider is currently incomplete despite the 'allowed_hosts' array: it works only when every host other than the default one has exclusively absolute links to follow.
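One way the relative-link limitation could be addressed is to resolve each href against the URL of the page it was found on, rather than against the default host, before checking it against allowed_hosts. The following is a minimal sketch of that idea using Ruby's standard URI library. resolve_link is a hypothetical helper, not existing MechWarrior code, and the host names are placeholders.

```ruby
require "uri"

# Hypothetical helper (not part of MechWarrior): resolve an href against
# the page it appeared on, then keep it only if the resulting host is
# listed in allowed_hosts. Returns the absolute URL string or nil.
def resolve_link(page_url, href, allowed_hosts)
  absolute = URI.join(page_url, href)
  # URI::HTTPS is a subclass of URI::HTTP, so this keeps http and https
  # links and skips schemes such as mailto:.
  return nil unless absolute.is_a?(URI::HTTP)

  allowed_hosts.include?(absolute.host) ? absolute.to_s : nil
rescue URI::Error
  nil # malformed hrefs are ignored rather than aborting the crawl
end

allowed_hosts = ["example.com", "blog.example.com"]
puts resolve_link("http://blog.example.com/posts/1", "/about", allowed_hosts)
```

With this approach a relative link found on a non-default host stays on that host, so absolute and relative links are treated the same way.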

License

MIT