A simple web crawler for Ruby.
- Runs single-threaded or multi-threaded.
- Pools HTTP connections.
- Restricts links by URL-based patterns.
- Respects robots.txt.
- Stores page contents via adapters.
Requires Ruby 2.3+.
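To illustrate what the robots.txt support above involves, here is a minimal plain-Ruby sketch (not Kudzu's implementation) that parses `Disallow` rules for the wildcard agent and checks a path against them:

```ruby
# Collect Disallow rules that apply to all user agents ('*').
def disallowed_paths(robots_txt)
  rules = []
  applies = false
  robots_txt.each_line do |line|
    field, value = line.split(':', 2).map { |s| s.to_s.strip }
    case field&.downcase
    when 'user-agent' then applies = (value == '*')
    when 'disallow'   then rules << value if applies && !value.empty?
    end
  end
  rules
end

# A path is crawlable if no Disallow rule is a prefix of it.
def crawlable?(robots_txt, path)
  disallowed_paths(robots_txt).none? { |rule| path.start_with?(rule) }
end

robots = "User-agent: *\nDisallow: /private/\n"
crawlable?(robots, '/private/page.html') # => false
crawlable?(robots, '/public/page.html')  # => true
```

Real robots.txt handling also covers agent-specific groups, `Allow` rules, and wildcards; this sketch only shows the core prefix check.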
Add this line to your application's Gemfile:

```ruby
gem 'kudzu'
```

And then execute:

```
$ bundle install
```
Crawl html files in a site:
```ruby
crawler = Kudzu::Crawler.new do
  user_agent 'YOUR_AWESOME_APP'
  add_filter do
    focus_host true
    allow_mime_type %w(text/html)
  end
end

crawler.run('http://example.com/') do
  on_success do |page, link|
    puts page.url
  end
end
```
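The filter in the example restricts which links get followed. As a rough illustration of the idea (plain Ruby, not Kudzu's implementation), a host-focused, URL-pattern filter boils down to a check like this; `follow_link?` and its `allow:` parameter are hypothetical names:

```ruby
require 'uri'

# Keep a link only if it stays on the starting host and its path
# matches an allow pattern.
def follow_link?(base_url, link_url, allow: /.*/)
  base = URI(base_url)
  link = URI(link_url)
  link.host == base.host && link.path.to_s =~ allow ? true : false
end

follow_link?('http://example.com/', 'http://example.com/docs/a.html', allow: %r{\A/docs/}) # => true
follow_link?('http://example.com/', 'http://other.com/docs/a.html')                        # => false
```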
This gem supports only in-memory crawling by default. Use one of the following adapters to store page contents persistently:
Bug reports and pull requests are welcome on GitHub at https://github.com/kanety/kudzu. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the Contributor Covenant code of conduct.
The gem is available as open source under the terms of the MIT License.