Saper

Saper is a web automation library written in Ruby. It lets you crawl websites and extract data in an efficient, controllable and fault-tolerant manner.

Common use cases:

  • scrape a website and save data in a structured format
  • create an RSS feed for a website or a web application
  • create an API for a website that doesn't have one

Installing

Make sure you have Ruby and RubyGems, then run:

gem install saper

Recipes

A recipe is the core element of Saper. It is a chain of actions run consecutively, so that each action processes the output of the preceding action. You can create recipes by instantiating Ruby classes, but using the embedded DSL is the preferred method.

Here's a recipe that produces a list of recent Bloomberg articles in the 'worldwide news' section (note that this data is unavailable via RSS):

recipe :bloomberg do
  set_input "http://www.bloomberg.com/news/worldwide"
  fetch
  convert_to_html
  find ".news_item a"
  get_attribute "href"
  prepend_with "http://bloomberg.com"
end

Assuming the recipe is saved as myrecipe.txt, you can run it from the command line:

$ saper myrecipe.txt -recipe bloomberg

Alternatively, you can use Ruby:

#!/usr/bin/env ruby
require "saper"

# Load the recipe file, run the :bloomberg recipe and serialize its output
Saper.run_from_file("myrecipe.txt", :bloomberg).serialize

Data flow

Data flows from one action to the next: the output of each action becomes the input of the following one. Using the example above:

set_input "http://www.bloomberg.com/news/worldwide"
> String
fetch
> Document
convert_to_html
> HTML
find ".news_item a"
> [HTML, HTML, ... ]
get_attribute "href"
> [String, String, ... ]
prepend_with "http://bloomberg.com"
> [String, String, ... ]

Whenever an action returns multiple results (e.g. find returns multiple HTML nodes), the following action runs once for each result. For instance, if find returns 10 elements, get_attribute will run 10 times and produce 10 elements.

If an action fails (e.g. a link has no href attribute), Saper will silently skip it and proceed with the rest. All errors are logged and available for subsequent inspection, but no error will ever stop the execution of a recipe -- this is the core idea behind Saper.
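
The fan-out and skip-on-failure behavior described above can be modeled in a few lines of plain Ruby. This is only an illustrative sketch and not Saper's implementation: the PipelineSketch class, its errors list, and the lambdas standing in for find, get_attribute and prepend_with are all hypothetical names made up for this example.

```ruby
# Illustrative model only: PipelineSketch is a hypothetical name, not a Saper class.
class PipelineSketch
  attr_reader :errors

  def initialize(*actions)
    @actions = actions  # each action is a lambda that takes a single input
    @errors  = []       # failures are collected here instead of being raised
  end

  def run(input)
    @actions.each_with_index.reduce([input]) do |inputs, (action, step)|
      # Run the action once per input; every output feeds the next action.
      inputs.flat_map do |item|
        begin
          output = action.call(item)
          output.is_a?(Array) ? output : [output]
        rescue StandardError => e
          @errors << "step #{step}: #{e.message}"
          []  # drop only the failing item, keep processing the rest
        end
      end
    end
  end
end

# Stand-in data: three "links", one of which has no href.
links = [{ "href" => "/news/a" }, {}, { "href" => "/news/b" }]

pipeline = PipelineSketch.new(
  ->(_url) { links },                          # plays the role of find
  ->(link) { link.fetch("href") },             # plays the role of get_attribute
  ->(path) { "http://bloomberg.com" + path }   # plays the role of prepend_with
)

result = pipeline.run("http://www.bloomberg.com/news/worldwide")
puts result.inspect
puts pipeline.errors.length
```

In this model the link without an href raises a KeyError, which is recorded in errors while the other two links continue through the remaining step.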

Available actions

Below is a list of all available actions:

Downloading information

  • fetch - download data from URL.
  • convert_to_json - parse downloaded data as JSON.
  • convert_to_html - parse downloaded data as HTML.
  • convert_to_xml - parse downloaded data as XML.

String manipulations

  • convert_to_time (format, timezone) - convert string to time.
  • append_with (string) - concatenate input with string.
  • prepend_with (string) - concatenate string with input.
  • remove_after (separator) - search for separator and remove part of input that follows the first occurrence.
  • remove_before (separator) - search for separator and remove part of input that precedes the first occurrence.
  • remove_matching (regexp) - stop recipe execution for strings that match the specified pattern.
  • replace (string, string) - substitute one block of text with another.
  • select_matching (regexp) - continue recipe execution only for those strings that match the specified pattern.
  • split (separator) - split string into multiple parts using specified separator.
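
As a rough guide to what some of these actions do, here are plain-Ruby equivalents built on the standard String and Enumerable methods. The helper names are hypothetical and this is not Saper's code. Two points are assumptions rather than documented behavior: that the separator itself is dropped along with the removed part, and that remove_matching discards the strings that match (the inverse of select_matching).

```ruby
# Hypothetical helpers mirroring the descriptions above; not part of Saper.

# Keep everything before the first occurrence of the separator.
# If the separator is absent, the input is returned unchanged.
def remove_after_sketch(input, separator)
  head, _sep, _tail = input.partition(separator)
  head
end

# Keep everything after the first occurrence of the separator.
# If the separator is absent, the input is returned unchanged.
def remove_before_sketch(input, separator)
  _head, sep, tail = input.partition(separator)
  sep.empty? ? input : tail
end

title   = remove_after_sketch("Markets rally | Bloomberg", " | ")
section = remove_before_sketch("Section: World", ": ")

paths   = ["/news/a", "/video/b", "/news/c"]
kept    = paths.grep(%r{\A/news/})    # like select_matching: keep matches
dropped = paths.grep_v(%r{\A/news/})  # like remove_matching: keep non-matches

puts title
puts section
puts kept.inspect
puts dropped.inspect
```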

HTML / XML manipulations

  • convert_to_markdown - return tag contents converted to markdown.
  • find (xpath) - return nodes matching specified XPath or CSS selector.
  • find_first (xpath) - return the first node matching specified XPath or CSS selector.
  • get_attribute (name) - return value of specified attribute.
  • get_contents - return tag contents including any child tags (similar to inner_html).
  • get_text - return text contents of a tag, skipping any tags (similar to inner_text).
  • remove_tags (name) - remove child tags including their content.
  • skip_tags (name) - remove child tags, preserving their content.

Special-purpose actions

  • set_input (string) - set input for the following action.
  • create_atom - create an Atom from saved variables.
  • run_recipe_and_save (recipe, variable) - run another recipe and save its result as a variable.
  • run_recipe (recipe) - run another recipe and use its output as input for the next action.
  • save (variable) - save input as a variable.

Contributing

  • Find something you would like to work on.
  • Fork the project and do your work in a topic branch.
  • Make sure your changes will work on both Ruby 1.8.7 and Ruby 1.9.
  • Add tests in the spec/ folder for the behavior you want to test.
  • Run all the tests using rake spec.
  • Commit your changes and send a pull request.

License

Copyright (c) 2013 Merimond Corporation. MIT license, see LICENSE for details.