Welcome to Sinew
Sinew collects structured data from web sites (screen scraping). It provides a Ruby DSL built for crawling, a robust caching system, and integration with Nokogiri. Though small, this project is the culmination of years of effort based on crawling systems built at several different companies.
Sinew is distributed as a Ruby gem:
$ gem install sinew
or in your Gemfile:
gem 'sinew'
Sinew 3 (May 2021)
I am pleased to announce the release of Sinew 3.0. Sinew has been streamlined and updated to use the Faraday HTTP client with sinew middleware for caching.
Breaking change
Sinew 3 uses a new format for cached responses. Old Sinew 2 cache directories should be removed before running Sinew again.
Quick Example
Here's an example for collecting the links from httpbingo.org:
# get the url
get "http://httpbingo.org"
# use nokogiri to collect links
noko.css("ul li a").each do |a|
  row = { }
  row[:url] = a[:href]
  row[:title] = a.text
  # append a row to the csv
  csv_emit(row)
end
If you paste this into a file called sample.sinew and run sinew sample.sinew, it will create a sample.csv file containing the href and text for each link.
How it Works
There are three main features provided by Sinew.
The Sinew DSL
Sinew uses recipe files to crawl web sites. Recipes have the .sinew extension, but they are plain old Ruby. The Sinew DSL makes crawling easy. Use get to make an HTTP GET:
get "https://www.google.com/search?q=darwin"
get "https://www.google.com/search", q: "charles darwin"
Once you've done a get, you have access to the document in a few different formats. In general, it's easiest to use noko to automatically parse and interact with the results. If Nokogiri isn't appropriate, you can fall back to regular expressions run against raw or html. Use json if you are expecting a JSON response.
get "https://www.google.com/search?q=darwin"
# pull out the links with nokogiri
links = noko.css("a").map { |i| i[:href] }
puts links.inspect
# or, use a regex (scan returns every match, not just the first)
links = html.scan(/<a[^>]+href="([^"]+)/).flatten
puts links.inspect
CSV Output
Recipes output CSV files. To continue the example above:
get "https://www.google.com/search?q=darwin"
noko.css("a").each do |i|
  row = { }
  row[:href] = i[:href]
  row[:text] = i.text
  csv_emit row
end
Sinew creates a CSV file with the same name as the recipe, and csv_emit(hash) appends a row. The values of your hash are converted to strings:
- Nokogiri nodes are converted to text
- Arrays are joined with "|", so you can separate them later
- HTML tags, entities and non-ascii chars are removed
- Whitespace is squished
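The conversion rules above can be approximated in plain Ruby. This is a rough sketch, not Sinew's actual implementation: normalize_value is a hypothetical helper, the tag and entity handling is deliberately crude, and Nokogiri nodes are left out so the snippet runs without any gems.

```ruby
# Hypothetical helper that mimics the conversion rules listed above.
# Not Sinew's real code; Nokogiri nodes are omitted to stay dependency-free.
def normalize_value(value)
  # Arrays are joined with "|" after each element is normalized
  return value.map { |v| normalize_value(v) }.join("|") if value.is_a?(Array)

  s = value.to_s
  s = s.gsub(/<[^>]+>/, " ")                                   # drop HTML tags
  s = s.gsub(/&#?\w+;/) { |e| e == "&amp;" ? "&" : " " }       # keep &, drop other entities
  s = s.gsub(/[^[:ascii:]]/, "")                               # drop non-ASCII characters
  s.gsub(/\s+/, " ").strip                                     # squish whitespace
end

puts normalize_value(["red", "green"])        # => red|green
puts normalize_value("  hello \n   world  ")  # => hello world
```

Keeping every cell as a flat ASCII string is what makes the resulting CSV easy to diff and to load into other tools.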
Caching
Sinew aggressively caches all HTTP responses to disk in ~/.sinew. Error responses are cached as well. Each URL will be hit exactly once, and requests are rate limited to one per second. Sinew tries to be polite.
Sinew never deletes files from the cache - that's up to you!
Because all requests are cached, you can run Sinew repeatedly with confidence. Run it over and over again while you build up your recipe.
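To make the idea concrete, here is a minimal sketch of a disk cache keyed by URL. It is an illustration only: fetch_cached, the MD5 file naming, and the temporary directory are assumptions for this example and do not reflect the layout Sinew actually uses in ~/.sinew.

```ruby
require "digest/md5"
require "fileutils"
require "tmpdir"

# Hypothetical disk cache keyed by URL. The block stands in for the real
# HTTP request and only runs when no cached copy exists yet.
def fetch_cached(url, cache_dir)
  path = File.join(cache_dir, Digest::MD5.hexdigest(url))
  return File.read(path) if File.exist?(path)

  body = yield(url)
  FileUtils.mkdir_p(cache_dir)
  File.write(path, body)
  body
end

# Demo with a fake "request" so nothing touches the network
CACHE_DEMO_HITS = []
Dir.mktmpdir do |dir|
  2.times do
    fetch_cached("http://httpbingo.org", dir) do |url|
      CACHE_DEMO_HITS << url
      "<html>fake body</html>"
    end
  end
end
puts CACHE_DEMO_HITS.length # the block ran once; the second call was served from disk
```

Because the key is derived from the URL alone, two requests to the same URL with different cookies would share one cached response, which is why authenticated crawls need care.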
DSL Reference
Making requests
- get(url, query = {}) - fetch a url with HTTP GET. URL parameters can be added using query.
- post(url, form = {}) - fetch a url with HTTP POST, using form as the URL encoded POST body.
- post_json(url, json = {}) - fetch a url with HTTP POST, using json as the POST body.
- http(method, url, options = {}) - use this for more complex requests
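The difference between query, form, and json is easiest to see by looking at what each one turns into. The sketch below uses only the Ruby standard library; url_with_query, form_body, and json_body are hypothetical helpers written for this example, and Sinew's own encoding may differ in detail.

```ruby
require "json"
require "uri"

# Hypothetical helpers showing what each kind of payload looks like.

# get(url, query): parameters end up in the query string
def url_with_query(url, query)
  query.empty? ? url : "#{url}?#{URI.encode_www_form(query)}"
end

# post(url, form): parameters become a URL encoded body
def form_body(form)
  URI.encode_www_form(form)
end

# post_json(url, json): parameters become a JSON body
def json_body(json)
  JSON.generate(json)
end

params = { q: "charles darwin" }
puts url_with_query("https://www.google.com/search", params)
puts form_body(params)
puts json_body(params)
```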
Parsing the response
These variables are set after each HTTP request.
- raw - the raw response from the last request
- html - like raw, but with a handful of HTML-specific whitespace cleanups
- noko - parse the response as HTML and return a Nokogiri document
- xml - parse the response as XML and return a Nokogiri document
- json - parse the response as JSON, with symbolized keys
- url - the url of the last request. If the request goes through a redirect, url will reflect the final url.
- uri - the URI of the last request. This is useful for resolving relative URLs.
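Two of these variables map closely onto the Ruby standard library, which makes them easy to experiment with outside a recipe. In this sketch, absolute_url and parse_json are hypothetical helpers, and page stands in for the uri variable on the assumption that it behaves like a regular URI object.

```ruby
require "json"
require "uri"

# Resolve an href found on a page against the page's own URI.
def absolute_url(page_uri, href)
  URI.join(page_uri, href).to_s
end

# Parse a JSON body with symbolized keys, as described for the json variable.
def parse_json(body)
  JSON.parse(body, symbolize_names: true)
end

page = URI.parse("https://example.com/catalog/index.html")
puts absolute_url(page, "item/42.html")        # => https://example.com/catalog/item/42.html
puts absolute_url(page, "/about")              # => https://example.com/about
puts parse_json('{"name": "darwin"}')[:name]   # => darwin
```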
Writing CSV
- csv_header(keys) - specify the columns for CSV output. If you don't call this, Sinew will use the keys from the first call to csv_emit.
- csv_emit(hash) - append a row to the CSV file
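The interplay between the two calls can be sketched with Ruby's csv library. TinyCsv is a hypothetical stand-in that writes to a string rather than a file; it only illustrates the "first row decides the columns" behavior described above and is not how Sinew is implemented.

```ruby
require "csv"

# Hypothetical stand-in for csv_header / csv_emit, writing to a string.
class TinyCsv
  def initialize
    @header = nil
    @lines = []
  end

  # Fix the column order up front (optional)
  def header(keys)
    @header = keys
  end

  # Append one row; the first row decides the columns if none were set
  def emit(row)
    @header ||= row.keys
    @lines << @header.map { |key| row[key] }
  end

  def to_s
    return "" unless @header # nothing emitted yet
    CSV.generate do |csv|
      csv << @header.map(&:to_s)
      @lines.each { |line| csv << line }
    end
  end
end

out = TinyCsv.new
out.emit({ url: "/a", title: "A" })
out.emit({ title: "B", url: "/b" }) # key order in later rows does not matter
puts out.to_s
```

Calling header first is useful when early rows might be missing a key that later rows have.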
Hints
Writing Sinew recipes is fun and easy. The built-in caching means you can iterate quickly, since you won't have to re-fetch the data. Here are some hints for writing idiomatic recipes:
- Sinew doesn't (yet) check robots.txt - please check it manually.
- Prefer Nokogiri over regular expressions wherever possible. Learn CSS selectors.
- In Chrome, $ in the console is your friend.
- Fall back to regular expressions if you're desperate. Depending on the site, use either raw or html. html is probably your best bet. raw is good for crawling JavaScript, but it's fragile if the site changes.
- Learn to love String#[regexp], which is an obscure operator but incredibly handy for Sinew.
- Laziness is useful. Keep your CSS selectors and regular expressions simple, so maybe they'll work again the next time you need to crawl a site.
- Don't be afraid to mix CSS selectors, regular expressions, and Ruby:
noko.css("table")[4].css("td").select { |i| i[:width].to_i > 80 }.map(&:text)
- Debug your recipes using plain old puts, or better yet use ap from amazing_print.
- Run sinew -v to get a report on every csv_emit. Very handy.
- Add the CSV files to your git repo. That way you can version them and get diffs!
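For readers who have not met String#[regexp] before, here is a small standalone sketch. The sample markup and the helper names first_price and next_page are made up for this example; in a recipe you would apply the same expressions directly to html or raw.

```ruby
# Hypothetical helpers; in a recipe you would call html[...] directly.

# String#[regexp, n] returns capture group n of the first match, or nil
def first_price(html)
  html[/\$([\d.]+)/, 1]
end

def next_page(html)
  html[/<a[^>]+href="([^"]+)"[^>]*>Next/, 1]
end

sample = '<div class="price">Price: <b>$12.50</b></div><a href="/list?page=2">Next</a>'

puts sample[/\$[\d.]+/]    # => $12.50 (no capture index: the whole match)
puts first_price(sample)   # => 12.50
puts next_page(sample)     # => /list?page=2
p first_price("sold out")  # => nil
```

The nil on a failed match is convenient in recipes, since a missing value simply becomes an empty CSV cell instead of raising.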
Limitations
- Caching is based on URL, so use caution with cookies and other forms of authentication
- Almost no support for international (non-English) characters
Changelog
3.0.0 (May 2021)
- Major rewrite of network and caching layer. See above.
- Use Faraday HTTP client with sinew middleware for caching.
- Supports multiple proxies (--proxy host1,host2,...)
2.0.4 (May 2018)
- Handle and cache more errors (too many redirects, connection failures, etc.)
- Support for adding uri.scheme in generate_cache_key
- Added status code, a peer to uri, raw, etc.
2.0.3 (May 2018)
- &amp; now normalizes to & (not and)
2.0.2 (May 2018)
- Support for --limit, --proxy and the xml variable
- Dedup - warn and ignore if row[:url] has already been emitted
- Auto gunzip if contents are compressed
2.0.1 (May 2018)
- Support for legacy cached head files from Sinew 1
2.0.0 (May 2018)
- Complete rewrite. See above.
1.0.3 (June 2012)
...
License
This extension is licensed under the MIT License.