Staticizer

A tool to create a static version of a website for hosting on S3.

Rationale

One of our clients needed a reliable emergency backup for a website. If the website goes down this backup would be available with reduced functionality.

S3 and Route 53 provide an great way to host a static emergency backup for a website. See this article - http://aws.typepad.com/aws/2013/02/create-a-backup-website-using-route-53-dns-failover-and-s3-website-hosting.html . In our experience it works well and is incredibly cheap. Our average sized website with a few hundred pages and assets is less than US$1 a month.

We tried using existing tools httrack/wget to crawl and create a static version of the site to upload to S3, but we found that they did not work well with S3 hosting. We wanted the site uploaded to S3 to respond to the exact same URLs (where possible) as the existing site. This way when the site goes down incoming links from Google search results etc. will still work.

TODO

Abillity to specify AWS credentials via file or environment options
Tests!
Decide what to do with URLs with query strings. Currently they are crawled and uploaded to S3, but those keys cannot be accessed. ex http://squaremill.com/file?test=1 will be uploaded with the key file?test=1, but can only be accessed by encoding the ? like this %3Ftest=1
Create a 404 file on S3
Provide the option to rewrite absolute URLs to relative urls so that hosting can work on a different domain.
Multithread the crawler
Check for too many redirects
Provide regex options for what urls are scraped
Better handling of incorrect server mime types (ex. server returns text/plain for css instead of text/css)
Provide more options for uploading (upload via scp, ftp, custom etc.). Split out save/uploading into an interface.
Handle large files in a more memory efficient way by streaming uploads/downloads

Installation

Add this line to your application's Gemfile:

gem 'staticizer'

And then execute:

$ bundle

Or install it yourself as:

$ gem install staticizer

Command line usage

Staticizer can be used through the commandline tool or by requiring the library.

Crawl a website and write to disk

staticizer http://squaremill.com -output-dir=/tmp/crawl

Crawl a website and upload to AWS

staticizer http://squaremill.com -aws-s3-bucket=squaremill.com --aws-access-key=HJFJS5gSJHMDZDFFSSDQQ --aws-secret-key=HIA7T189234aADfFAdf322Vs12duRhOHy+23mc1+s

Crawl a website and allow several domains to be crawled

staticizer http://squaremill.com --valid-domains=squaremill.com,www.squaremill.com,img.squaremill.com

Code Usage

For all these examples you must first:

require 'staticizer'

Crawl a website and upload to AWS

This will only crawl urls in the domain squaremill.com

s = Staticizer::Crawler.new("http://squaremill.com",
  :aws => {
    :region => "us-west-1",
    :endpoint => "http://s3.amazonaws.com",
    :bucket_name => "www.squaremill.com",
    :secret_access_key => "HIA7T189234aADfFAdf322Vs12duRhOHy+23mc1+s",
    :access_key_id => "HJFJS5gSJHMDZDFFSSDQQ"
  }
)
s.crawl

Crawl a website and write to disk

s = Staticizer::Crawler.new("http://squaremill.com", :output_dir => "/tmp/crawl")
s.crawl

Crawl a website and make all pages contain 'noindex' meta tag

s = Staticizer::Crawler.new("http://squaremill.com",
  :output_dir => "/tmp/crawl",
  :process_body => lambda {|body, uri, opts|
    # not the best regex, but it will do for our use
    body = body.gsub(/<meta\s+name=['"]robots[^>]+>/i,'')
    body = body.gsub(/<head>/i,"<head>\n<meta name='robots' content='noindex'>")
    body
  }
)
s.crawl

Crawl a website and rewrite all non www urls to www

s = Staticizer::Crawler.new("http://squaremill.com",
  :aws => {
    :region => "us-west-1",
    :endpoint => "http://s3.amazonaws.com",
    :bucket_name => "www.squaremill.com",
    :secret_access_key => "HIA7T189234aADfFAdf322Vs12duRhOHy+23mc1+s",
    :access_key_id => "HJFJS5gSJHMDZDFFSSDQQ"
  },
  :filter_url => lambda do |url, info|
    # Only crawl URL if it matches squaremill.com or www.squaremil.com
    if url =~ %r{https?://(www\.)?squaremill\.com}
      # Rewrite non-www urls to www
      return url.gsub(%r{https?://(www\.)?squaremill\.com}, "http://www.squaremill.com")
    end
    # returning nil here prevents the url from being crawled
  end
)
s.crawl

Crawler Options

:aws - Hash of connection options passed to aws/sdk gem
:filter_url - lambda called to see if a discovered URL should be crawled, return the url (can be modified) to crawl, return nil otherwise
:output_dir - if writing a site to disk the directory to write to, will be created if it does not exist
:logger - A logger object responding to the usual Ruby Logger methods.
:log_level - Log level - defaults to INFO.
:valid_domains - Array of domains that should be crawled. Domains not in this list will be ignored.
:process_body - lambda called to pre-process body of content before writing it out.
:skip_write - don't write retrieved files to disk or s3, just crawl the site (can be used to find 404s etc.)

Contributing

Fork it
Create your feature branch (git checkout -b my-new-feature)
Commit your changes (git commit -am 'Add some feature')
Push to the branch (git push origin my-new-feature)
Create new Pull Request