Snapcrawl - crawl a website and take screenshots

Snapcrawl is a command line utility for crawling a website and saving screenshots.

Features

Crawls a website to any given depth and saves screenshots
Can capture the full length of the page
Can use a specific resolution for screenshots
Skips capturing if the screenshot was already saved recently
Uses local caching to avoid expensive crawl operations if not needed
Reports broken links

Prerequisites

Snapcrawl requires PhantomJS and ImageMagick.

Docker Image

You can run Snapcrawl using this Docker image (which contains all the necessary prerequisites):

$ docker pull dannyben/snapcrawl

Then you can use it like this:

$ docker run --rm -it dannyben/snapcrawl --help

For more information, refer to the docker-snapcrawl repository.

Install

$ gem install snapcrawl

Usage

$ snapcrawl --help

Snapcrawl

Usage:
  snapcrawl go <url> [options]
  snapcrawl -h | --help 
  snapcrawl -v | --version

Options:
  -f --folder <path>     Where to save screenshots [default: snaps]
  -a --age <n>           Number of seconds to consider screenshots fresh
                         [default: 86400]
  -d --depth <n>         Number of levels to crawl [default: 1]
  -W --width <n>         Screen width in pixels [default: 1280]
  -H --height <n>        Screen height in pixels. Use 0 to capture the full 
                         page [default: 0]
  -s --selector <s>      CSS selector to capture
  -o --only <regex>      Include only URLs that match <regex>
  -h --help              Show this screen
  -v --version           Show version

Examples:
  snapcrawl go example.com
  snapcrawl go example.com -d2 -fscreens
  snapcrawl go example.com -d2 > out.txt 2> err.txt &
  snapcrawl go example.com -W360 -H480
  snapcrawl go example.com --selector "#main-content"
  snapcrawl go example.com --only "products|collections"
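
To illustrate how a value passed to --only narrows a crawl, the following Ruby sketch filters a list of URLs with the same pattern used in the last example. This is not Snapcrawl's own code: the helper name filter_urls and the sample URLs are hypothetical, and testing the pattern against the full URL is an assumption.

```ruby
# Hypothetical helper, not part of Snapcrawl: keep only the URLs that
# match the given pattern (assumed to be tested against the full URL).
def filter_urls(urls, pattern)
  regex = Regexp.new(pattern)
  urls.select { |url| url.match?(regex) }
end

# Sample URLs, made up for illustration.
urls = [
  "http://example.com/products/shoes",
  "http://example.com/about",
  "http://example.com/collections/summer"
]

puts filter_urls(urls, "products|collections")
```

Because the pattern is a regular expression, the | character acts as alternation, so a URL is kept if it contains either word.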

Notes

If a URL cannot be found, Snapcrawl reports it to stderr. To save these errors as a report, redirect stderr to a file:

$ snapcrawl go example.com 2> err.txt
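
The --age option sets how many seconds an existing screenshot is considered fresh, and fresh screenshots are not captured again. One plausible way to make that decision is to compare the file's modification time with the current time, as in the Ruby sketch below. The helper name fresh? and the file path are hypothetical, and this is an assumption about the approach rather than Snapcrawl's actual implementation.

```ruby
# Hypothetical helper, not Snapcrawl's actual code: a screenshot is
# treated as fresh if it exists and was modified less than max_age
# seconds ago (86400 mirrors the documented --age default).
def fresh?(path, max_age = 86400)
  File.exist?(path) && (Time.now - File.mtime(path)) < max_age
end

# The path below is made up for illustration.
puts fresh?("snaps/example-com.png")
```

With the default of 86400 seconds, a screenshot saved within the last 24 hours would be skipped under this scheme, while a missing or older file would be captured again.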