# Snapcrawl - crawl a website and take screenshots
Snapcrawl is a command line utility for crawling a website and saving screenshots.
## Features
- Crawls a website to any given depth and saves screenshots
- Can capture the full length of the page
- Can use a specific resolution for screenshots
- Skips capturing if the screenshot was already saved recently
- Uses local caching to avoid expensive crawl operations if not needed
- Reports broken links
## Install
### Using Docker
You can run Snapcrawl by using this docker image (which contains all the necessary prerequisites):
```shell
$ alias snapcrawl='docker run --rm -it --volume $PWD:/app dannyben/snapcrawl'
```
For more information on the Docker image, refer to the docker-snapcrawl repository.
### Using Ruby
```shell
$ gem install snapcrawl
```
Note that Snapcrawl requires PhantomJS and ImageMagick.
## Usage
Snapcrawl can be configured either through a configuration file (YAML), or by specifying options in the command line.
```shell
$ snapcrawl

Usage:
  snapcrawl URL [--config FILE] [SETTINGS...]
  snapcrawl -h | --help
  snapcrawl -v | --version
```
The default configuration filename is `snapcrawl.yml`. Using the `--config` flag will create a template configuration file if it is not present:
```shell
$ snapcrawl example.com --config snapcrawl
```
### Specifying options in the command line
All configuration options can be specified in the command line as `key=value` pairs:
```shell
$ snapcrawl example.com log_level=0 depth=2 width=1024
```
### Sample configuration file
```yaml
# All values below are the default values

# log level (0-4) 0=DEBUG 1=INFO 2=WARN 3=ERROR 4=FATAL
log_level: 1

# log_color (yes, no, auto)
#   yes  = always show log color
#   no   = never use colors
#   auto = only use colors when running in an interactive terminal
log_color: auto

# number of levels to crawl, 0 means capture only the root URL
depth: 1

# screenshot width in pixels
width: 1280

# screenshot height in pixels, 0 means the entire height
height: 0

# number of seconds to consider the page cache and its screenshot fresh
cache_life: 86400

# where to store the HTML page cache
cache_dir: cache

# where to store screenshots
snaps_dir: snaps

# screenshot filename template, where '%url' will be replaced with a
# slug version of the URL (no need to include the .png extension)
name_template: '%url'

# urls not matching this regular expression will be ignored
url_whitelist:

# urls matching this regular expression will be ignored
url_blacklist:

# take a screenshot of this CSS selector only
css_selector:
```
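Since the values above are the defaults, a configuration file only needs the keys you want to change. For example, a minimal `snapcrawl.yml` might look like this (the specific values here are illustrative, not recommendations):

```yaml
# snapcrawl.yml - override only what differs from the defaults
depth: 2
width: 1024
url_blacklist: logout
```

Any key omitted from the file falls back to its default value.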
## Contributing / Support
If you experience any issues, have a question or a suggestion, or if you wish to contribute, feel free to open an issue.