HTML::Proofer

If you generate HTML files, then this tool might be for you.

HTML::Proofer is a set of tests to validate your HTML output. These tests check if your image references are legitimate, if they have alt tags, if your internal links are working, and so on. It's intended to be an all-in-one checker for your output.

Build Status Gem Version

Installation

Add this line to your application's Gemfile:

gem 'html-proofer'

And then execute:

$ bundle

Or install it yourself as:

$ gem install html-proofer

NOTE: When installation speed matters, set NOKOGIRI_USE_SYSTEM_LIBRARIES to true in your environment. This is useful for increasing the speed of your Continuous Integration builds.

Real-life examples

Project Repository
Raspberry Pi documentation raspberrypi/documentation
Open Whisper Systems website WhisperSystems/whispersystems.org
Jekyll website jekyll/jekyll

What's Tested?

Images

img elements:

  • Whether all your images have alt tags
  • Whether your internal image references are not broken
  • Whether external images are showing

a, link elements:

  • Whether your internal links are not broken; this includes hash references (#linkToMe)
  • Whether external links are working

Scripts

script elements:

  • Whether your internal script references are not broken
  • Whether external scripts are loading

Favicon

Checks if your favicons are valid. This is an optional feature, set the check_favicon option to turn it on.

HTML

Nokogiri looks at the markup and provides errors when parsing your document. This is an optional feature, set the check_html option to enable validation errors from Nokogiri.

Usage

Using in a script

Require the gem; generate some HTML; create a new instance of the HTML::Proofer on your output folder; then run it. Here's a simple example:

require 'html/proofer'
require 'html/pipeline'
require 'find'

# make an out dir
Dir.mkdir("out") unless File.exist?("out")

pipeline = HTML::Pipeline.new [
  HTML::Pipeline::MarkdownFilter,
  HTML::Pipeline::TableOfContentsFilter
], :gfm => true

# iterate over files, and generate HTML from Markdown
Find.find("./docs") do |path|
  if File.extname(path) == ".md"
    contents = File.read(path)
    result = pipeline.call(contents)

    File.open("out/#{path.split("/").pop.sub('.md', '.html')}", 'w') { |file| file.write(result[:output].to_s) }
  end
end

# test your out dir!
HTML::Proofer.new("./out").run

Using on the command-line

You'll get a new program called htmlproof with this gem. Terrific!

Use it like you'd expect to:

htmlproof ./out --href-swap wow:cow,mow:doh --ext .html.erb --href-ignore www.github.com

Note: since href_swap is a bit special, you'll pass in a pair of RegEx:String values. htmlproof will figure out what you mean.

Using with Jekyll

Want to use HTML Proofer with your Jekyll site? Awesome. Simply add gem 'html-proofer' to your Gemfile as described above, and add the following to your Rakefile, using rake test to execute:

require 'html/proofer'

task :test do
  sh "bundle exec jekyll build"
  HTML::Proofer.new("./_site").run
end

Don't have or want a Rakefile? You could also do something like the following:

htmlproof ./_site

Configuration

The HTML::Proofer constructor takes an optional hash of additional options:

Option Description Default
alt_ignore An array of Strings or RegExps containing imgs whose missing alt tags are safe to ignore. []
empty_alt_ignore If true, ignores images with empty alt tags. false
check_external_hash Checks whether external hashes exist (even if the website exists). This slows the checker down. false
checks_to_ignore An array of Strings indicating which checks you'd like to not perform. []
directory_index_file Sets the file to look for when a link refers to a directory. index.html
disable_external If true, does not run the external link checker, which can take a lot of time. false
error_sort Defines the sort order for error output. Can be :path, :desc, or :status. :path
ext The extension of your HTML files including the dot. .html
file_ignore An array of Strings or RegExps containing file paths that are safe to ignore. []
href_ignore An array of Strings or RegExps containing hrefs that are safe to ignore. Note that non-HTTP(S) URIs are always ignored. []
href_swap A hash containing key-value pairs of RegExp => String. It transforms links that match RegExp into String via gsub. {}
only_4xx Only reports errors for links that fall within the 4xx status code range. false
url_ignore An array of Strings or RegExps containing URLs that are safe to ignore. It affects all HTML attributes. Note that non-HTTP(S) URIs are always ignored. []
check_favicon Enables the favicon checker. false
check_html Enables HTML validation errors from Nokogiri false
verbose If true, outputs extra information as the checking happens. Useful for debugging. false

Configuring Typhoeus and Hydra

Typhoeus is used to make fast, parallel requests to external URLs. You can pass in any of Typhoeus' options for the external link checks with the options namespace of :typhoeus. For example:

HTML::Proofer.new("out/", {:ext => ".htm", :typhoeus => { :verbose => true, :ssl_verifyhost => 2 } })

This sets HTML::Proofer's extensions to use .htm, and gives Typhoeus a configuration for it to be verbose, and use specific SSL settings. Check the Typhoeus documentation for more information on what options it can receive.

You can similarly pass in a :hydra option with a hash configuration for Hydra.

The default value is typhoeus => { :followlocation => true }.

Configuring Parallel

Parallel is being used to speed internal file checks. You can pass in any of its options with the options "namespace" :parallel. For example:

HTML::Proofer.new("out/", {:ext => ".htm", :parallel => { :in_processes => 3} })

:in_processes => 3 will be passed into Parallel as a configuration option.

Instead of a directory as the first argument, you can also pass in an array of links:

HTML::Proofer.new(["http://github.com", "http://jekyllrb.com"])

This configures Proofer to just test those links to ensure they are valid. Note that for the command-line, you'll need to pass a special --as-links argument:

bin/htmlproof www.google.com,www.github.com --as-links

Ignoring content

Add the data-proofer-ignore attribute to any tag to ignore it from the checks.

<a href="http://notareallink" data-proofer-ignore>Not checked.</a>

Custom tests

Want to write your own test? Sure! Just create two classes--one that inherits from HTML::Proofer::CheckRunner, and another that inherits from HTML::Proofer::Checkable.

The CheckRunner subclass must define one method called run. This is called on your content, and is responsible for performing the validation on whatever elements you like. When you catch a broken issue, call add_issue(message) to explain the error.

The Checkable subclass defines various helper methods you can use as part of your test. Usually, you'll want to instantiate it within run. You have access to all of your element's attributes.

Here's an example custom test that protects against mailto links that point to [email protected]:

class OctocatLinkCheck < ::HTML::Proofer::Checkable

  def mailto?
    return false if @data_ignore_proofer || @href.nil? || @href.empty?
    return @href.match /^mailto\:/
  end

  def octocat?
    return @href.match /\:[email protected]\Z/
  end

end

class MailToOctocat < ::HTML::Proofer::CheckRunner

  def run
    @html.css('a').each do |l|
      link = OctocatLinkCheck.new l, self

      if link.mailto? && link.octocat?
        return add_issue("Don't email the Octocat directly!")
      end
    end
  end
end

Troubleshooting

Certificates

To ignore certificates turn off the Typhoeus SSL verification:

HTML::Proofer.new("out/", {
  :typhoeus => {
    :ssl_verifypeer => false,
    :ssl_verifyhost => 0}
}).run

User-Agent

To change the User-Agent used by Typhoeus:

HTML::Proofer.new("out/", {
  :typhoeus => {
    :headers => { "User-Agent" => "Mozilla/5.0 (compatible; My New User-Agent)" }
}}).run