`httpspell`

This is a spellchecker that recursively fetches HTML pages, converts them to plain text (using pandoc), and spellchecks them with hunspell. Unknown words will be printed to stdout, which makes the tool a good candidate for CI pipelines where you might want to take action when a spelling error is found on a web page.

Words that are not in the dictionary for the given language (inferred from the lang attribute of the HTML document's root element) can be added to a personal dictionary; httpspell then treats them as correctly spelled.
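
For a single page, the fetch, convert, and check steps described above can be approximated with a plain shell pipeline. This is only a sketch, not how httpspell itself is implemented: it does not recurse, does not use a personal dictionary, does not remove any nodes before conversion, and takes the language as an argument instead of inferring it from the lang attribute. The function name spellcheck_page and the en_US default are assumptions made for illustration; curl, pandoc, and hunspell must be installed.

```shell
# Hypothetical helper: list the unknown words of a single HTML page.
# Usage: spellcheck_page URL [LANGUAGE]
spellcheck_page() {
  url=$1
  lang=${2:-en_US}   # assumed default; must match an installed hunspell dictionary

  # 1. fetch the page, 2. convert HTML to plain text,
  # 3. let hunspell list (-l) the misspelled words, 4. de-duplicate them.
  curl --silent --show-error --location "$url" \
    | pandoc --from html --to plain \
    | hunspell -d "$lang" -l \
    | sort -u
}
```

Calling spellcheck_page https://example.com en_US would print one unknown word per line, or nothing if hunspell knows every word.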

Usage

The following command will retrieve the HTML document at https://example.com, spellcheck it, and not print anything because there are no errors:

  $ httpspell https://example.com

The exit code is 0.

The following command will spellcheck the README of this project as rendered by GitHub, and print a list of unknown words. Note that we set the language to en_US because GitHub declares 'en' as the document language, but the installed dictionaries usually refer to a specific language variant like en_US:

  $ httpspell https://github.com/suhlig/httpspell/blob/master/README.markdown --language en_US
  suhlig
  Permalink
  httpspell
  sloc
  pandoc
  hunspell
  ...

The exit code is 1.
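
Because the exit code distinguishes a clean page from one with unknown words, a CI step can wrap the call in a small shell function. The following is a minimal sketch under a few assumptions: the helper name check_spelling is hypothetical, the language is fixed to en_US, and any non-zero exit code (not only 1) is treated as a failure.

```shell
# Hypothetical CI helper: fail when httpspell reports unknown words.
# Usage: check_spelling URL
check_spelling() {
  url=$1

  # The exit status of the assignment is the exit status of httpspell.
  if unknown=$(httpspell "$url" --language en_US); then
    echo "No unknown words at $url"
  else
    echo "Unknown words at $url:" >&2
    echo "$unknown" >&2
    return 1
  fi
}
```

A CI script could then call check_spelling https://example.com and let the step fail on a non-zero return value.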

What is not checked

When spidering a site, httpspell skips all responses with a content-type header other than text/html (unless it is pointed at a file, in which case it accepts anything).

Before converting, httpspell removes the following nodes from the HTML DOM, as they are not good targets for spellchecking:
- code
- pre
- Elements with spellcheck='false' (this is how HTML5 allows tagging elements as being a target for spellchecking or not)
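
To see ahead of time whether a URL would be skipped under the content-type rule at the start of this section, the header can be inspected with curl. This is a rough sketch, not httpspell's own check: the function name is_html is hypothetical, and it assumes the server answers HEAD requests with the same content-type it would send for GET.

```shell
# Hypothetical helper: succeed only if the URL is served as text/html.
# Usage: is_html URL
is_html() {
  # --head sends a HEAD request; --write-out prints only the content-type.
  content_type=$(curl --silent --head --location \
    --output /dev/null --write-out '%{content_type}' "$1")

  case $content_type in
    text/html*) return 0 ;;   # also matches "text/html; charset=utf-8"
    *)          return 1 ;;
  esac
}
```

For example, is_html https://example.com && echo "would be checked" prints the message only for pages served as text/html.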

Misc

If you produce content with kramdown (e.g. using Jekyll), setting spellcheck='false' for an element is as simple as adding this line after the element (e.g. a heading):

{: spellcheck="false"}