This is a spellchecker that recursively fetches HTML pages, converts them to plain text (using pandoc), and spellchecks them with hunspell. Unknown words will be printed to
stdout, which makes the tool a good candidate for CI pipelines where you might want to take action when a spelling error is found on a web page.
Words that are not in the dictionary for the given language (inferred from the
lang attribute of the HTML document's root element) can be added to a personal dictionary, which will mark the word as correctly spelled.
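As a rough illustration of how the document language could be read from the root element, here is a minimal Python sketch. It is not httpspell's actual implementation; the names `RootLangParser` and `document_language` are hypothetical and exist only for this example.

```python
from html.parser import HTMLParser
from typing import Optional


class RootLangParser(HTMLParser):
    """Records the lang attribute of the first <html> start tag (hypothetical helper)."""

    def __init__(self) -> None:
        super().__init__()
        self.lang: Optional[str] = None
        self._seen_root = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "html" and not self._seen_root:
            self._seen_root = True
            self.lang = dict(attrs).get("lang")


def document_language(html: str) -> Optional[str]:
    """Return the root element's lang attribute, or None if it is absent."""
    parser = RootLangParser()
    parser.feed(html)
    parser.close()
    return parser.lang
```

A page may declare only a generic language such as `en`, while installed dictionaries are often named after a specific variant, so the inferred value may still need to be overridden on the command line.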
- The following command will retrieve the HTML document at https://example.com, spellcheck it, and not print anything because there are no errors:
$ httpspell https://example.com
The exit code is
- The following command will spellcheck the README of this project as rendered by GitHub, and print a list of unknown words. Note that we set the language to
en_US because GitHub declares 'en' as document language, but the installed dictionaries usually refer to a specific language variant like
$ httpspell https://github.com/suhlig/httpspell/blob/master/README.markdown --language en_US
suhlig
Permalink
httpspell
sloc
pandoc
hunspell
...
The exit code is
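For CI use, one option is to wrap the command and act on whatever it prints. The following Python sketch relies only on what is described above: unknown words go to stdout, and `--language` sets the dictionary. The helper names (`build_command`, `parse_unknown_words`, `check_page`) are hypothetical, and the sketch assumes the words are separated by whitespace.

```python
import subprocess
from typing import List, Optional


def build_command(url: str, language: Optional[str] = None) -> List[str]:
    """Assemble the httpspell invocation shown in the examples above."""
    command = ["httpspell", url]
    if language:
        command += ["--language", language]
    return command


def parse_unknown_words(stdout: str) -> List[str]:
    """Split the tool's output into words (assumes whitespace-separated words)."""
    return stdout.split()


def check_page(url: str, language: Optional[str] = None) -> List[str]:
    """Run httpspell (must be installed and on PATH) and return the unknown words."""
    result = subprocess.run(
        build_command(url, language), capture_output=True, text=True
    )
    return parse_unknown_words(result.stdout)


# Example use in a CI step (not executed here):
#   words = check_page("https://example.com")
#   if words:
#       raise SystemExit("Unknown words: " + ", ".join(words))
```

Deciding on the captured output rather than on the process status keeps the sketch independent of any particular exit code convention.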
What is not checked
- When spidering a site, httpspell will skip all responses with a content-type header other than text/html (unless it is pointed at a file, in which case it accepts anything).
- Before converting, httpspell removes the following nodes from the HTML DOM as they are not a good target for spellchecking:
  - Elements with spellcheck='false' (this is how HTML5 allows tagging elements as being a target for spellchecking or not)
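The two rules above can be sketched in Python as follows. This is an illustration only, not the tool's own code: `should_check` and `spellcheckable_text` are hypothetical names, the sketch uses only the standard library, and it assumes reasonably well-formed markup in which every non-void element is closed.

```python
from html.parser import HTMLParser
from typing import List, Optional

# Elements that never have an end tag, so they must not affect nesting depth.
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


def should_check(content_type: Optional[str], is_local_file: bool = False) -> bool:
    """Accept local files as-is; otherwise require a text/html media type."""
    if is_local_file:
        return True
    if not content_type:
        return False
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "text/html"


class _SpellcheckableText(HTMLParser):
    """Collects text, skipping subtrees rooted at spellcheck="false" elements."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: List[str] = []
        self._skip_depth = 0  # > 0 while inside an excluded subtree

    def handle_starttag(self, tag, attrs):
        if tag in VOID_ELEMENTS:
            return
        if self._skip_depth:
            self._skip_depth += 1
        elif (dict(attrs).get("spellcheck") or "").lower() == "false":
            self._skip_depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            return
        if self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def spellcheckable_text(html: str) -> str:
    """Return the document's text without content marked spellcheck="false"."""
    parser = _SpellcheckableText()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)
```

Tracking a depth counter instead of building a full DOM keeps the example short, at the cost of mishandling markup with unclosed elements inside an excluded subtree.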
If you produce content with kramdown (e.g. using Jekyll), setting
spellcheck='false' for an element is as simple as adding this line after the element (e.g. a heading):