Gem to convert an HTML document into Word document (.doc) format. This is intended for automated generation of Microsoft Word documents from HTML documents, which are much more readily crafted.

This gem originated out of https://github.com/riboseinc/asciidoctor-iso, which creates a Word document from an automatically generated HTML document (created in turn by processing Asciidoc).

This work is driven by the Word document generation procedure documented in http://sebsauvage.net/wiki/doku.php?id=word_document_generation. For more on the approach taken, and on alternative approaches, see https://github.com/riboseinc/html2doc/wiki/Why-not-docx%3F

The gem currently does the following:

  • Convert any AsciiMath and MathML to Word’s native mathematical formatting language, OOXML. Word supports copy-pasting MathML into Word and converting it into OOXML; however, the conversion is not infallible (we have found problems with \sum: Word claims that parameters are missing, and inserts dotted squares to indicate as much), and you may need to post-edit the OOXML.

    • The gem does attempt to repair the MathML input, to bring it in line with the expectations of Word’s OOXML. If you find any issues with AsciiMath or MathML input, please raise an issue.

  • Identify any footnotes in the document (defined as hyperlinks with attributes class = "Footnote" or epub:type = "footnote"), and render them as Microsoft Word footnotes.

  • Resize any images in the HTML file to fit within the maximum page size. (Word will otherwise crash on reading the document.)

  • Optionally apply list styles with predefined bullet and numbering from a Word CSS to the unordered and ordered lists in the document, restarting numbering for each ordered list.

  • Convert any internal @id anchors to a@name anchors; Word only hyperlinks to the latter.

  • Generate a filelist.xml listing of all files to be bundled into the Word document.

  • Assign the class MsoNormal to any paragraphs that do not have a class, so that they can be treated as Normal Style when editing the Word document.

  • Inject Microsoft Word-specific CSS into the HTML document. If a CSS file is not supplied, the file at lib/html2doc/wordstyle.css is used by default. Microsoft Word HTML has particular requirements of its CSS, and you should review the sample CSS before replacing it with your own. (This generic CSS can be overridden by CSS already in the HTML document, since the generic CSS is injected at the top of the document.)

  • Bundle up the images, the HTML file of the document proper, and the header.html file representing header/footer information, into a MIME file, and save that file to disk (so that Microsoft Word can deal with it as a Word file.)
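The image-resizing step in the list above can be sketched as a proportional scale-down. This is a minimal illustration, not the gem's own code: the method name fit_within and the page limits (400 by 680 pixels) are assumptions chosen for the example, and the gem may use different limits and rounding.

```ruby
# Hypothetical page limits in pixels; the gem's actual limits may differ.
MAX_WIDTH = 400
MAX_HEIGHT = 680

# Scale an image down so that it fits inside the page, preserving its
# aspect ratio. Images that already fit are left at their original size.
def fit_within(width, height, max_width: MAX_WIDTH, max_height: MAX_HEIGHT)
  scale = [max_width.to_f / width, max_height.to_f / height, 1.0].min
  [(width * scale).round, (height * scale).round]
end

p fit_within(800, 600)   # too wide: scaled by the width ratio
p fit_within(200, 100)   # already fits: unchanged
```

Taking the smaller of the two ratios, capped at 1.0, keeps both dimensions within the page and never enlarges an image.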

For a representative generator of HTML that uses this gem in postprocessing, see https://github.com/riboseinc/asciidoctor-iso

Constraints

This generates .doc documents. Future versions may upgrade the output to docx.

There are two other Microsoft Word generators in the Ruby ecosystem.

  • https://github.com/jetruby/puredocx generates Word documents from a Ruby struct as a DSL, rather than converting a preexisting HTML document. That constrains its coverage to what is explicitly catered for in the DSL.

  • https://github.com/MuhammetDilmac/Html2Docx is a much simpler wrapper around HTML: it does not provide any of the added functionality described above (image resizing, converting footnotes, AsciiMath and MathML). However, it does already generate docx, which involves many more auxiliary files than the .doc format. (Any attempt to generate docx through this gem will likely involve Html2Docx.)

Usage

require "html2doc"

Html2Doc.process(result, filename: filename, stylesheet: stylesheet, header_filename: header_filename, dir: dir, asciimathdelims: asciimathdelims, liststyles: liststyles)
result

is the HTML document to be converted into Word, as a string.

filename

is the name the document is to be saved as, without a file suffix.

stylesheet

is the full path filename of the CSS stylesheet for Microsoft Word-specific styles. If this is not provided, the program will use the default stylesheet included in the gem, lib/html2doc/wordstyle.css. The stylesheet provided must match this stylesheet; you can obtain one by saving a Word document with your desired styles to HTML, and extracting the style definitions from the HTML document header.

header_filename

is the filename of the HTML document containing header and footer for the document, as well as footnote/endnote separators; if there is none, use nil. To generate your own such document, save a Word document with headers/footers and/or footnote/endnote separators as an HTML document; the header.html will be in the {filename}.fld folder generated along with the HTML. A sample file is available at https://github.com/riboseinc/asciidoctor-iso/blob/master/lib/asciidoctor/iso/word/header.html

dir

is the folder that any ancillary files (images, headers, filelist) are to be saved to. If not provided, it will be created as {filename}_files. Anything in the directory will be attached to the Word document; so this folder should only contain the images that accompany the document. (If the images are elsewhere on the local drive, the gem will move them into the folder.)

asciimathdelims

are the AsciiMath delimiters used in the text (an array of an opening and a closing delimiter). If none are provided, no AsciiMath conversion is attempted.

liststyles

is a hash of list style labels in Word CSS, which are used to define the behaviour of list item labels (e.g. i) vs i.). The gem recognises the hash keys ul, ol. So if the appearance of an ordered list’s item labels in the supplied stylesheet is governed by style @list l1 (e.g. @list l1:level1 {mso-level-text:"%1\)";} appears in the stylesheet), call the method with liststyles:{ol: "l1"}.
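A call with concrete values might look like the sketch below. The values are illustrative assumptions: the document name "report", the {{ and }} AsciiMath delimiters, and the l1 list label are not taken from any real project. The stylesheet and dir options are omitted so that the defaults described above apply.

```ruby
# Illustrative input; any HTML string would do.
html = "<html><body><p>Area: {{pi r^2}}</p><ol><li>First</li></ol></body></html>"

options = {
  filename: "report",               # saved without a file suffix
  header_filename: nil,             # no header/footer document
  asciimathdelims: ["{{", "}}"],    # opening and closing delimiter
  liststyles: { ol: "l1" },         # label of an @list rule in the stylesheet
}

# The conversion only runs once the gem has been loaded with
# require "html2doc"; without it this line is skipped.
Html2Doc.process(html, **options) if defined?(Html2Doc)
```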

Note that the local CSS stylesheet file contains a variable FILENAME for the location of footnote/endnote separators and headers/footers, which are provided in the header HTML file. The gem replaces FILENAME with the file name that the document will be saved as. If you supply your own stylesheet and also wish to use separators or headers/footers, you will likewise need to replace the document name mentioned in your stylesheet with a FILENAME string.
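The FILENAME substitution can be sketched in both directions. This is an assumption-laden illustration rather than the gem's code: the method names are hypothetical, and the CSS rule is a made-up sample of the kind of reference a Word-exported stylesheet contains.

```ruby
# Turn a stylesheet exported from Word into a template by replacing the
# exported document's name with the FILENAME placeholder.
# Note: this replaces every occurrence of the name, so check the result
# if the name is a common word.
def templatize_stylesheet(css, exported_name)
  css.gsub(exported_name, "FILENAME")
end

# The reverse step, as the gem is described as performing it: put the
# name of the document being generated in place of the placeholder.
def localize_stylesheet(css, filename)
  css.gsub("FILENAME", filename)
end

# Made-up sample rule from a document that was saved as "draft".
exported = 'div.WordSection1 { mso-footnote-separator: url("draft_files/header.html") fs; }'

template = templatize_stylesheet(exported, "draft")
puts template
puts localize_stylesheet(template, "report")
```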

We include a script in this distribution that processes files from the command line, optionally including header and stylesheet:

$ bin/html2doc --header header.html --stylesheet stylesheet.css filename.html

Caveats

HTML

The good news is that Word understands HTML.

The bad news is that Word’s understanding of HTML is HTML 4. In order for bookmarks to work, for example, this gem has to translate <p id=""> back down into <p><a name="">. Word (and this gem) will not do much with HTML 5-specific elements, and if you’re generating HTML for automated generation of Word documents, keep your HTML old-fashioned.
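The bookmark translation mentioned above can be illustrated with a deliberately simplified sketch. It is an assumption about the shape of the transformation, not the gem's implementation: it uses a regular expression rather than an HTML parser, handles only double-quoted id attributes on p elements, and the method name is hypothetical.

```ruby
# Replace the id attribute of each <p> with a named anchor inside the
# paragraph, which is the form Word can hyperlink to.
def ids_to_named_anchors(html)
  html.gsub(/<p\b([^>]*?)\s+id="([^"]+)"([^>]*)>/) do
    before, id, after = Regexp.last_match.captures
    %(<p#{before}#{after}><a name="#{id}"></a>)
  end
end

puts ids_to_named_anchors('<p class="note" id="n1">See below.</p>')
```

A real conversion should walk the parsed document instead, so that other elements and attribute quoting styles are covered.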

CSS

The good news with generating a Word document via HTML is that Word understands CSS, and you can determine much of what the Word document looks like by manipulating that CSS. That extends to features that are not part of HTML CSS: if you want to work out how to get Word to do something in CSS, save a Word document that already does what you want as HTML, and inspect the HTML and CSS you get.

The bad news is that Word’s implementation of CSS is poorly documented, even though Office HTML is described in a 1300-page reference (online at https://stigmortenmyre.no/mso/, https://www.rodriguezcommaj.com/assets/resources/microsoft-office-html-and-xml-reference.pdf), and the CSS selectors are only partially and selectively implemented. For list styles, for example, mso-level-text governs how the list label is displayed; but it is only recognised in a @list style: it is ignored in a CSS rule like ol li, or in a style attribute on a node. Working out the right CSS for what you want will take some trial and error, and you are better placed to try to do things Word’s way than the right way.

XSLT

This gem is published with an early draft of the XSLT stylesheet transforming MathML into OOXML, mml2omml.xsl, that has been published for several years now as part of the TEI stylesheet set. (We have made some further minor edits to the stylesheet.) The stylesheets have been published under a dual Creative Commons Sharealike/BSD licence.

The good news is that the stylesheet is not identical to the stylesheet mathml2omml.xsl that is published with Microsoft Word, so it can be, and has been, redistributed.

The bad news is that the stylesheet is not identical to the stylesheet mathml2omml.xsl that is published with Microsoft Word, so it isn’t guaranteed to have identical output. If you want to make sure that your MathML import is identical to what Word currently uses, replace mml2omml.xsl with mathml2omml.xsl, and edit the gem accordingly for your local installation. On Windows, you will find the stylesheet in the same directory as the winword.exe executable. On Mac, right-click on the Word application, and select "Show Package Contents"; you will find the stylesheet under Contents/Resources.

Lists

Natively, Word does not use <ol>, <ul>, or <dl> lists in its HTML exports at all: it uses paragraphs styled with list styles. If you save a Word document as HTML in order to use its CSS for Word documents generated by HTML, those styles will still work (with the caveat that you will need to extract the @list style specific to ordered and unordered lists, and pass it as a liststyles parameter to the conversion). However, Word applies a default indentation to all instances of <ol>, <ul> and <dl>, which the CSS stylesheet of a Word HTML will not have accounted for (because the Word HTML does not use lists at all.) If you are going to reuse that CSS for generating new documents using lists, you will need to add the rule margin-left:0pt to ul, ol, dl in the CSS stylesheet you supply, so that the margins in the Word-exported CSS remain correct.
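Preparing a Word-exported stylesheet for reuse, as described above, can be sketched in two small steps: listing the @list labels so you can pick the ones to pass as liststyles, and appending the margin rule. The method names and the sample CSS are assumptions for illustration; which label belongs to ordered and which to unordered lists still has to be decided by inspecting the rules.

```ruby
# Map each @list label to the mso-level-text of its first level.
def list_labels(css)
  css.scan(/@list\s+(\w+):level1\s*\{([^}]*)\}/).to_h do |label, body|
    [label, body[/mso-level-text:\s*([^;]+)/, 1]]
  end
end

# Append the rule that cancels Word's default indentation of HTML lists.
def zero_list_margins(css)
  css + "\nul, ol, dl { margin-left: 0pt; }\n"
end

# Made-up sample of two list rules from an exported stylesheet.
css = <<~'CSS'
  @list l1:level1 {mso-level-text:"%1\)";}
  @list l2:level1 {mso-level-number-format:bullet; mso-level-text:\F0B7;}
CSS

p list_labels(css)
liststyles = { ol: "l1", ul: "l2" }   # chosen by reading the output above
puts zero_list_margins(css)
```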

Math Positioning

By default, mathematical formulas that are the only content of their paragraph are rendered as centered in Word. If you want your AsciiMath or MathML to be left-aligned or right-aligned, add style="text-align:left" or style="text-align:right" to its ancestor div, p or td node in HTML.

Example

The spec/examples directory includes rice.doc and its source files: this Word document has been generated from rice.html through a call to html2doc from https://github.com/riboseinc/asciidoctor-iso. (The source document rice.html was itself generated from Asciidoc, rather than being hand-crafted.)