html2odt

This gem provides a Ruby wrapper around the set of XSLT stylesheets published as xhtml2odt.

html2odt vs. xhtml2odt

So, why is this project called html2odt while the original library and command line tools by Aurélien Bompard are called xhtml2odt?

This project uses nokogiri to parse the HTML and apply the XSLT transformations. Nokogiri's HTML parser is forgiving and accepts markup that is not strictly valid. Furthermore, the basic API expects HTML fragments, not full documents. Users of this library are not expected to pass in a complete, valid XHTML document; a reasonably well-formed piece of HTML should be good enough. Therefore the X was dropped from the name as well.

Installation

Add this line to your application's Gemfile:

gem 'html2odt'

And then execute:

$ bundle

Or install it yourself as:

$ gem install html2odt

Usage

Command line Usage

Usage: html2odt.rb [options] -i input.html -o output.odt
    -i, --input input.html
    -o, --output output.odt
    -t, --template <template.odt>    The file that should be filled with the input's content.
                                     Defaults to basic template file which is part of this gem.
    -r, --replace <KEYWORD>          A keyword in the template document to replace with the converted text.
                                     Defaults to `{{content}}`.
    -u, --url <URL>                  The remote URL you downloaded the page from.
                                     This is required to include remote images and to resolve links properly.
    -h, --help                       Show this message

Ruby API usage

# Create an Html2Odt::Document instance
doc = Html2Odt::Document.new

# Set the input HTML
doc.html = <<HTML
<h1>Hello, World!</h1>
<p>It works.</p>
HTML

# Set author and title
doc.author = "Jane Doe"
doc.title = "Example Document"


# Write ODT to disk
doc.write_to "demo.odt"

# Or get binary content as string
doc.data

Configuration options

html2odt comes with a basic template.odt, which is used as boilerplate to create the desired ODT file. If you would like to provide your own styles or additional content alongside the content added via the API, you may pass your own template to the Html2Odt::Document constructor.

Please note: If the template file cannot be read or if it does not appear to be a valid ODT file, an ArgumentError will be raised.

The template needs to contain an otherwise empty paragraph containing the string {{content}}.

# Provide optional template file
doc = Html2Odt::Document.new(template: "template.odt")

The HTML that should become part of the document may also be provided via the constructor:

# Provide HTML in constructor
doc = Html2Odt::Document.new(html: <<HTML)
  <h1>Hello, World!</h1>
  <p>It works.</p>
HTML

Furthermore, you may specify a base_uri, which will most likely be the location the original HTML fragment was taken from. The base_uri is used to convert links to fully qualified URLs, so that they still work when placed in the ODT document. It is also used to locate the sources of images found within the HTML fragment (see below for details).

# Provide base_uri
doc = Html2Odt::Document.new
doc.base_uri = "https://www.example.com"

You may also pass a URI instance directly.

# Provide base_uri as a URI instance
doc = Html2Odt::Document.new
doc.base_uri = URI.parse("https://www.example.com")

The URI is expected to refer to an http(s) location.
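
To illustrate what converting links to fully qualified URLs means in practice, here is a minimal sketch using only Ruby's standard `uri` library. The `absolutize` helper and the example URLs are hypothetical and are not part of html2odt; the gem's internal implementation may differ.

```ruby
require "uri"

# Hypothetical helper, not part of html2odt: resolves `href` against
# `base_uri`, which may be a String or a URI instance.
def absolutize(href, base_uri)
  URI.join(base_uri.to_s, href).to_s
end

base = URI.parse("https://www.example.com/blog/post.html")

# A relative link is resolved against the directory of the base document.
puts absolutize("comments.html", base)

# A link starting with / is resolved against the server root.
puts absolutize("/about", base)

# A link that is already fully qualified is left unchanged.
puts absolutize("https://other.example.org/page", base)
```

Without a base_uri there is nothing to resolve the first two links against, which is why setting it matters for fragments taken from a larger site.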

Image handling

html2odt provides basic image inlining, i.e. images referenced in the HTML code are embedded into the ODT file by default. This applies to images referenced with a full file://, http://, or https:// URL. Absolute paths (i.e. starting with /) and relative URLs are only supported if the base_uri option is set. Otherwise html2odt cannot know which server or document they relate to.

Images referencing an unsupported resource will be replaced with a link containing the alt text of the image.
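
The rules above can be sketched as a small decision function. This is an illustration of the described behaviour using Ruby's standard `uri` library, not html2odt's actual code; the `image_source` helper and the `DIRECT_SCHEMES` constant are hypothetical names, and treating every other URL scheme as unsupported is an assumption of this sketch.

```ruby
require "uri"

# Schemes the README lists as usable without a base_uri.
DIRECT_SCHEMES = %w[file http https].freeze

# Hypothetical helper, not part of html2odt: returns the URL an image
# could be fetched from, or nil if `src` cannot be resolved.
def image_source(src, base_uri = nil)
  uri = URI.parse(src)

  # Full file://, http:// or https:// URLs can be used as they are.
  return uri.to_s if DIRECT_SCHEMES.include?(uri.scheme)

  # Assumption of this sketch: any other scheme counts as unsupported.
  return nil if uri.scheme

  # Absolute paths and relative URLs need a base_uri to resolve against.
  return nil if base_uri.nil?

  URI.join(base_uri.to_s, src).to_s
rescue URI::InvalidURIError
  # A src that is not a parsable URI is treated as unsupported.
  nil
end

base = "https://www.example.com/blog/post.html"
p image_source("https://cdn.example.org/logo.png")
p image_source("/images/logo.png")
p image_source("/images/logo.png", base)
```

A nil result corresponds to the case where the image would be replaced with a link containing its alt text.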

If you are using html2odt in a web application context, you will probably want to provide some special handling for resources residing on your own server. This should be done for security reasons and to save roundtrips.

html2odt provides the following API to map image src attributes to local file locations.

# Provide custom mapping for image locations
doc = Html2Odt::Document.new

doc.image_location_mapping = lambda do |src|
  root = "/var/www/mywebsite/public"
  path = File.join(root, src)

  # File.realpath raises Errno::ENOENT if `path` does not exist in the file system.
  # Note: String#start_with? is plain Ruby; starts_with? requires ActiveSupport.
  valid = File.realpath(path).start_with?(root) rescue false

  valid ? path : nil
end

Registering an image_location_mapping callback will deactivate the default behaviour of including images with file and http URLs automatically.

Attention: Be careful! Without an image_location_mapping Proc, html2odt will include any local or remote image in the resulting ODT. This may cause all kinds of vulnerabilities and should only be used with well-known inputs. Registering an image_location_mapping callback deactivates this default behaviour, but please make sure that your custom code does not introduce path traversal vulnerabilities. Following the above example code should be a good start.
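
A plain string-prefix check against the root, as in the example above, also accepts sibling directories whose names merely begin with the root's name (for example `public-backup` next to `public`). The following is a slightly stricter sketch that compares against the root plus a trailing path separator. The `safe_image_path` helper and the directory names are hypothetical and not part of html2odt.

```ruby
require "tmpdir"
require "fileutils"

# Hypothetical helper, not part of html2odt: returns the resolved path of
# `src` if it lies inside `root`, otherwise nil.
def safe_image_path(root, src)
  real_root = File.realpath(root)
  real_path = File.realpath(File.join(real_root, src))

  # Appending the separator rejects siblings such as "public-backup".
  real_path.start_with?(real_root + File::SEPARATOR) ? real_path : nil
rescue SystemCallError
  # Covers Errno::ENOENT (missing file) and similar file system errors.
  nil
end

# Small demonstration in a throwaway directory.
Dir.mktmpdir do |dir|
  root = File.join(dir, "public")
  FileUtils.mkdir_p(File.join(root, "images"))
  FileUtils.mkdir_p(File.join(dir, "public-backup"))
  File.write(File.join(root, "images", "logo.png"), "")
  File.write(File.join(dir, "public-backup", "secret.png"), "")

  p safe_image_path(root, "/images/logo.png")            # path inside root
  p safe_image_path(root, "../public-backup/secret.png") # nil, outside root
  p safe_image_path(root, "/images/missing.png")         # nil, does not exist
end

# Possible wiring, using the callback API described above:
#   doc.image_location_mapping = lambda do |src|
#     safe_image_path("/var/www/mywebsite/public", src)
#   end
```

Returning the resolved path rather than the joined one means the file that was checked is also the file that gets read.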

License

Files within the xsl directory belong to the xhtml2odt project published by Aurélien Bompard (2009-2010) under the terms of the GNU LGPL v2.1 or later: http://www.gnu.org/licenses/lgpl-2.1.html

The remaining files are licensed under the terms of the MIT license.

Copyright (c) 2016 Gregor Schmidt - Planio GmbH, Berlin, Germany

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

Development

After checking out the repo, run bundle install to install dependencies. Then, run rake test to run the tests. You can also run bin/console for an interactive prompt that will allow you to experiment.

To install this gem onto your local machine, run bundle exec rake install. To release a new version, update the version number in version.rb, and then run bundle exec rake release, which will create a git tag for the version, push git commits and tags, and push the .gem file to rubygems.org.

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/planio-gmbh/html2odt. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the Contributor Covenant code of conduct.