PandocRefeqMathml - ad hoc tool to modify pandoc-converted MathML from LaTeX

Summary

This Ruby command-line tool modifies a MathML file that pandoc has converted from LaTeX.

Although pandoc is a great converter for text-based files, there are a few caveats, at the time of writing, in converting a LaTeX file to MathML.

A major caveat is that the generated MathML displays neither the equation numbers that LaTeX auto-generates by default for the equation and eqnarray environments nor their (LaTeX) labels. All the (LaTeX) references (ref) remain as raw label names, which read like coded messages to readers.

Another caveat is the alignment of equations in the eqnarray environment, which does not follow the standard eqnarray alignment.

This tool is a somewhat ad hoc (dirty) hack to correct these points *in some basic situations*. “Basic” here means just the standard LaTeX commands, not package-specific commands. Also, it may not correctly handle equations with complicated formats, such as those using arrays.

The full package of this module is found on the PandocRefeqMathml RubyGems page (with documentation generated from source annotations with yard) and on GitHub.

Background and constraints

Pandoc-converted MathML.html from LaTeX lacks the equation numbers that are present in the original LaTeX. This tool offers a very crude fix, adding equation numbers based on the annotation fields in <math> and the LaTeX aux file (which is automatically generated as a byproduct when you compile a LaTeX document). Not all the numbers are recovered, only those that are referenced somewhere in the MathML/LaTeX file.

(Note that in principle, it should not be too difficult to modify the program so that all the labelled equations in LaTeX are labelled again in MathML. However, it would be tricky to label equations that are not explicitly labelled in LaTeX because implicit numbering information is not available in the LaTeX aux file.)

The algorithm makes the following assumptions. The LaTeX aux file is in the standard format. The MathML has a link tag <a> with the attribute “data-reference-type=ref” and an href pointing to the exact label used in LaTeX (and the label must have no duplicates in the MathML). Each math tag contains the ‘annotation[ encoding=“application/x-tex”]’ tag holding the original LaTeX code. The LaTeX code must use either the standard “equation” or “eqnarray” environment together with the standard “label” command with simple content (if it contains, apart from the label string, anything more than leading or trailing white space, such as a comment, this algorithm will likely fail). If equations in an eqnarray environment have complicated nested structures like a matrix, I do not know how the algorithm of this routine handles them. Also, the LaTeX section numbering must consist only of Arabic numerals, full stops, and possibly capital letters (for an Appendix).
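
To make the aux-file assumption concrete, here is a minimal sketch of how labels and their numbers could be read from a standard aux file. It is an illustration only, not the gem's actual code: the method name parse_aux_labels and the sample data are hypothetical, and the sketch assumes the common \newlabel{label}{{number}{page}...} line format.

```ruby
# Hypothetical helper (not part of the gem): maps each \newlabel entry
# in the text of a LaTeX aux file to its number.
# Assumes the standard line format \newlabel{LABEL}{{NUMBER}{PAGE}...}.
def parse_aux_labels(aux_text)
  labels = {}
  aux_text.each_line do |line|
    # Numbers are restricted to Arabic numerals, full stops and capital
    # letters, mirroring the constraint stated above.
    m = /\A\\newlabel\{([^{}]+)\}\{\{([A-Z0-9.]+)\}/.match(line)
    labels[m[1]] = m[2] if m
  end
  labels
end

# Made-up sample aux content for illustration.
sample_aux = <<'AUX'
\relax
\newlabel{my_xe}{{1}{1}}
\newlabel{eq_trivial}{{2}{1}}
\newlabel{sec_app}{{A.1}{4}}
\newlabel{odd}{{ii}{5}}
AUX

labels = parse_aux_labels(sample_aux)
labels.each { |label, number| puts "#{label} => #{number}" }
```

The entry numbered “ii” is skipped because it falls outside the assumed numbering format. Note also that such a line does not say whether a label belongs to an equation, a section, or something else.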

Essentially, LaTeX has a huge amount of freedom and so I am afraid it would be a somewhat futile effort to deal with every possibility…

Output MathML by pandoc-2.19 converted from LaTeX

Ordinary LaTeX inline maths expressions (e.g., $5\pi$) are expressed as follows:

<math display="inline" xmlns="http://w..."><semantics>
 <mrow><mn>5</mn><mi>π</mi></mrow>
 <annotation encoding="application/x-tex">5\pi</annotation>
</semantics></math>

LaTeX’s \begin{equation} is expressed as follows (n.b., the <p> tag may not be closed immediately after </math>; other ordinary sentences may follow):

<p><math display="block" xmlns="http://w..."><semantics>
 <mrow><mi>x</mi><mo>±</mo><mi>ϵ</mi></mrow>
 <annotation encoding="application/x-tex">x \pm \epsilon \label{my_xe}</annotation>
</semantics></math>

LaTeX’s \begin{eqnarray} is expressed as follows:

<p><math display="block" xmlns="http://w..."><semantics><mtable>
 <mtr><mtd columnalign="right"><mrow><mn>1</mn><mo>+</mo><mi>x</mi></mrow></mtd>
      <mtd columnalign="left"><mo>=</mo></mtd>
      <mtd columnalign="right"><mrow><mn>1</mn><mo>−</mo><mi>x</mi></mrow></mtd></mtr>
 <mtr><mtd columnalign="right"></mtd>
      <mtd columnalign="left"><mo>=</mo></mtd>
      <mtd columnalign="right"><mfrac><mn>2</mn><mrow><mn>1</mn><mi>x</mi></mrow></mfrac></mtd></mtr>
 </mtable><annotation encoding="application/x-tex">\begin{aligned}
  1+x &amp; = &amp; 1-x \nonumber\\
      &amp; = &amp; \frac{2}{1x} \label{eq_trivial}
  \end{aligned}</annotation></semantics></math></p>

They are referenced from elsewhere in the text as follows:

<p>Eq.<a href="#eq_trivial" data-reference-type="ref"
   data-reference="eq_trivial">[eq_trivial]</a> was easy...

Algorithm

For fixing the alignments to follow the standard eqnarray alignments (right, centre, and left in this order), the program searches for <mtable> and rewrites the columnalign attributes in the <mtd> tags.
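
The alignment rewrite can be sketched as below. This is a simplified illustration, not the gem's implementation: the gem depends on Nokogiri, whereas this sketch uses only regular expressions to stay dependency-free. The name fix_eqnarray_align is hypothetical, and the sketch assumes flat rows (no <mtable> nested inside a cell).

```ruby
# Standard eqnarray column alignments, in column order.
ALIGNS = %w[right center left].freeze

# Hypothetical helper (not part of the gem): rewrites the columnalign
# attribute of the first three <mtd> cells in every <mtr> row.
# Assumes rows are flat, i.e. no <mtable> nested inside a cell.
def fix_eqnarray_align(mathml)
  mathml.gsub(%r{<mtr\b.*?</mtr>}m) do |row|
    col = -1
    row.gsub(/<mtd\b([^>]*)>/) do |tag|
      attrs = Regexp.last_match(1)
      col += 1
      next tag if col >= ALIGNS.size # leave any extra columns untouched
      # Drop the existing columnalign (if any) and keep other attributes.
      rest = attrs.sub(/\s*columnalign="[^"]*"/, '')
      %(<mtd columnalign="#{ALIGNS[col]}"#{rest}>)
    end
  end
end

sample = '<mtable><mtr><mtd columnalign="right"><mi>a</mi></mtd>' \
         '<mtd columnalign="left"><mo>=</mo></mtd>' \
         '<mtd columnalign="right"><mi>b</mi></mtd></mtr></mtable>'
fixed = fix_eqnarray_align(sample)
puts fixed
```

An XML parser is the more robust choice for real documents; the regular expressions here only serve to show which attributes change.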

For fixing the equation numbers and links, the program

  1. reads a LaTeX aux file and lists all the labels for equations and their numbers,

  2. picks up an internally-pointing HTML anchor,

  3. matches it with the list generated from the LaTeX aux file and identifies the equation number,

  4. searches the labels in <annotation> tags for the string identical to that of the HTML/MathML anchor,

  5. identifies the exact equation corresponding to the label (if in the eqnarray environment),

  6. inserts the identified equation number next to the MathML equation,

  7. and finally modifies the plain text for the HTML anchor.
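
Two of the steps above, locating the labelled row of an eqnarray (step 5) and rewriting the anchor text (step 7), can be sketched as follows. The method names, the sample data, and the choice to put the bare number in the anchor are assumptions made for illustration; the gem's actual code may differ.

```ruby
# Hypothetical helper: zero-based index of the eqnarray row whose LaTeX
# source (taken from the <annotation> text) contains \label{LABEL}.
# Returns nil when the label is absent.
def row_index_for_label(annotation_tex, label)
  rows = annotation_tex.split('\\\\') # rows are separated by LaTeX "\\"
  rows.index { |row| row.include?("\\label{#{label}}") }
end

# Hypothetical helper: replaces the "[label]" text of each reference
# anchor with its number from the labels hash. Anchors whose label is
# not in the hash are left unchanged.
def number_anchors(html, labels)
  html.gsub(%r{(<a\b[^>]*\bdata-reference="([^"]+)"[^>]*>).*?(</a>)}m) do
    whole = Regexp.last_match(0)
    open_tag = Regexp.last_match(1)
    number = labels[Regexp.last_match(2)]
    number ? "#{open_tag}#{number}#{Regexp.last_match(3)}" : whole
  end
end

tex = <<'TEX'
\begin{aligned}
  1+x & = & 1-x \nonumber\\
      & = & \frac{2}{1x} \label{eq_trivial}
\end{aligned}
TEX

html = <<'HTML'
<p>Eq.<a href="#eq_trivial" data-reference-type="ref"
   data-reference="eq_trivial">[eq_trivial]</a> was easy...
HTML

puts row_index_for_label(tex, 'eq_trivial')
puts number_anchors(html, { 'eq_trivial' => '2' })
```

Splitting on the row separator is only reliable for simple equations; a separator inside a nested structure such as a matrix would give a wrong row index, which matches the limitations noted earlier.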

Each equation number inserted next to its equation is enclosed in <mtext> tags. In an <mtable> (for a LaTeX eqnarray), it is inserted as a new <mtd> cell. In both cases, the text is right-aligned with some padding to the left. However, its position is relative to either the equation or, for a LaTeX eqnarray, the set of equations containing the relevant one. This differs from the original LaTeX, where equation numbers in parentheses are by default always placed at the right edge of the page.
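
For the <mtable> case, the insertion could look like the following minimal sketch. The helper name is hypothetical, the parentheses around the number are an assumption, and the left padding is omitted because the exact markup the gem emits is not shown here.

```ruby
# Hypothetical helper (not part of the gem): appends a right-aligned
# number cell to a single "<mtr>...</mtr>" string.
# Assumption: the number is wrapped in parentheses; padding is omitted.
def append_number_cell(mtr, number)
  cell = %(<mtd columnalign="right"><mtext>(#{number})</mtext></mtd>)
  idx = mtr.rindex('</mtr>')
  return mtr unless idx # not a complete row: leave it unchanged

  mtr.dup.insert(idx, cell)
end

row = '<mtr><mtd><mi>x</mi></mtd></mtr>'
numbered = append_number_cell(row, '2')
puts numbered
```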

How to use this command

Once you have installed it according to the standard RubyGems procedure (see section Install), the main Ruby executable (command) pandoc_refeq_mathml should be in your command-search path.

It reads a MathML file from either the first command-line argument or STDIN, together with a LaTeX aux file specified on the command line, and then outputs the modified (corrected) MathML to STDOUT.

Any warnings are printed to either STDERR or a log file specified with a command-line option.

A failure to match a label from an HTML tag with any of the MathML equations is printed as a warning (to STDERR by default). Although it may genuinely mean that the label does not exist in the original LaTeX source, it is far more likely that the label belongs to a section (or a table or figure), because the algorithm cannot tell the type (section, table, figure, equation, or something else) of each label’s origin.

Help doc

The help doc for the command-line interface is displayed with the -h (or --help) option:

% pandoc_refeq_mathml -h
Usage: pandoc_refeq_mathml [options] [--] [MathML.html] > STDOUT
       pandoc_refeq_mathml [options] [--] < STDIN > STDOUT

Description (Version=0.1):
   This fixes issues, label-references of equations and eqnarray alignments, of pandoc-converted MathML from LaTeX.

Specific options:
    -a, --aux [FILENAME]             (mandatory) LaTeX aux filename
        --log [FILENAME]             Log filename (Default: STDERR). /dev/null to disable it.
        --[no-]fixalign              Fix eqnarray-alignment problems? (Def: true)
    -v, --[no-]verbose               Run verbosely (Def: true)

Common options:
    -h, --help                       Show this message
        --version                    Show version

Examples

% pandoc_refeq_mathml --aux=mydoc.aux --log=error.log mydoc.html > revised1.html
% head -n 90 mydoc.html | pandoc_refeq_mathml --aux=mydoc.aux --no-fixalign > revised2.html

Install

The standard RubyGems install procedure suffices:

% gem install pandoc_refeq_mathml

which should also install Nokogiri, the gem it depends on.

Alternatively, you can download the library file lib/pandoc_refeq_mathml.rb to somewhere in your local directory, set the environment variable RUBYLIB to include the directory containing the library, and execute

% ruby bin/pandoc_refeq_mathml

where ruby is optional. Note that the Nokogiri gem must be available in your Ruby library path.

In the developer’s environment, the diff-lcs gem is also required.

This tool requires Ruby Version 2.0 or above.

Developer’s note

The source code is also maintained on GitHub, which has no intuitive interface for annotations.

Tests

The Ruby scripts under the directory test/ are the test scripts. You can run them from the top directory as ruby test/test_****.rb or simply run make test or rake test.

Known bugs and ToDo items

  • pandoc-generated HTML files do not contain Table/Figure numbers in their <caption>, even though each anchored text refers to the corresponding number, as in: see Table “2”, where “2” is the anchor.

  • In fact, pandoc-generated HTML files do not contain <figure> tags, let alone <figcaption>, for the LaTeX figure environments that contain more than one figure (with \includegraphics)…

Author

Masa Sakano < info a_t wisebabel dot com >

Versions

The versions of this package follow Semantic Versioning (2.0.0) semver.org/

License

MIT

Warranty

No warranty.