The MathML driver is similar to the LaTeX driver, but differs in some significant respects. Empty nodes are again translated to empty strings, and numbers are marked up with the <mn> tag. Name nodes are again mapped using a lookup table, but here we employ a translation table¹ that maps all of Adobe’s 4281 PDF characters to their corresponding Unicode values. This has the advantage that we should not come across any character that is not mapped. On the other hand, mapping to Unicode values, rather than to actual characters or commands as in LaTeX, loses information that could be useful for a future, more detailed semantic analysis. The result of this mapping is uniformly put between <mi> tags; thus operators, normally marked up by <mo> tags, are currently not distinguished. Distinguishing such operators could be achieved, naively, through the use of another lookup table. However, we believe a better solution is to attempt proper semantic markup through the use of an OpenMath driver, as we can then exploit the semantic knowledge given in content dictionaries.
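The lookup described above can be pictured with a minimal Python sketch. It is an illustration only, not Maxtract's code: the function names and the table variable are hypothetical, and the table holds just three entries (taken from the Adobe Glyph List) rather than the full 4281.

```python
# Hypothetical excerpt of a glyph-name -> Unicode code point table.
# The three entries follow the Adobe Glyph List; the table described
# in the text covers all 4281 names.
GLYPH_TO_UNICODE = {
    "alpha": 0x03B1,
    "summation": 0x2211,
    "plus": 0x002B,
}


def name_to_mathml(glyph_name: str) -> str:
    """Translate a Name node into an <mi> element (hypothetical helper).

    The Unicode value is emitted as a numeric character reference,
    so no character is ever left unmapped or unescaped.
    """
    code_point = GLYPH_TO_UNICODE[glyph_name]
    return f"<mi>&#x{code_point:04X};</mi>"


def number_to_mathml(digits: str) -> str:
    """Translate a number into an <mn> element (hypothetical helper)."""
    return f"<mn>{digits}</mn>"
```

Note that `name_to_mathml("plus")` also yields an `<mi>` element, which mirrors the limitation discussed above: operators are not yet distinguished from identifiers.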
We combine consecutive Linear nodes recursively to put them into a single <mrow> tag. Div nodes are translated into <mfrac> tags, and Sub, Super and Supersub nodes are mapped to the MathML environments <msub>, <msup>, and <msubsup>, respectively. Over and Under nodes are translated to <mover> and <munder> tags, where we set the parameter accent to true for the former and false for the latter. As opposed to the LaTeX driver, in MathML we have to explicitly sort out nested over and under expressions in order to put them into <munderover>. Similarly, Limit nodes are mapped to <munderover> environments rather than represented as sub- and superscripts.
¹ The translation table is based upon the Adobe Glyph List from http://partners.adobe.com/
For Functor nodes, we currently handle only root symbols, which are mapped either to <msqrt> or, if the expression is combined with an additional Sup node, to <mroot>. In the latter case, the Sup node is taken as the index value. Again, this analysis is not necessary in the LaTeX case, as it is handled automatically by LaTeX’s conventions.
Finally, Case, Matrix, and Multiline nodes are all handled by <mtable> environments. For Multiline nodes, the alignment is achieved by using MathML’s special alignment tags <maligngroup/>.
CHAPTER 7
IMPROVEMENTS
In the previous chapter, we described our implementation of Maxtract, a mathematical formula recogniser for PDF documents that takes advantage of the character information within the document to obtain higher quality formula recognition than is typically possible using standard OCR techniques. As can be seen in Section 8.1, Maxtract produces results which, although of high quality, can still be improved by a more careful analysis of spacing and by utilising the font information available from the PDF document. In this chapter, we discuss the aspects of Maxtract that we have been able to improve and describe how we have done so. Two significant improvements are described. The first, covered in Section 7.1, is the use of font and spacing information obtained from the PDF to improve the appearance of the generated code and to aid spatial and semantic analysis. The second, covered in Section 7.2, is the introduction of automated segmentation of mathematical formulae together with basic layout analysis.
7.1 Using Fonts and Spacing
As shown in Section 8.1, whilst the spatial relationships are generally identified correctly by Maxtract, the presentation of the output is often incorrect, in particular the choice of typeface and the spacing between characters. However, by extracting font and spacing information for characters from the PDF document, and then making use of this extra information during the parsing of expressions and the generation of output, we were able to address the following issues:
• Some alphabetic fonts, e.g. Blackboard Bold, Calligraphic and Fraktur, which were previously only recognised as standard Math Roman, are now correctly reproduced.
• Spacing was sometimes incorrect: large spaces were not appropriately recognised, and subtle differences in spacing between certain components of a formula were not faithfully reproduced, for example when the meaning the authors intended for a symbol differed from that assumed by default in LaTeX (or MathML). Such spacing is now recognised and compared to the spacing rules used by TeX. The result not only significantly improves the aesthetic quality of the reproduced formula, but is also used to guide the semantic interpretation of the expression.
• Function names such as “sin”, “cos” or “det” were previously recognised via a lookup table. This led to problems with function names that were not in the table or with strings of variables in a mathematical formula that did not represent a function but which happened to match a function name. This table lookup approach has now been replaced with a much more robust method based on character spacing and fonts.
• Interspersed text, i.e. normal text within a mathematical expression, was not correctly recognised and would come out as pseudo-mathematics. Using a method similar to that developed for function names, such text is now correctly recognised and processed.
The remainder of this section describes these improvements in detail; the changes are evaluated in Section 8.2.
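To give a first intuition for how font and spacing information might replace a lookup table for function names and interspersed text, consider the following Python sketch. It is purely illustrative and should not be read as Maxtract's method: the `Glyph` record, the font-name prefix test (Computer Modern names such as CMR10 for upright roman and CMMI10 for math italic) and the gap threshold are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    font: str   # font name as reported by the PDF, e.g. "CMR10"
    x0: float   # left edge of the glyph
    x1: float   # right edge of the glyph


# Assumption: upright text glyphs come from fonts with these name prefixes.
UPRIGHT_PREFIXES = ("CMR",)
# Assumption: upright glyphs closer than this (in points) belong to one word.
MAX_GAP = 0.5


def group_upright_runs(glyphs):
    """Merge tightly spaced upright glyphs into words; keep others single."""
    tokens = []
    run = []  # glyphs of the upright run currently being built
    for g in glyphs:
        upright = g.font.startswith(UPRIGHT_PREFIXES)
        if run and upright and g.x0 - run[-1].x1 <= MAX_GAP:
            run.append(g)
            continue
        if run:
            # The current run ends here; emit it as one token.
            tokens.append("".join(r.char for r in run))
            run = []
        if upright:
            run = [g]
        else:
            tokens.append(g.char)
    if run:
        tokens.append("".join(r.char for r in run))
    return tokens
```

Under these assumptions, the upright letters s, i, n set tightly together are grouped into the single token "sin", whereas the same letters in math italic remain three separate variables, regardless of whether "sin" appears in any table.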