Cleanup of ooxmltok in LibreOffice

Posted on: Thu 28 August 2014


In June, we decided to get rid of XSLT usage in writerfilter, the module responsible for RTF and DOCX import in LibreOffice. As usual with cleaning up a mess, this took time (about two months), but I’m now happy to say that I’m mostly done with it. :-)

See the doctok blog post for some background. The topic here was cleaning up the OOXML tokenizer, that is, the building block that turns a zipped XML document into a token stream.

The following problems are now solved:

Part of the module was generated code. The generator was implemented mostly in XSLT, but some bits were written in Perl and sed. About 4200 lines of XSLT code are now rewritten in about 1300 lines of Python.
Given that we have many more developers who speak Python than XSLT, nontrivial changes to the generator are now much easier: Jan Holesovsky cleaned up boost::unordered_map usage at places where we depended on the order of elements. (Yes, you read that correctly, that was the situation up till now!) This also helps reduce the size of the resulting writerfilter shared library.
The input of the code generator was the large model.xml file, and the generator scripts only extracted the information they were interested in, so if you mistyped something, you got no error messages, just silent failures. I’ve removed quite a few XML elements and attributes that none of the generator scripts parsed, and written a RELAX NG schema for the remaining markup. Validating against this schema is part of the default build, so no more typos without a build failure. ;-) (The schema also contains quite a bit of documentation, finally.)
A gperf hash of all possible OOXML element / attribute names was duplicated in writerfilter, even though that information was already available from the oox module. This is now fixed, reducing the size of the shared library even further.
Also, both oox and writerfilter had a list of namespace URLs, mapping them to an integer enumeration, and when the two lists didn’t match, Bad Things happened (read: it usually resulted in a crash). This is now in the past: I’ve refactored writerfilter to use the same namespace alias names as oox, which allowed getting rid of the writerfilter copy of the namespace alias list. So in the future, if new namespaces have to be added, only oox has to be extended.
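To illustrate the difference between silently ignoring unknown markup and failing on it, here is a minimal Python sketch. It is not the actual RELAX NG schema or build rule: it swaps in a plain allow-list check using only the standard library, and the element and attribute names (model, namespace, resource, name, kind) are made up for the example rather than taken from model.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical allow-list: element name -> permitted attribute names.
# These names are invented for illustration, not the real model.xml vocabulary.
ALLOWED = {
    "model": set(),
    "namespace": {"name"},
    "resource": {"name", "kind"},
}


def validate(xml_text):
    """Return a list of error strings; an empty list means the input is accepted."""
    errors = []
    root = ET.fromstring(xml_text)
    for elem in root.iter():  # iter() visits the root element as well
        if elem.tag not in ALLOWED:
            errors.append(f"unknown element <{elem.tag}>")
            continue
        for attr in elem.attrib:
            if attr not in ALLOWED[elem.tag]:
                errors.append(f"unknown attribute '{attr}' on <{elem.tag}>")
    return errors


good = '<model><namespace name="w"><resource name="Example" kind="Properties"/></namespace></model>'
typo = '<model><namespace name="w"><resource nmae="Example" kind="Properties"/></namespace></model>'

print(validate(good))
print(validate(typo))
```

A generator that only looks up the attributes it knows about would see the second document as a resource without a name and carry on. A strict check turns the same typo into an error that can fail the build.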

Oh, and the bonus feature: I’ve implemented a script called watch-generated-code.sh, which can record a good state of the generated code and then compare later generated results against it. Refactoring of the generator can now be performed in a safe way: you can change the generator in any way to make it better, and still avoid accidental output changes. :-) This is particularly useful, as it only diffs the end result of the whole generation process (cxx and hxx files), not temporary files, which are OK to change as long as the end result is the same.
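The record-then-compare idea can be sketched in a few lines of Python. This is not the real watch-generated-code.sh: it compares SHA-256 digests instead of producing a diff, and the file names (factory.cxx, factory.hxx, intermediate.tmp) and helper functions are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path

SUFFIXES = {".cxx", ".hxx"}  # only the final outputs are tracked


def snapshot(directory):
    """Map each tracked file (relative path) to a SHA-256 digest of its contents."""
    root = Path(directory)
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file() and path.suffix in SUFFIXES
    }


def changed_files(recorded, current):
    """Return the sorted names of files that were added, removed, or modified."""
    names = set(recorded) | set(current)
    return sorted(n for n in names if recorded.get(n) != current.get(n))


with tempfile.TemporaryDirectory() as workdir:
    out = Path(workdir)
    # Hypothetical generator output: two final files and one temporary file.
    (out / "factory.cxx").write_text("int f() { return 1; }\n")
    (out / "factory.hxx").write_text("int f();\n")
    (out / "intermediate.tmp").write_text("first run\n")

    good_state = snapshot(out)  # record the known-good state

    # A generator change that only affects a temporary file is not reported.
    (out / "intermediate.tmp").write_text("second run\n")
    only_tmp_changed = changed_files(good_state, snapshot(out))

    # An accidental change to a final output is reported.
    (out / "factory.cxx").write_text("int f() { return 2; }\n")
    output_changed = changed_files(good_state, snapshot(out))

print(only_tmp_changed)
print(output_changed)
```

Filtering by suffix is what makes intermediate files free to change: only the files the compiler eventually sees take part in the comparison.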

As a conclusion, here are the sizes of a stripped dbgutil build of the writerfilter shared library, from the libreoffice-4-3-branch-point and from today’s master:

$ git checkout oldest
HEAD is now at b3130c8... 2014-05-21
vmiklos@o9010:~/git/libreoffice/daily$ ls -lh opt/program/libwriterfilterlo.so
-rwxr-xr-x 1 vmiklos users 8,3M aug   28 14:00 opt/program/libwriterfilterlo.so
$ git checkout master
Switched to branch 'master'
vmiklos@o9010:~/git/libreoffice/daily$ ls -lh opt/program/libwriterfilterlo.so
-rwxr-xr-x 1 vmiklos users 6,1M aug   28 14:01 opt/program/libwriterfilterlo.so

Again, the 8,3MB → 6,1MB size reduction is mostly thanks to Kendy’s map cleanups + the duplicated gperf hash going away. :-)

Category: libreoffice – Tags: en