In June, we decided to get rid of XSLT usage in writerfilter, the module
responsible for RTF and DOCX import in LibreOffice. As usual when cleaning up a
mess, this took time
(about two months), but I’m now happy to say that I’m mostly done with this.
:-)
See the doctok blog post for some background. The topic here was cleaning up
the OOXML tokenizer, that is, the building block that turns a zipped XML
document into a token stream.
The following problems are now solved:
-
Part of the module was generated code. The generator was implemented mostly
in XSLT, but some bits were written in Perl and sed. About 4200 lines of
XSLT code are now rewritten in about 1300 lines of Python.
-
Given that we have many more developers who speak Python than XSLT,
nontrivial changes to the generator are now much easier: Jan Holesovsky
cleaned up boost::unordered_map usage in places where we depended on the
order of elements. (Yes, you read that correctly, that was the situation up
till now!) This also helps reduce the size of the resulting writerfilter
shared library.
-
The input of the code generator was the large model.xml file, and the
generator scripts only extracted the interesting information from it, so if you
mistyped something, you got no error messages, just silent failures. I’ve
removed quite a few XML elements and attributes from it that were parsed by
none of the generator scripts, and written a RELAX NG schema for the
remaining markup. Validating against this schema is part of
the default build, so no more typos without a build failure. ;-)
(The schema also contains quite a bit of documentation, finally.)
-
A gperf hash of all possible OOXML element / attribute names was
duplicated in writerfilter, even though that information was already available
from the oox module. This is now fixed, reducing the size of the shared
library even further.
-
Also, both oox and writerfilter had a list of namespace URLs, mapping them
to an integer enumeration, and when the two lists didn’t match, Bad Things
happened (read: it usually resulted in a crash). This is now in the past: I’ve
refactored writerfilter to use the same namespace alias names as oox, which
made it possible to get rid of the writerfilter copy of the namespace alias
list. So in the future, if new namespaces have to be added, only oox has to be
extended.
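To illustrate the silent-failure problem from the model.xml item above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the markup is a made-up stand-in and not the real model.xml format, and the hand-written allow-list merely stands in for the RELAX NG schema that the actual build validates against.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for a model file; the real model.xml
# uses different markup. Note the typo: "nmae" instead of "name".
MODEL = """
<model>
  <element name="body"/>
  <element nmae="paragraph"/>
</model>
"""

# Attributes each element may carry (assumed for this sketch only).
ALLOWED = {"model": set(), "element": {"name"}}


def lenient_names(xml_text):
    """Extract only what looks interesting; typos are silently skipped."""
    root = ET.fromstring(xml_text)
    return [e.get("name") for e in root.iter("element") if e.get("name")]


def strict_check(xml_text):
    """Return a list of problems instead of ignoring unknown markup."""
    problems = []
    for node in ET.fromstring(xml_text).iter():
        if node.tag not in ALLOWED:
            problems.append(f"unknown element <{node.tag}>")
            continue
        for attr in node.attrib:
            if attr not in ALLOWED[node.tag]:
                problems.append(f"unknown attribute '{attr}' on <{node.tag}>")
    return problems
```

The lenient extractor simply drops the mistyped entry, which is the kind of silent failure described above, while the strict check reports it, so a build step could fail on a non-empty result.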
Oh, and the bonus feature: I’ve implemented a script called
watch-generated-code.sh,
which can record a good state of the generated code, and then compare later
generated results against that, so that refactoring of the generator can now be
performed in a safe way: you can change the generator in any way to make it
better, and still avoid accidental output changes. :-) This is particularly
useful, as it only diffs the end result of the whole generation process (cxx
and hxx files), not temporary files, which are OK to change, as long as the
end result is the same.
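The record-then-compare idea can be sketched in a few lines of Python. This is not the actual watch-generated-code.sh (which is a shell script whose internals are not shown here); the function names, the JSON state file, and the use of SHA-256 digests instead of a textual diff are all assumptions made for this sketch.

```python
import hashlib
import json
from pathlib import Path

# Only the final generated sources matter; intermediate files are ignored.
SUFFIXES = {".cxx", ".hxx"}


def snapshot(gen_dir):
    """Map each generated source file (relative path) to its SHA-256 digest."""
    root = Path(gen_dir)
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file() and p.suffix in SUFFIXES
    }


def record(gen_dir, state_file):
    """Store the current output as the known-good state."""
    Path(state_file).write_text(json.dumps(snapshot(gen_dir), indent=2))


def compare(gen_dir, state_file):
    """Return sorted paths that were added, removed, or changed since record()."""
    good = json.loads(Path(state_file).read_text())
    now = snapshot(gen_dir)
    return sorted(k for k in good.keys() | now.keys() if good.get(k) != now.get(k))
```

With something like this, one would call record() once on a known-good build, refactor the generator, rebuild, and then expect compare() to return an empty list.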
As a conclusion, here are the sizes of a stripped dbgutil version of the
writerfilter shared library, from the libreoffice-4-3-branch-point and today’s
master:
$ git checkout oldest
HEAD is now at b3130c8... 2014-05-21
vmiklos@o9010:~/git/libreoffice/daily$ ls -lh opt/program/libwriterfilterlo.so
-rwxr-xr-x 1 vmiklos users 8,3M aug 28 14:00 opt/program/libwriterfilterlo.so
$ git checkout master
Switched to branch 'master'
vmiklos@o9010:~/git/libreoffice/daily$ ls -lh opt/program/libwriterfilterlo.so
-rwxr-xr-x 1 vmiklos users 6,1M aug 28 14:01 opt/program/libwriterfilterlo.so
Again, the 8,3MB → 6,1MB size reduction is mostly thanks to Kendy’s map cleanups and the
duplicated gperf hash going away. :-)