Index ¦ Archives ¦ RSS

Optimizing ODT ↔ XHTML conversion performance for simple documents

Estimated read time: 2 minutes

I worked on improving the ODT ↔ XHTML conversion performance for simple documents in LibreOffice recently. First, thanks to Vector for funding Collabora to make this possible.

ODT → XHTML conversion

https://farm5.staticflickr.com/4605/26697712598_2ace3f45a3_o.png

The focus here was really simple documents, like just one sentence with minimal formatting. The use-case is to have thousands of these simple documents, only a minority containing complex formatting, the rest is just that simple.

Performance work usually focuses on one specific complex feature, e.g. lots of bookmarks, lots of document-level user-defined metadata, and so on — this way there were room for improvements when it comes to trivial documents.

I managed to reduce the cost of the conversion to the fifth of the original cost in both directions — the chart above shows the impact of my work for the ODT → XHTML direction. The steps that helped:

  • Recognize XHTML as a value for the FilterOptions key in the HTML (StarWriter) export filter, this way avoid the need to go via XSLT, which would be expensive.

  • Add a new NoFileSync flag to the frame::XStorable::storeToURL() API, so that if you know you’ll read the result after the conversion finished, you can avoid an expensive fsync() call for each and every file, which helps HDDs a lot, while means no overhead for SSDs.

  • If you know your input format already, then specifying an explicit FilterName key for the frame::XComponentLoader::loadComponentFromURL() API helps not spending time to detect the file format you already know.

Note that the XHTML mode for the Writer HTML export is still a work in progress, but it already produces valid output for such simple documents.

XHTML → ODT conversion

https://farm5.staticflickr.com/4608/39674632615_de78265c7f_o.png

The chart above shows the results of my work for the XHTML → ODT direction. The steps to get to the final reduced cost were:

  • The new NoFileSync flag, as mentioned previously.

  • A new NoThumbnail flag, which is useful if the ODT will be part of a next step in the pipeline and you know that the thumbnail image won’t be used anyway.

  • The default table autoformat definitions in Writer are now lazy-loaded. (This is my favorite one, you don’t have to opt-in for this, so everyone benefits.)

  • A new HiddenForConversion flag for frame::XComponentLoader::loadComponentFromURL(), which means we don’t lay out the UI elements (toolbars, sidebar, status bar, etc.) when we know the purpose of the document load is only to save the document model in an other format.

All this is available in master (towards LibreOffice 6.1), or you can grab a daily build and try it out right now. :-)

© Miklos Vajna. Built using Pelican. Theme by Giulio Fidente on github.