
Optimizing ODT ↔ XHTML conversion performance for simple documents

Estimated read time: 2 minutes

I worked on improving the ODT ↔ XHTML conversion performance for simple documents in LibreOffice recently. First, thanks to Vector for funding Collabora to make this possible.

ODT → XHTML conversion

https://farm5.staticflickr.com/4605/26697712598_2ace3f45a3_o.png

The focus here was on really simple documents, such as a single sentence with minimal formatting. The use case is thousands of such documents, where only a minority contain complex formatting and the rest are that simple.

Performance work usually focuses on one specific complex feature, e.g. lots of bookmarks, lots of document-level user-defined metadata, and so on. As a result, there was still room for improvement when it comes to trivial documents.

I managed to reduce the cost of the conversion to a fifth of the original cost in both directions; the chart above shows the impact of my work for the ODT → XHTML direction. The steps that helped:

  • Recognize XHTML as a value for the FilterOptions key in the HTML (StarWriter) export filter. This avoids the need to go via XSLT, which would be expensive.

  • Add a new NoFileSync flag to the frame::XStorable::storeToURL() API. If you know you’ll read the result right after the conversion finishes, you can avoid an expensive fsync() call for each and every file, which helps HDDs a lot and adds no overhead for SSDs.

  • If you already know your input format, specify an explicit FilterName key for the frame::XComponentLoader::loadComponentFromURL() API, so no time is spent on detecting a file format you already know.
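
The three steps above boil down to a handful of named arguments for the load and the store call. Here is a minimal sketch of those arguments as plain Python dictionaries; it does not talk to LibreOffice at all. The key names and the XHTML value come from the list above, while "writer8" as the ODT filter name and the helper function names are assumptions made for illustration. With PyUNO, each pair would typically become one com.sun.star.beans.PropertyValue.

```python
# Sketch only: plain dicts standing in for UNO PropertyValue sequences.


def odt_to_xhtml_args():
    """Return (load_args, store_args) for an ODT -> XHTML conversion."""
    load_args = {
        # Skip format detection, the input is known to be ODT.
        # "writer8" as the ODT filter name is an assumption.
        "FilterName": "writer8",
    }
    store_args = {
        # Writer's own HTML export, switched to XHTML mode (no XSLT).
        "FilterName": "HTML (StarWriter)",
        "FilterOptions": "XHTML",
        # Ask the export to skip the fsync() for the output file.
        "NoFileSync": True,
    }
    return load_args, store_args


def as_pairs(args):
    """Flatten a dict to (name, value) pairs sorted by name."""
    return sorted(args.items())


load_args, store_args = odt_to_xhtml_args()
for name, value in as_pairs(store_args):
    print(f"{name} = {value!r}")
```

Keeping the arguments in one place like this makes it easy to see which flags are opt-in: leaving out NoFileSync restores the default behavior.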

Note that the XHTML mode for the Writer HTML export is still a work in progress, but it already produces valid output for such simple documents.

XHTML → ODT conversion

https://farm5.staticflickr.com/4608/39674632615_de78265c7f_o.png

The chart above shows the results of my work for the XHTML → ODT direction. The steps to get to the final reduced cost were:

  • The new NoFileSync flag, as mentioned previously.

  • A new NoThumbnail flag, which is useful if the ODT will be part of a next step in the pipeline and you know that the thumbnail image won’t be used anyway.

  • The default table autoformat definitions in Writer are now lazy-loaded. (This is my favorite one, you don’t have to opt-in for this, so everyone benefits.)

  • A new HiddenForConversion flag for frame::XComponentLoader::loadComponentFromURL(), which means we don’t lay out the UI elements (toolbars, sidebar, status bar, etc.) when we know the purpose of the document load is only to save the document model in another format.
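
The lazy-loading item in the list above is a general technique, so here is a generic sketch of it in Python. This is not the Writer implementation; the class, the loader function and the definition names are all made up for illustration. The point is only that the expensive load happens on first use, so a plain conversion that never touches the definitions never pays for them.

```python
class AutoFormatTable:
    """Hypothetical stand-in for a container of table autoformat definitions."""

    def __init__(self, loader):
        self._loader = loader      # expensive function, called at most once
        self._definitions = None   # nothing loaded yet

    @property
    def definitions(self):
        # Load on first access only; later accesses reuse the cached result.
        if self._definitions is None:
            self._definitions = self._loader()
        return self._definitions


load_calls = []


def load_defaults():
    """Pretend to do the expensive load and record that it happened."""
    load_calls.append(1)
    return ["Style A", "Style B"]  # made-up names for illustration


table = AutoFormatTable(load_defaults)
calls_before_use = len(load_calls)  # a simple conversion would stop here
first = table.definitions
second = table.definitions
print(calls_before_use, len(load_calls))
```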

All this is available in master (towards LibreOffice 6.1), or you can grab a daily build and try it out right now. :-)


EPUB export in LibreOffice Writer FOSDEM talk

Estimated read time: 1 minute

Yesterday I gave a talk titled EPUB export in LibreOffice Writer at FOSDEM 2018, in the Open document editors developer room. The room was quite crowded, perhaps because the next talk was about LibreOffice/Collabora Online. ;-)

I expect quite a few other slides will be available on Planet, don’t miss them.


EPUB3 export improvements in LibreOffice Writer, take two

Estimated read time: 3 minutes

I recently worked on further improving the EPUB3 export filter in LibreOffice. First, thanks to Nou&Off in cooperation with a customer who made this work possible. Since the previous blog entry there have been a number of improvements around a new set of topics.

Cover images

https://farm5.staticflickr.com/4760/38920770224_b247fa89c4_o.png

It is now possible to specify a cover image for the exported EPUB file. Given that a cover image is not naturally part of the Writer document model, I introduced the concept of a media directory for the EPUB export. The media directory is a directory next to the source file, named <file name without extension>. If that directory contains a file named cover.svg (or .gif, .jpg, .png), the exporter will automatically use it. If that default does not suit you, you can customize it.
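
The default cover lookup described above can be sketched in a few lines of Python. This mirrors the naming rule from the text, not the actual exporter code: the function names are hypothetical, and the order in which the extensions are tried is an assumption.

```python
from pathlib import Path
import tempfile

# Extensions named in the text; the search order is an assumption.
COVER_EXTENSIONS = (".svg", ".gif", ".jpg", ".png")


def media_directory(source):
    """Directory next to the source file, named after it minus the extension."""
    source = Path(source)
    return source.parent / source.stem


def find_cover(source):
    """Return the default cover image path, or None if there is none."""
    media_dir = media_directory(source)
    for extension in COVER_EXTENSIONS:
        candidate = media_dir / f"cover{extension}"
        if candidate.is_file():
            return candidate
    return None


with tempfile.TemporaryDirectory() as tmp:
    book = Path(tmp) / "book.odt"
    book.touch()
    no_cover = find_cover(book)           # no media directory yet
    media_directory(book).mkdir()
    (media_directory(book) / "cover.png").touch()
    cover_name = find_cover(book).name    # now the default is picked up
print(no_cover, cover_name)
```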

The picture shows two EPUB files in Readium with different cover images.

Improved metadata support

https://farm5.staticflickr.com/4603/38920770174_142950782e_o.png

It’s quite frequent that you are technically the author of a document, but the logical author of the book is somebody else. The same goes for the date of the book, and so on. So the EPUB export dialog now supports overwriting the defaults coming from the Writer document model. For mass conversion of documents, it’s possible to place a <file name without extension>.xmp file in the media directory, and XMP metadata from that file will also overwrite metadata coming from the document model.
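
To make the XMP override more concrete, here is a rough sketch of the idea in Python: read a title and an author from an XMP packet, then let them overwrite the values coming from the document. The sample packet, the choice of the Dublin Core dc:title and dc:creator properties, and the function names are assumptions for illustration; which properties the exporter actually reads is not covered here.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Made-up sample packet, as it could appear in <file name>.xmp.
SAMPLE_XMP = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title><rdf:Alt><rdf:li xml:lang="x-default">The Real Title</rdf:li></rdf:Alt></dc:title>
      <dc:creator><rdf:Seq><rdf:li>Logical Author</rdf:li></rdf:Seq></dc:creator>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>"""


def read_xmp(xmp_text):
    """Extract the first dc:title and dc:creator entries, if present."""
    root = ET.fromstring(xmp_text)
    found = {}
    for key, tag in (("title", "dc:title"), ("author", "dc:creator")):
        element = root.find(f".//{tag}", NS)
        item = element.find(".//rdf:li", NS) if element is not None else None
        if item is not None and item.text:
            found[key] = item.text
    return found


def effective_metadata(document_metadata, xmp_text):
    """Start from the document model values, then let XMP overwrite them."""
    merged = dict(document_metadata)
    merged.update(read_xmp(xmp_text))
    return merged


from_document = {"title": "draft-final-v3", "author": "Typist", "language": "en"}
result = effective_metadata(from_document, SAMPLE_XMP)
print(result)
```

Values that the XMP file does not mention (the language in this sample) keep coming from the document.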

The picture shows the extended EPUB export options dialog.

Footnotes and image popups

https://farm5.staticflickr.com/4612/38920770144_e90e2a8e92_o.png

I’ve added support for footnotes. As a special case of this, image popups on images and text are now supported. This works by placing a relative link on a text portion or on an image, and placing an image with the same name (e.g. in high resolution) in the media directory. In this case the EPUB export will bundle the image from the media directory inside the EPUB file and clicking on the text or image will open the bundled image in a popup (or in some other container, depending on how your reader interprets footnotes).

The picture shows such a popup in Microsoft Edge.

Fixed layout

https://farm5.staticflickr.com/4604/38920770104_108465bda1_o.png

The EPUB3 fixed layout is quite similar to PDF, except that it is built on top of XHTML and SVG. Possible use cases for this are:

  • exporting a document where presenting the content as reflowable text would be misleading (e.g. comic books), but the publisher of the book only works with EPUB (reflowable or fixed layout, but no PDF)

  • printing (again, in case for some reason you want to avoid PDF)

These might be very specific situations, but luckily supporting them is not too complex. I implemented an approach very similar to the PDF export, where we export individual pages of the Writer document’s layout as a metafile, and then consume that, this time with the SVG export. Building on top of the existing Writer layout and SVG export means the hard work is really done by these components; the EPUB fixed layout export just puts them together.

The picture shows a Writer document with a table of contents containing page numbers, a header and a footer in Readium.

All this is available in master (towards LibreOffice 6.1), or you can grab a daily build and try it out right now. :-)


EPUB3 export improvements in LibreOffice Writer

Estimated read time: 2 minutes

I worked on improving the EPUB3 export filter in LibreOffice recently. First, thanks to Nou&Off in cooperation with a customer who made this work possible. Since the previous blog entry there have been a number of improvements around 4 topics.

https://farm5.staticflickr.com/4540/38847800651_d5271ced3a_o.png

The character properties of link text are now handled correctly. In the above example you can see that the text is red, and this comes from a character style.

Improved table support

Previously the support for tables was there just to not lose content; now all kinds of cell, row and table properties are handled correctly. A few samples:

  • custom cell width:

https://farm5.staticflickr.com/4566/38847800611_38b8483d7f_o.png
  • custom row height:

https://farm5.staticflickr.com/4580/38847800521_26285a9152_o.png
  • row span:

https://farm5.staticflickr.com/4540/38847800461_359651bc3d_o.png

So the table support should now be decent, covering row and column spanning and various cell border properties.

Improved image support

Previously only the simplest as-character anchoring was supported. Now many more cases are handled. Two examples:

  • image borders:

https://farm5.staticflickr.com/4541/24975193838_94818bd1ed_o.png
  • image with a caption:

https://farm5.staticflickr.com/4568/24975193608_83239bf287_o.png

This includes various wrap types (to the extent HTML5 allows representing ODF wrap types).

Font embedding

If the user chooses to embed fonts (via File → Properties → Font → Embed), then the EPUB export now handles this. Here is a custom font that is typically not available:

https://farm5.staticflickr.com/4561/38847800811_613d6fbbd2_o.png

(The screenshot is from the Calibre ebook reader.)

All this is available in master (towards LibreOffice 6.1), or you can grab a daily build and try it out right now. :-)


Basic EPUB3 export in LibreOffice

Estimated read time: 2 minutes

https://farm5.staticflickr.com/4577/37588898064_117dc4a933_o_d.png

I worked on a new EPUB3 export filter in LibreOffice recently. First, thanks to Nou&Off in cooperation with a customer who made this work possible. The current state is that basic features work nicely, to the extent that the filter is probably usable for most books (they typically have mostly just text with minimal formatting), so this post aims to explain the architecture: how the various pieces fit together.

The above picture shows the building blocks. The idea is that nominally EPUB is a complete export filter, but instead of doing all the work, we offload various sub-tasks to other modules:

  • First we invoke the existing (flat) ODT export, so we can work with ODF instead of with the UNO API directly. This will be useful in the next step.

  • Then we feed the SAX events from the ODT export to a new librevenge text export. Given that the librevenge API is really close to ODF (and xmloff/ has quite some code to map the UNO API to ODF), here it pays off to work with ODF and not with the UNO API directly.

  • The librevenge text export talks to a librevenge generator, which is David Tardon’s excellent libepubgen in this case.

  • Finally libepubgen calls back to LibreOffice, and our package code does the ZIP compression.
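
The second and third step of the pipeline above, feeding SAX events into a generator, can be illustrated with a small self-contained sketch. This is Python rather than the real C++ code, and it does not use librevenge or libepubgen: RecordingGenerator is a hypothetical stand-in whose method names are only loosely modeled on the kind of calls a librevenge text interface offers, and the input is a tiny hand-written fragment instead of real flat ODT export output.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


class RecordingGenerator:
    """Hypothetical stand-in for a librevenge-style text generator."""

    def __init__(self):
        self.calls = []

    def open_paragraph(self):
        self.calls.append(("open_paragraph",))

    def insert_text(self, text):
        self.calls.append(("insert_text", text))

    def close_paragraph(self):
        self.calls.append(("close_paragraph",))


class OdfToGenerator(ContentHandler):
    """Forward SAX events for text:p elements to the generator."""

    def __init__(self, generator):
        super().__init__()
        self.generator = generator
        self.in_paragraph = False

    def startElementNS(self, name, qname, attrs):
        if name == (TEXT_NS, "p"):
            self.in_paragraph = True
            self.generator.open_paragraph()

    def characters(self, content):
        if self.in_paragraph and content:
            self.generator.insert_text(content)

    def endElementNS(self, name, qname):
        if name == (TEXT_NS, "p"):
            self.in_paragraph = False
            self.generator.close_paragraph()


# Hand-written fragment for illustration, not real export output.
FLAT_ODT = (
    '<office:text xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0" '
    f'xmlns:text="{TEXT_NS}"><text:p>Hello EPUB</text:p></office:text>'
)

generator = RecordingGenerator()
parser = xml.sax.make_parser()
parser.setFeature(feature_namespaces, True)
parser.setContentHandler(OdfToGenerator(generator))
parser.parse(io.BytesIO(FLAT_ODT.encode("utf-8")))
print(generator.calls)
```

Because the handler only knows about ODF element names and the generator only knows about paragraphs and text, either side can be swapped out independently, which is the benefit the architecture is after.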

The setup is a bit complicated, but it has a number of advantages:

  • Instead of reinventing the wheel, LO and DLP now share code: libepubgen is now a dependency of LibreOffice.

  • libepubgen doesn’t bring its own ZIP writer code, it can nicely reuse our existing one.

  • This is a great opportunity to finally write an ODT→librevenge bridge, so other DLP-based export libs can be added in the future (e.g. librvngabw).

  • If we ever want to export to EPUB from Draw/Impress, libepubgen will help us there as well.

Here is a list of features that you, as a user, can expect to work:

  • plain text should work fine (formatting may be lost, but content should be fine)

  • table of contents, as long as you properly use headings or you separate chapters by page breaks

  • export options: EPUB3 vs EPUB2, split on headings vs page breaks

  • basic set of character and paragraph properties should work

During development I regularly used epubcheck, so hopefully the export result is usually valid.

All this is available in master (towards LibreOffice 6.0), or you can grab a daily build and try it out right now. :-)


A year in LibreOffice’s PDF support LOCon talk

Estimated read time: 1 minute

A year in LibreOffice’s PDF support was a talk I gave today at LibreOffice conference 2017. Given that this was one of the last talks of the whole conference, thanks to those who had not gone home yet and listened. :-)


LibreOffice: Code Structure LOCon talk

Estimated read time: 1 minute

Today I gave a LibreOffice: Code Structure talk at LibreOffice conference 2017. The slides are an updated version of Michael Meeks' original slides; it actually surprised me how many things have changed since April 2016. :-)


pdfium path segment API for LibreOffice's test needs

Estimated read time: 2 minutes

I recently fixed tdf#108963, which is a PDF export bug — in case of highlighted and rotated text in e.g. Impress, the highlight rectangle in the PDF export was not rotated.

This is how the export result looked:

https://farm5.staticflickr.com/4341/37305427601_db1cfb697e_o.png

And this is how it looks now, after the fix:

https://farm5.staticflickr.com/4453/37258379126_b20fd39655_o.png

For a long time the PDF export filter had no tests at all; the current approach I introduced is that we parse the PDF export result with pdfium, which is an excellent PDF rendering library (I covered it in general in an earlier post).

So given that pdfium knows what that rectangle looks like, we should be able to query its details from a test as well, correct? It depends. Yes, it’s technically possible, but no, most of the pdfium functionality is actually not exposed in its public API.

The current situation is that one could use FPDF_LoadMemDocument() and FPDF_LoadPage() to get access to a PDF page, then FPDFPage_CountObject() and FPDFPage_GetObject() to iterate over objects on a page. We can filter for the relevant object by using FPDFPageObj_GetType() and FPDFPath_GetFillColor(), which will give us the only path that has a yellow fill color.

But getting more info about the geometry of the path isn’t really possible. As a workaround I went with FPDFPageObj_GetBounds() for the test, but wouldn’t it be nicer to get the individual segments (the objects that are the children of a path) and then get coordinates and other properties of a segment? This is what the API I recently added to pdfium now does. It provides the following:

  • FPDFPath_CountSegments() gives you the number of segments of a path

  • FPDFPath_GetPathSegment() gives you a given segment, via a new FPDF_PATHSEGMENT opaque type

  • you can use FPDFPathSegment_GetPoint() to get the coordinates, FPDFPathSegment_GetType() to get the type (move to, line to, etc.) and FPDFPathSegment_GetClose() to see if the segment closes the current subpath of the path (or not)
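
To illustrate why per-segment data is more useful for a test than a bounding box, here is a small Python sketch. It does not call pdfium: the tuples of (type, x, y, closes) are hypothetical stand-ins for what the point, type and close getters listed above would report for each segment, and the rectangle size and angles are made up. A rectangle rotated by +30 and by -30 degrees has the same bounding box size, but the segments tell the two cases apart.

```python
import math


def rotated_rectangle_segments(width, height, degrees):
    """Return (kind, x, y, closes) tuples for a rectangle rotated around the origin."""
    angle = math.radians(degrees)
    corners = [(0, 0), (width, 0), (width, height), (0, height)]
    segments = []
    for index, (x, y) in enumerate(corners):
        rotated_x = x * math.cos(angle) - y * math.sin(angle)
        rotated_y = x * math.sin(angle) + y * math.cos(angle)
        kind = "moveto" if index == 0 else "lineto"
        closes = index == len(corners) - 1  # last segment closes the subpath
        segments.append((kind, round(rotated_x, 3), round(rotated_y, 3), closes))
    return segments


def bounding_box_size(segments):
    """Width and height of the axis-aligned box around all segment points."""
    xs = [x for _, x, _, _ in segments]
    ys = [y for _, _, y, _ in segments]
    return round(max(xs) - min(xs), 3), round(max(ys) - min(ys), 3)


def first_edge_angle(segments):
    """Direction of the first edge in degrees (0 for an unrotated rectangle)."""
    (_, x0, y0, _), (_, x1, y1, _) = segments[0], segments[1]
    return round(math.degrees(math.atan2(y1 - y0, x1 - x0)), 1)


rotated = rotated_rectangle_segments(100, 20, 30)
mirrored = rotated_rectangle_segments(100, 20, -30)
same_box = bounding_box_size(rotated) == bounding_box_size(mirrored)
print(same_box, first_edge_angle(rotated), first_edge_angle(mirrored))
```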

This means that after the next pdfium update in LibreOffice, PDF export tests can nicely assert these properties of paths instead of relying on dubious "bounding box should be larger after rotation" assertions.


Split sections inside tables for LibreOffice Writer

Estimated read time: 2 minutes

Tables and sections in LibreOffice Writer are both containers, and in some cases it makes sense to have sections inside tables or tables inside sections. (For example you can mark a group of paragraphs as read-only by including them in a read-only section.) Tables in sections that split over multiple pages already worked, but now it’s possible to have sections in tables split over multiple pages as well.

First, thanks to Escriba, who made this work possible.

This work had 3 parts; you can read some details about them below.

Split of multi-line paragraphs

The first goal was to handle the split of multi-line paragraphs inside sections inside tables. Initially this looked like this:

https://farm5.staticflickr.com/4430/35957293074_cfeabe6a51_o.png
https://farm5.staticflickr.com/4393/35957293014_ae8f210542_o.png

Split of one-liner paragraphs

Technically this is a situation different to the previous one, as split paragraphs have a master (first) frame and one or more follow (non-first) frames; and the previous stage only addressed the move of follow frames to next pages. Initially such a document looked like this:

https://farm5.staticflickr.com/4360/35957292924_2af502ffc7_o.png
https://farm5.staticflickr.com/4399/35957292834_dc2ce35f85_o.png

Merge a split section

The last piece was moving paragraphs back to previous pages when there is again space for them. Initially we did not use the newly available space:

https://farm5.staticflickr.com/4432/35982835413_99a65febe2_o.png

After commit tdf#108524 sw: handle sections inside tables in SwFrame::GetPrevSctLeaf() the paragraph is moved back properly:

https://farm5.staticflickr.com/4408/35982835283_1c2002254b_o.png

One more thing…

Given that all code changes affect how sections in tables are handled in a parent frame in general (which is a body frame in all the above pictures), the same changes are also usable for other parent containers as well, e.g. linked text frames. Here is how that looks:

https://farm5.staticflickr.com/4342/35982835353_25d609548d_o.png

That’s it for now — as usual the commits are in master, so you can try this right now with a 6.0 daily build. :-)


Mail merge Writer data source

Estimated read time: 1 minute

If you ever used the mail merge wizard with data sources, then you know how it works: it typically needs some kind of data source (e.g. a Calc spreadsheet), a Writer document containing the email or letter (that contains fields), and then mail merge can generate the personalized documents for you.

In case you have an existing document where you already have such data in a Writer table, you had to somehow transfer it to one of the formats for which there was a data source driver, and then you could use it inside mail merge. I’ve now added a dedicated Writer driver in connectivity/, so picking up data directly from Writer tables is now possible.
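
As a toy model of what mail merge does with such a table, here is a short Python sketch. It has nothing to do with the actual driver code in connectivity/: the table is a plain list of rows, treating the first row as column names is an assumption, and string.Template placeholders stand in for Writer fields.

```python
from string import Template


def table_to_records(rows):
    """Treat the first row as column names (an assumption), the rest as records."""
    header, *data = rows
    return [dict(zip(header, row)) for row in data]


def mail_merge(template_text, records):
    """Produce one personalized text per record."""
    template = Template(template_text)
    return [template.substitute(record) for record in records]


# Made-up table content for illustration.
writer_table = [
    ["name", "city"],
    ["Alice", "Budapest"],
    ["Bob", "Brussels"],
]

letters = mail_merge("Dear $name, greetings to $city!", table_to_records(writer_table))
for letter in letters:
    print(letter)
```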

If you are interested in how this looks, here is a demo (click on the image to see the video):

That’s it for now — as usual the commits are in master, so you can try this right now with a 6.0 daily build. :-)

© Miklos Vajna. Built using Pelican. Theme by Giulio Fidente on github.