Improved rountrip of PDF images in LibreOffice

Posted on: Tue 18 April 2017

Estimated read time: 3 minutes

This is a follow-up to the previous post that described how it is now possible to insert a PDF file as an image in LibreOffice and export that back to PDF, while keeping the original PDF contents. I’ve recently improved this feature so the resulting file is smaller and the vector image can be viewed in more viewers. First, thanks to PMG who made this work possible.

Let’s look at the previously mentioned front page of a magazine sample when it’s viewed in okular. (A KDE pdf viewer, i.e. something that’s not Adobe Acrobat). The previously used reference XObject PDF markup is not handled by it, so the bitmap fallback was displayed:

https://farm4.staticflickr.com/3947/34031939205_5315a9afb4_o.png

Compare it with the new result:

https://farm3.staticflickr.com/2830/34031939425_24b9a126ee_o.png

Notice the sharp text in the first line.

Also the size of this sample is smaller now, since we don’t write a large bitmap, and the not shown second page of the PDF image: 2 385 984 → 1 605 558 bytes (about one third of the output is avoided).

Both techniques have pros and cons, here is a summary:

The reference XObject approach allows you to preserve the full PDF data of the image: if it was of multiple pages, even that. Also, the LibreOffice code for this is simple: we just preserve a byte array — that can hardly go wrong. The problem is that no non-Acrobat PDF viewer implements this, including e.g. your printer most probably.
The new approach uses the tokenizer I originally wrote for PDF signature verification purposes — it extracts the page stream of the first page from the original file and uses it as a form XObject in the export result — this is the same as how e.g. pdfcrop works. This markup is handled by almost all PDF viewers and also the resulting size is smaller, since the data of other pages is dropped and there is no fallback bitmap. The problem may be that this is a much more complex scenario, so it may go wrong (as usual, bugreports are welcome).

Nevertheless, the new approach seems like a much better default, so LibreOffice no longer writes the reference XObject approach unless you explicitly request it in the PDF export dialog.

Some perhaps interesting details:

PDF page streams may be provided by multiple objects, but form XObjects must have a single stream, so it we handle the case when different parts of the page stream are compressed in different ways.
LibreOffice writes PDF-1.4 by default, in case you insert a PDF image that uses PDF-1.5+, we use pdfium to downgrade that markup to 1.4, and only then insert it.
Copying the page stream of the image is not enough, we also recursively copy all referenced objects from the source PDF, while rewriting all contained references, since the objects IDs in the old and new files differ. We also take care of proper scoping of named references in the resource dictionary, so you can use this feature recursively (insert a document as a PDF image, even if that document itself contains PDF images already). :-)

All this is available in LibreOffice master, towards 5.4.

LibreOffice now uses pdfium to render inserted PDF images

Posted on: Mon 20 March 2017

Estimated read time: 2 minutes

pdfium is the rendering library used in Chromium’s pdf viewer. It’s based on the foxit pdf renderer and its rendering quality is much better compared to the pre-existing "convert PDF to ODG, then to an image" code when it comes to just viewing a PDF file. First, thanks to PMG who made this work possible.

Let’s look at a few samples that compare the old pdfimport rendering result and the new pdfium-based one. One important feature is that embedded fonts are handled. This is how this inserted PDF looked like previously:

https://farm4.staticflickr.com/3727/33163219940_3a2a3278a0_o.png

Compare it with the new result:

https://farm3.staticflickr.com/2927/33547029855_92c1a5150d_o.png

Now let’s see the front page of a magazine, you can see 4 unexpected artifacts:

https://farm4.staticflickr.com/3948/33563793222_8a6b8e8a6b_z.jpg

New result:

https://farm3.staticflickr.com/2809/33547029645_de7cbcd800_z.jpg

Finally a problem with pdfium was that LibreOffice got bitmaps from it, so in case you re-exported to PDF, the quality of these PDF images were worse than in the original PDF file. The PDF specification has a reference XObject feature that helps in this case: it allows the PDF export to still write the bitmap to the exported PDF, but in case the reader supports this feature, the vector-based original file will be shown, not the bitmap.

Here is a simple hand-crafted star in a PDF file, as it looked initially:

https://farm3.staticflickr.com/2915/33163219680_30f63b4a82_z.jpg

This is how it looks after LibreOffice’s PDF export learned to emit reference XObjects:

https://farm4.staticflickr.com/3933/33547029485_4f487bb26c_z.jpg

All this is available in LibreOffice master, towards 5.4.

ECDSA support in xmlsec-nss, bundled by LibreOffice

Posted on: Mon 13 March 2017

Estimated read time: 2 minutes

Last month a LibreOffice bugreport was filed, as the ODF signature created with Hungarian citizen eID cards is not something LibreOffice can verify. After a bit of research it seemed that LibreOffice and NSS (what we use for crypto work on Linux/macOS) is not a problem, but xmlsec’s NSS backend does not recognize ECDSA keys (RSA or DSA keys work fine).

The xmlsec improvements happened in these pull requests:

prepare the xmlsec ECDSA tests, so that they can test not only openssl, but NSS as well
add initial ECDSA support (SHA256 hashing only)
SHA1 support
SHA512 support
fix read of uninitialized memory

After this the xmlsec code looked good enough. I had to request an update of the bugdoc in the TDF bug twice, as the signature itself looked also incorrect initially:

an attribute type in the signature that had no official abbreviation was described as "UNDEF" instead of the dotted decimal form
RFC3279 specifies that an ECDSA signature value in general should be ASN1-encoded in general, but RFC4050 is specific to XML digital signatures and that one says it should not be ASN1-encoded. The bugdoc was initially ASN1-encoded.

Finally a warning still remains: while trying to parse the text of the <X509IssuerName> element, the dotted decimal form is still not parsed (see this NSS bugreport). The bug is confirmed on the mailing list, but no other progress have been made so far.

Oh, and of course: Windows is still untouched, there a bigger problem remains: we use CryptoAPI (not CNG) there, and that does not support ECDSA at all. Hooray for open-source libs where you can add such support yourself. ;-)

okular now supports online videos in PDF files

Posted on: Mon 27 February 2017

Estimated read time: 1 minutes

I rarely write about work did by others, but given that in the previous post I mentioned that linked videos (when they are a http:// URL) are not working, I had a very positive experience which is worth noting. :-)

I got a private mail nagging me to file a bug and just 3 days later it got fixed!

Thanks to both Oliver for the nagging and to Albert for the actual fix.

LibreOffice PDF export now supports videos

Posted on: Thu 16 February 2017

Estimated read time: 2 minutes

https://farm4.staticflickr.com/3924/32549564340_4d0990cfa4_o.png

PDF supports screen annotations, which means it’s possible to play embedded and linked videos on top of a static image. Given that LibreOffice also supports videos, it made sense to add support for this in our PDF export filter. First, thanks to PMG who made this work possible. This is currently added for Writer and Impress.

Linked videos

Linked videos are the situation when the video is not part of the document itself, but it’s located somewhere else, e.g. a http:// location. This is helpful if you want to email around a PDF file, and want to avoid sending large files when it has video content.

tdf#104841 is about this situation, first I added support for linked videos in Impress, then also in Writer.

The result can be played using Adobe Acrobat Reader — for some reason okular on Linux is a bit confused about http:// URLs, wants to convert them to relative ones, and then fails as of today.

Embedded videos

https://farm3.staticflickr.com/2666/32115175413_ec6f64243a_z.jpg

tdf#105093 is the embedded video case, this is handy in case you want to create an entirely self-contained PDF, where even the video content is inside the PDF file as an embedded file.

After Impress support (and a trick around Draw vs Impress shapes) the Writer part wasn’t too complicated.

Regarding the situation around various video containers and codecs, the above code is quite agnostic. :-) On the LibreOffice side all we require is to be able to extract a key frame from the video to provide a preview image, so e.g. on Linux the support depends on what gstreamer plugins you have installed. The video content is written to the PDF file as-is, so again if it will work in the PDF reader is up to the reader’s codec support. On Linux e.g. okular uses vlc for video playback, so the range of supported formats is quite wide. The same is true on Windows, what I personally tested is LibreOffice’s VLC backend and the embedded QuickTime player in Acrobat Reader.

All of this is available on LibreOffice master towards 5.4.

Impress bugfixes, in time for FOSDEM 2017

Posted on: Tue 31 January 2017

Estimated read time: 2 minutes

https://farm1.staticflickr.com/334/32605456735_ac88121be8_o.png

FOSDEM 2017 is here this weekend, and as Michael Stahl pointed out, this (together with the LibreOffice annual conference) are two time periods each year when lots of Impress bugfixes are made, as people start dogfooding. ;-) So below you can read about a pair of Impress bugs I fixed recently.

Changing font size now takes table selection into account

tdf#105502 is a situation where you have an Impress table shape, and you select part of the cells, then you click on the sidebar to change the font size. Previously this affected all cells of the table shape, now only the selected cells are updated.

Background fill for shapes

https://farm1.staticflickr.com/277/31761747774_4b1e6b8d38_o.png

tdf#105150 is a PPT(X) filter bug where a shape was previously imported as transparent, but it actually has to have the same fill type as the slide background. In case of PPTX this was already handled in general, but not in case the slide had no explicit background. The result was that in case the shape was used to cover other shapes, they were visible, leading to e.g. this unexpected red rectangle on the screenshot.

The same bug was present in the PPT import, though there existing support was even more limited: just the "background colored objects" were collected, but nothing was done to them. Now the above use-case should be as good for PPT as it is for PPTX.

Hack-(rest-of-the)-week at Collabora

Posted on: Tue 17 January 2017

Estimated read time: 2 minutes

https://farm1.staticflickr.com/726/32306648426_b4ee93f6a1_o.png

As mentioned in the blog post of Mike already, last month we were allowed to hack on anything we want in LibreOffice for a few days. I used this time to progress with 3 different topics.

Stepping through TextBoxes using the keyboard

Given that a Writer shape with a TextBox is internally two shapes, this needed explicit support. After my TextBox bugfix it’s possible to have two such shapes in a document, and once you select one of them, tab properly jumps between the two shapes; previously nothing happened.

What did happen is we tried to activate the TextBox of the selected shape, which selected the shape itself, so at the end nothing happened.

RTF improvements

For some time it was already possible to import and export custom string document properties from/to RTF, but just in case the value type of the property was string. Now I extended support for these custom properties, so also the remaining types are handled: numbers, bools, doubles and dates.

xmlsec patch upstreaming

Last, I’ve started working on upstreaming external/libxmlsec/xmlsec1-noverify.patch.1. xmlsec has no ability to disable the verification of certificates (think of curl -k or wget -k), so in LibreOffice currently we just patch out that code as we don’t need it. So I wanted to add a new verification flag to avoid patching, but it turns out that in the NSS case xmlsec didn’t do the verification, so as a first step I fixed that instead in this xmlsec GitHub pull request. Now that it’s merged, the next step will be to add such a flag, and then LibreOffice can get rid of the patch after the next xmlsec release.

PAdES support for PDF files in LibreOffice

Posted on: Tue 20 December 2016

Estimated read time: 3 minutes

https://farm1.staticflickr.com/443/30919530974_7ab3d132d8_z.jpg

Building on top of the previously mentioned signing of existing PDF files work, one more PDF feature coming in LibreOffice 5.3 is initial support for the PDF Advanced Electronic Signatures (PAdES) standard. First, thanks to the Dutch Ministry of Defense in cooperation with Nou&Off who made this work possible.

Results

PAdES is an extension of the ISO PDF signature with additional constraints, so that it conforms to the requirements of the European eIDAS regulation, which in turns makes it more likely that your signed PDF document will be actually legally binding in many EU member states.

The best way to check if LibreOffice produces such PDF signatures is to use a PAdES validator. So far I found two of them:

the ETSI one, which requires registration and is a free web service
Digital Signature Service (DSS), which is an open source tool you can build, use and modify locally

As it can be seen above, the PDF signature produced by LibreOffice 5.3 by default conforms to the PAdES baseline spec.

Implementation

I implemented the followings in LO to make this happen:

PDF signature creation now defaults to the stronger SHA-256 (instead of the previously used weaker SHA-1), and the PDF verifier understands SHA-256
the PDF signature creation now embeds the signing certificate into the PKCS#7 signature blob in the PDF, so the verifier can check not only the key used for the signing, but the actual certificate as well
the PDF signature import can now detect if such an embedded signing certificate is present in the signature or not

Note

Don’t get confused, LO does signature verification (checks if the digest matches and validates the certificate) and now shows if the signing certificate is present in the signature or not, but it doesn’t do more than that, the above mentioned DSS tool is still superior when it comes to do a full validation of a PAdES signature.

As usual, this works both with NSS and MS CryptoAPI. In the previous post I noted that one task was easier with CryptoAPI. Here I experienced the opposite: when writing the signing certificate hash, I could provide templates to NSS on how the ASN.1 encoding of it should happen, and NSS did the actual ASN.1 DER encoding for me. In the CryptoAPI case there is no such API, so I had to do this encoding manually (see CreateSigningCertificateAttribute()), which is obviously much more complicated.

Another pain was that the DSS tool doesn’t really separate the validation of the signature itself and of the certificate. The above screenshot was created using a non-self-signed certificate, hence the unclear part in the signed-by row.

If you want to try these out yourself, get a daily build and feel free to play with it. This work is part of both master or libreoffice-5-3, so those builds are of interest. Happy testing! :-)

Signing existing PDF files in LibreOffice

Posted on: Mon 12 December 2016

Estimated read time: 6 minutes

https://farm1.staticflickr.com/542/31208923460_2eb9bca147_z.jpg

TL;DR: see above — it’s now possible signing existing PDF files and also verify those signatures in LibreOffice 5.3.

The problem

LibreOffice already made it possible to digitally sign PDF files as part of the PDF export, so in case you had e.g. ODF documents and exported them to PDF, optionally a single digital signature could be added as part of the export process. This is now much improved. First, thanks to the Dutch Ministry of Defense in cooperation with Nou&Off who made this work possible.

A user can already use an other application to verify that signature or sign an already existing PDF file. The idea is to allow doing these from inside LibreOffice, directly.

Results

As it can be seen above, now the Digital Signatures dialog not only works for ODF and OOXML files, but also for PDF files. If the file has been signed, then the dialog performs verifications of that signature. Signatures are also verified on opening any signed PDF file.

I’ve also extended the user interface a bit, so that signing an existing PDF file is easy, similarly how exporting to PDF is easier than exporting to a random other file format. There is now a new File → Digital signatures → Sign exiting PDF menu item to open a PDF file for signing:

https://farm1.staticflickr.com/30/31543378146_b981a22ee0_z.jpg

When that happens the infobar has a dedicated button to open the Digital Signatures dialog, and also going into editing mode triggers a warning dialog, as going read-write is not needed to be able to sign a document:

https://farm1.staticflickr.com/752/31543377356_eaed7c2301_z.jpg

And that’s basically it, after you open a PDF file in Draw, you can do the usual digital signature operations on the file, just like it already works for previously supported file formats.

Details

What follows is something you can probably skip if you’re a user — however if you’re a developer and you want to understand how the above is implemented, then read on. ;-)

PDF tokenizer

The signing feature in ODF/OOXML is implemented by working directly on the ZIP storage in xmlsecurity/. This means that in the PDF case it’s necessary to work on the PDF file directly, except that we had no such PDF tokenizer ready to be used.

Code under xmlsecurity/source/pdfio/ now is such a tokenizer that can extract info from PDF files and can also add incremental updates at the end of the file, this way we can make sure adding a signature to a file won’t loose existing content in the file. This is fundamentally different form the usual load-edit-save workflow, when we convert the file into a document model, and work on that.

Verification of signatures

Previously LO was only able to generate signatures, not verify them. I’ve implemented PDF signature verification using both NSS and CryptoAPI, so all Windows, Linux and macOS are covered. I have to admit that the initial verification was much easier with CryptoAPI. Until I hit corner-cases, I could use an API that’s well-documented and is higher level than NSS. (I don’t have to support different hash types explicitly, for example.)

When I added support for non-detached signatures, that changed the situation a bit:

 1 file changed, 15 insertions(+), 11 deletions(-)

was the NSS patch, and

 1 file changed, 104 insertions(+), 8 deletions(-)

was the CryptoAPI patch.

Signing existing files

Signing an existing file means tokenizing a document, figuring out how an incremental update should look like for that file, writing an incremental update that has a placeholder for the actual signature (a PKCS#7 blob, where the input is just the non-placeholder parts of the document as binary data), and finally filling in the placeholder with the actual signature.

For the last step, I could reuse code from the PDF export (modulo fixing bugs like tdf#99327). For the other steps, the tokenizer remembers the input offset / length for the given token, this way it’s relatively easy to create incremental updates. You can add new objects or update new objects in such an incremental update, and this source tracking feature allows copying even the unchanged parts of updated objects verbatim.

PDF 1.5+

Everything becomes a bit more complicated once I started to handle not only LO-generated PDF-1.4, but also newer PDF versions. I think this is important, as Adobe Acrobat creates PDF 1.6 by default today, which has a number of new features (I think all of them were actually introduced in PDF-1.5) that affects the tokenizer:

xref stream: instead of an ASCII xref table ("table of contents") at the end of the file, it’s now possible to write the binary equivalent of this as an xref stream. Because the binary version can describe more features we must also write an updated xref stream (and not an xref table) when the import already had an xref stream.
object streams: it’s now possible to write multiple objects inside the stream section of a single object in binary form. The tokenizer is necessary to be able to read these objects and also roundtripping (source tracking) should work not only with physical file offsets, but also inside such compressed streams where the offset is no longer just a number inside the input file. (It’s OK to write the updated objects outside object streams, still.)
stream predictors: this is a concept from the PNG format, but also used in PDF when compressing the xref stream. See the spec for the gory details, but in short it’s not enough that instead of plaintext you have to deal with binary compressed data, you also have to filter the data before actually parsing the file offsets, and the filter is defined not in terms of object IDs and file offsets, but in terms of adjacent pixels, since it’s documented in the PNG spec. :-) (To be close to the Adobe output, we also apply such predictors when writing compressed xref streams.)

User Interface

In addition to be UI changes already mentioned above, one more improvement I did is that now the Digital Signatures dialog has a new column to show the signature type. This is either XML-DSig (for ODF/OOXML) or PDF.

Testing

I’ve added an integration test in the existing CppunitTest_xmlsecurity_signing to have coverage for the small new code that calls into xmlsecurity/ from sfx2/ in case of PDF files. But fortunately because all other code in xmlsecurity/ was new, I could do unit testing in CppunitTest_xmlsecurity_pdfsigning for the rest of the features.

Needless to say that invoking the PDF tokenizer + signature creator/verifier directly is much quicker than loading a full PDF file into Draw, just to see the signature status. ;-)

Summary

If you want to try these out yourself, get a daily build and play with it! This work is part of both master or libreoffice-5-3, so those builds are of interest. Happy testing! :-)

LibreOffice session at DevTalks Jr.

Posted on: Sat 12 November 2016

Estimated read time: 1 minutes

(via DevTalksRo)

Today I gave a Getting involved with LibreOffice Online and Android session at DevTalks Jr, Bucharest. The event had two tracks in parallel, with a total attendees of about 200 developers.

Some photos I took after the event are available.

Thanks the organizers and sponsors for the great event! :-)