This is a follow-up to the previous post that described how it is now possible to insert a PDF file as an image in LibreOffice and export that back to PDF, while keeping the original PDF contents. I’ve recently improved this feature so the resulting file is smaller and the vector image can be viewed in more viewers. First, thanks to PMG who made this work possible.
Let’s look at the previously mentioned front page of a magazine sample when it’s viewed in okular (a KDE PDF viewer, i.e. something that’s not Adobe Acrobat). The previously used reference XObject PDF markup is not handled by it, so the bitmap fallback was displayed:
Compare it with the new result:
Notice the sharp text in the first line.
The sample is also smaller now, since we no longer write a large bitmap or the second page of the PDF image (which is not shown): 2 385 984 → 1 605 558 bytes (about one third of the output is avoided).
Both techniques have pros and cons, here is a summary:
The reference XObject approach allows you to preserve the full PDF data of the image, including all pages if it had more than one. Also, the LibreOffice code for this is simple: we just preserve a byte array, which can hardly go wrong. The problem is that no non-Acrobat PDF viewer implements this, most probably including e.g. your printer.
The new approach uses the tokenizer I originally wrote for
PDF signature verification purposes — it extracts
the page stream of the first page from the original file and uses it as a
form XObject in the export result — this is the same as how e.g.
works. This markup is handled by almost all PDF viewers and also the
resulting size is smaller, since the data of other pages is dropped and there
is no fallback bitmap. The problem may be that this is a much more complex
scenario, so it may go wrong (as usual, bugreports
Nevertheless, the new approach seems like a much better default, so LibreOffice no longer writes the reference XObject approach unless you explicitly request it in the PDF export dialog.
Some perhaps interesting details:
PDF page streams may be provided by multiple objects, but form XObjects must have a single stream, so we handle the case when different parts of the page stream are compressed in different ways.
LibreOffice writes PDF-1.4 by default; in case you insert a PDF image that uses PDF-1.5+, we use pdfium to downgrade that markup to 1.4, and only then insert it.
Copying the page stream of the image is not enough: we also recursively copy all referenced objects from the source PDF, while rewriting all contained references, since the object IDs in the old and new files differ. We also take care of proper scoping of named references in the resource dictionary, so you can use this feature recursively (insert a document as a PDF image, even if that document itself contains PDF images already). :-)
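The recursive copy with reference rewriting can be sketched roughly as follows. This is only an illustration in Python, not the LibreOffice code: PDF objects are modelled as plain dicts and lists, and the `Ref` class and `copy_object()` helper are hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """An indirect reference such as `12 0 R` (generation number omitted)."""
    obj_id: int


def copy_object(src_objects, obj_id, dst_objects, id_map):
    """Copy object `obj_id` and everything it references into `dst_objects`.

    Returns the ID the object got in the destination file.
    """
    if obj_id in id_map:
        # Already copied (or being copied): reuse the new ID, this also breaks cycles.
        return id_map[obj_id]
    new_id = max(dst_objects, default=0) + 1
    id_map[obj_id] = new_id
    dst_objects[new_id] = None  # reserve the ID before recursing
    dst_objects[new_id] = _rewrite(src_objects[obj_id], src_objects, dst_objects, id_map)
    return new_id


def _rewrite(value, src_objects, dst_objects, id_map):
    """Return a copy of `value` where every reference points to a copied object."""
    if isinstance(value, Ref):
        return Ref(copy_object(src_objects, value.obj_id, dst_objects, id_map))
    if isinstance(value, dict):
        return {k: _rewrite(v, src_objects, dst_objects, id_map) for k, v in value.items()}
    if isinstance(value, list):
        return [_rewrite(v, src_objects, dst_objects, id_map) for v in value]
    return value
```

The `id_map` doubles as a "visited" set, so an object that is referenced from several places (or that refers back to its parent) is only copied once.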
All this is available in LibreOffice master, towards 5.4.
Monday, 20 March 2017
LibreOffice now uses pdfium to render inserted PDF images
pdfium is the rendering library used in Chromium’s pdf viewer. It’s based on the foxit pdf renderer and its rendering quality is much better compared to the pre-existing "convert PDF to ODG, then to an image" code when it comes to just viewing a PDF file. First, thanks to PMG who made this work possible.
Let’s look at a few samples that compare the old pdfimport rendering result and the new pdfium-based one. One important feature is that embedded fonts are handled. This is what this inserted PDF looked like previously:
Compare it with the new result:
Now let’s see the front page of a magazine, where you can see 4 unexpected artifacts:
Finally, a problem with pdfium was that LibreOffice got bitmaps from it, so in case you re-exported to PDF, the quality of these PDF images was worse than in the original PDF file. The PDF specification has a reference XObject feature that helps in this case: it allows the PDF export to still write the bitmap to the exported PDF, but in case the reader supports this feature, the vector-based original file will be shown, not the bitmap.
Here is a simple hand-crafted star in a PDF file, as it looked initially:
This is how it looks after LibreOffice’s PDF export learned to emit reference XObjects:
All this is available in LibreOffice master, towards 5.4.
Monday, 13 March 2017
ECDSA support in xmlsec-nss, bundled by LibreOffice
Last month a LibreOffice bugreport was filed, as LibreOffice could not verify ODF signatures created with Hungarian citizen eID cards. After a bit of research it seemed that LibreOffice and NSS (what we use for crypto work on Linux/macOS) are not the problem; rather, xmlsec’s NSS backend does not recognize ECDSA keys (RSA or DSA keys work fine).
The xmlsec improvements happened in these pull requests:
After this the xmlsec code looked good enough. I had to request an update of the bugdoc in the TDF bug twice, as the signature itself also looked incorrect initially:
an attribute type in the signature that had no official abbreviation was described as "UNDEF" instead of the dotted decimal form
RFC3279 specifies that an ECDSA signature value should be ASN1-encoded in general, but RFC4050 is specific to XML digital signatures and that one says it should not be ASN1-encoded. The bugdoc was initially ASN1-encoded.
Finally, a warning still remains: while trying to parse the text of the
<X509IssuerName> element, the dotted decimal form is still not parsed (see
this NSS bugreport). The
bug is confirmed on the mailing list, but no other progress has been made so far.
Oh, and of course: Windows is still untouched, there a bigger problem remains: we use CryptoAPI (not CNG) there, and that does not support ECDSA at all. Hooray for open-source libs where you can add such support yourself. ;-)
PDF supports screen annotations, which means it’s possible to play embedded and linked videos on top of a static image. Given that LibreOffice also supports videos, it made sense to add support for this in our PDF export filter. First, thanks to PMG who made this work possible. This is currently added for Writer and Impress.
A linked video is one that is not part of the document itself, but is located somewhere else, e.g. at an http:// location. This is helpful if you want to email around a PDF file, and want to avoid sending large files when it has video content.
The result can be played using Adobe Acrobat Reader — for some reason okular on Linux is a bit confused about http:// URLs, wants to convert them to relative ones, and then fails as of today.
tdf#105093 is the embedded video case. This is handy in case you want to create an entirely self-contained PDF, where even the video content is inside the PDF file as an embedded file.
Regarding the situation around various video containers and codecs, the above code is quite agnostic. :-) On the LibreOffice side all we require is to be able to extract a key frame from the video to provide a preview image, so e.g. on Linux the support depends on what gstreamer plugins you have installed. The video content is written to the PDF file as-is, so again whether it works in the PDF reader is up to the reader’s codec support. On Linux e.g. okular uses vlc for video playback, so the range of supported formats is quite wide. The same is true on Windows; what I personally tested is LibreOffice’s VLC backend and the embedded QuickTime player in Acrobat Reader.
All of this is available on LibreOffice master towards 5.4.
FOSDEM 2017 is here this weekend, and as Michael Stahl pointed out, this (together with the LibreOffice annual conference) are two time periods each year when lots of Impress bugfixes are made, as people start dogfooding. ;-) So below you can read about a pair of Impress bugs I fixed recently.
tdf#105502 is a situation where you have an Impress table shape, and you select part of the cells, then you click on the sidebar to change the font size. Previously this affected all cells of the table shape, now only the selected cells are updated.
tdf#105150 is a PPT(X) filter bug where a shape was previously imported as transparent, but it actually has to have the same fill type as the slide background. In case of PPTX this was already handled in general, but not in case the slide had no explicit background. The result was that in case the shape was used to cover other shapes, they were visible, leading to e.g. this unexpected red rectangle on the screenshot.
The same bug was present in the PPT import, though there existing support was even more limited: just the "background colored objects" were collected, but nothing was done to them. Now the above use-case should be as good for PPT as it is for PPTX.
As mentioned in the blog post of Mike already, last month we were allowed to hack on anything we want in LibreOffice for a few days. I used this time to progress with 3 different topics.
Given that a Writer shape with a TextBox is internally two shapes, this needed explicit support. After my TextBox bugfix it’s possible to have two such shapes in a document, and once you select one of them, tab properly jumps between the two shapes; previously nothing happened.
What did happen is we tried to activate the TextBox of the selected shape, which selected the shape itself, so at the end nothing happened.
Last, I’ve started working on upstreaming external/libxmlsec/xmlsec1-noverify.patch.1. xmlsec has no ability to disable the verification of certificates (think of curl -k or wget --no-check-certificate), so in LibreOffice we currently just patch out that code, as we don’t need it. I wanted to add a new verification flag to avoid patching, but it turned out that in the NSS case xmlsec didn’t do the verification, so as a first step I fixed that instead in this xmlsec GitHub pull request. Now that it’s merged, the next step will be to add such a flag, and then LibreOffice can get rid of the patch after the next xmlsec release.
Building on top of the previously mentioned signing of existing PDF files work, one more PDF feature coming in LibreOffice 5.3 is initial support for the PDF Advanced Electronic Signatures (PAdES) standard. First, thanks to the Dutch Ministry of Defense in cooperation with Nou&Off who made this work possible.
PAdES is an extension of the ISO PDF signature with additional constraints, so that it conforms to the requirements of the European eIDAS regulation, which in turn makes it more likely that your signed PDF document will actually be legally binding in many EU member states.
The best way to check if LibreOffice produces such PDF signatures is to use a PAdES validator. So far I found two of them:
As it can be seen above, the PDF signature produced by LibreOffice 5.3 by default conforms to the PAdES baseline spec.
I implemented the following in LO to make this happen:
PDF signature creation now defaults to the stronger SHA-256 (instead of the previously used weaker SHA-1), and the PDF verifier understands SHA-256
the PDF signature creation now embeds the signing certificate into the PKCS#7 signature blob in the PDF, so the verifier can check not only the key used for the signing, but the actual certificate as well
the PDF signature import can now detect if such an embedded signing certificate is present in the signature or not
Don’t get confused: LO does signature verification (checks if the digest matches and validates the certificate) and now shows whether the signing certificate is present in the signature, but it doesn’t do more than that. The above-mentioned DSS tool is still superior when it comes to doing a full validation of a PAdES signature.
As usual, this works both with NSS and MS CryptoAPI. In the previous post I noted that one task was easier with CryptoAPI. Here I experienced the opposite: when writing the signing certificate hash, I could provide templates to NSS on how the ASN.1 encoding of it should happen, and NSS did the actual ASN.1 DER encoding for me. In the CryptoAPI case there is no such API, so I had to do this encoding manually (see CreateSigningCertificateAttribute()), which is obviously much more complicated.
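To give an idea of what "doing the encoding manually" involves, here is a minimal Python sketch of DER-encoding a signing certificate hash. It is not a port of CreateSigningCertificateAttribute(): it assumes the ESS signing-certificate-v2 layout from RFC 5035 with the default SHA-256 algorithm (so the algorithm identifier is omitted), leaves out the optional issuer/serial part and the surrounding attribute wrapper, and all helper names are invented for this example.

```python
import hashlib


def der_length(n):
    """Encode a DER length: short form below 128, long form otherwise."""
    if n < 0x80:
        return bytes([n])
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return bytes([0x80 | len(body)]) + body


def der_tlv(tag, content):
    """Build one DER tag-length-value triple."""
    return bytes([tag]) + der_length(len(content)) + content


SEQUENCE = 0x30
OCTET_STRING = 0x04


def signing_certificate_v2(cert_der):
    """Encode SEQUENCE { SEQUENCE OF { SEQUENCE { OCTET STRING certHash } } }."""
    cert_hash = hashlib.sha256(cert_der).digest()
    ess_cert_id = der_tlv(SEQUENCE, der_tlv(OCTET_STRING, cert_hash))
    certs = der_tlv(SEQUENCE, ess_cert_id)
    return der_tlv(SEQUENCE, certs)
```

Even this simplified structure needs three nested length fields that all depend on the innermost content, which is exactly the bookkeeping a template-driven encoder does for you.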
Another pain was that the DSS tool doesn’t really separate the validation of the signature itself and of the certificate. The above screenshot was created using a non-self-signed certificate, hence the unclear part in the signed-by row.
If you want to try these out yourself, get a
daily build and feel free to play
with it. This work is part of both
libreoffice-5-3, so those
builds are of interest. Happy testing! :-)
TL;DR: see above: it’s now possible to sign existing PDF files and also verify those signatures in LibreOffice 5.3.
LibreOffice already made it possible to digitally sign PDF files as part of the PDF export, so in case you had e.g. ODF documents and exported them to PDF, optionally a single digital signature could be added as part of the export process. This is now much improved. First, thanks to the Dutch Ministry of Defense in cooperation with Nou&Off who made this work possible.
A user can already use another application to verify that signature or sign an already existing PDF file. The idea is to allow doing these directly from inside LibreOffice.
As it can be seen above, now the Digital Signatures dialog not only works for ODF and OOXML files, but also for PDF files. If the file has been signed, then the dialog performs verifications of that signature. Signatures are also verified on opening any signed PDF file.
I’ve also extended the user interface a bit, so that signing an existing PDF file is easy, similarly to how exporting to PDF is easier than exporting to a random other file format. There is now a new File → Digital signatures → Sign existing PDF menu item to open a PDF file for signing:
When that happens the infobar has a dedicated button to open the Digital Signatures dialog, and also going into editing mode triggers a warning dialog, as going read-write is not needed to be able to sign a document:
And that’s basically it, after you open a PDF file in Draw, you can do the usual digital signature operations on the file, just like it already works for previously supported file formats.
What follows is something you can probably skip if you’re a user — however if you’re a developer and you want to understand how the above is implemented, then read on. ;-)
The signing feature in ODF/OOXML is implemented by working directly on the ZIP
xmlsecurity/. This means that in the PDF case it’s necessary to
work on the PDF file directly, except that we had no such PDF tokenizer
ready to be used.
xmlsecurity/source/pdfio/ now is such a tokenizer that can
extract info from PDF files and can also add incremental updates at the end of
the file; this way we can make sure adding a signature to a file won’t lose
existing content in the file. This is fundamentally different from the usual
load-edit-save workflow, when we convert the file into a document model, and
work on that.
Previously LO was only able to generate signatures, not verify them. I’ve implemented PDF signature verification using both NSS and CryptoAPI, so Windows, Linux and macOS are all covered. I have to admit that the initial verification was much easier with CryptoAPI. Until I hit corner-cases, I could use an API that’s well-documented and is higher level than NSS. (I don’t have to support different hash types explicitly, for example.)
When I added support for non-detached signatures, that changed the situation a bit:
1 file changed, 15 insertions(+), 11 deletions(-)
was the NSS patch, and
1 file changed, 104 insertions(+), 8 deletions(-)
was the CryptoAPI patch.
Signing an existing file means tokenizing a document, figuring out what an incremental update should look like for that file, writing an incremental update that has a placeholder for the actual signature (a PKCS#7 blob, where the input is just the non-placeholder parts of the document as binary data), and finally filling in the placeholder with the actual signature.
For the last step, I could reuse code from the PDF export (modulo fixing bugs like tdf#99327). For the other steps, the tokenizer remembers the input offset / length for the given token, this way it’s relatively easy to create incremental updates. You can add new objects or update existing objects in such an incremental update, and this source tracking feature allows copying even the unchanged parts of updated objects verbatim.
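The placeholder trick can be sketched in a few lines of Python. This is a heavily simplified illustration and not the LO implementation: the appended "update" is just a fake `/Contents` entry instead of real PDF objects and an xref, the placeholder size is an arbitrary number, a SHA-256 digest stands in for the PKCS#7 blob, and it assumes that the signed data is everything except the hex string including its angle brackets.

```python
import hashlib

PLACEHOLDER_HEX_DIGITS = 128  # hypothetical size, a real PKCS#7 blob needs much more


def append_signature_placeholder(pdf):
    """Append a fake incremental update, return (data, placeholder start, end)."""
    update = b"\n/Contents <" + b"0" * PLACEHOLDER_HEX_DIGITS + b">\n%%EOF\n"
    start = len(pdf) + update.index(b"<")
    end = start + PLACEHOLDER_HEX_DIGITS + 2  # the range includes '<' and '>'
    return pdf + update, start, end


def signed_bytes(data, start, end):
    """The input of the signature: everything outside the placeholder."""
    return data[:start] + data[end:]


def fill_placeholder(data, start, end, signature):
    """Write the hex-encoded signature into the placeholder, keeping all offsets."""
    hex_signature = signature.hex().encode("ascii")
    capacity = end - start - 2
    if len(hex_signature) > capacity:
        raise ValueError("signature does not fit into the placeholder")
    padded = hex_signature.ljust(capacity, b"0")
    return data[:start] + b"<" + padded + b">" + data[end:]


original = b"%PDF-1.4\n(original content)\n%%EOF\n"
data, start, end = append_signature_placeholder(original)
fake_signature = hashlib.sha256(signed_bytes(data, start, end)).digest()  # stand-in only
signed = fill_placeholder(data, start, end, fake_signature)
```

The important property is that filling in the placeholder changes neither the file size nor the bytes that were signed, and that the original file is still a byte-for-byte prefix of the result.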
Everything became a bit more complicated once I started to handle not only LO-generated PDF-1.4, but also newer PDF versions. I think this is important, as Adobe Acrobat creates PDF 1.6 by default today, which has a number of new features (I think all of them were actually introduced in PDF-1.5) that affect the tokenizer:
xref stream: instead of an ASCII xref table ("table of contents") at the end of the file, it’s now possible to write the binary equivalent of this as an xref stream. Because the binary version can describe more features we must also write an updated xref stream (and not an xref table) when the import already had an xref stream.
object streams: it’s now possible to write multiple objects inside the stream section of a single object in binary form. The tokenizer needs to be able to read these objects, and roundtripping (source tracking) should also work not only with physical file offsets, but also inside such compressed streams, where the offset is no longer just a number inside the input file. (It’s still OK to write the updated objects outside object streams.)
stream predictors: this is a concept from the PNG format, but also used in PDF when compressing the xref stream. See the spec for the gory details, but in short it’s not enough that instead of plaintext you have to deal with binary compressed data, you also have to filter the data before actually parsing the file offsets, and the filter is defined not in terms of object IDs and file offsets, but in terms of adjacent pixels, since it’s documented in the PNG spec. :-) (To be close to the Adobe output, we also apply such predictors when writing compressed xref streams.)
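To make the xref stream and predictor items a bit more concrete, here is a Python sketch of the reading side. It is an illustration under simplifying assumptions, not the tokenizer from xmlsecurity/: the data is assumed to be already decompressed, only the PNG "None" and "Up" row filters are handled, and the function names are invented for this example.

```python
def undo_png_up_predictor(data, columns):
    """Undo the PNG row filter on `data`, where each row has `columns` bytes."""
    row_size = columns + 1  # one filter-type byte precedes each row
    previous = bytes(columns)  # the row above the first one counts as all zeros
    out = bytearray()
    for offset in range(0, len(data), row_size):
        row = data[offset:offset + row_size]
        filter_type, filtered = row[0], row[1:]
        if filter_type == 0:  # None: bytes are stored as-is
            current = bytes(filtered)
        elif filter_type == 2:  # Up: each byte is a delta to the byte above it
            current = bytes((f + p) & 0xFF for f, p in zip(filtered, previous))
        else:
            raise ValueError("filter type not handled in this sketch")
        out += current
        previous = current
    return bytes(out)


def parse_xref_entries(data, widths):
    """Split binary xref data into entries of big-endian fields of `widths` bytes."""
    entry_size = sum(widths)
    entries = []
    for offset in range(0, len(data), entry_size):
        pos = offset
        fields = []
        for width in widths:
            fields.append(int.from_bytes(data[pos:pos + width], "big"))
            pos += width
        entries.append(tuple(fields))
    return entries
```

With field widths of 1, 2 and 1 bytes, each entry is 4 bytes, so `columns` is 4. Since consecutive entries tend to differ only in a few offset bytes, the "Up" filter turns most of the data into zeros, which then compresses well.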
In addition to the UI changes already mentioned above, one more improvement I did is that now the Digital Signatures dialog has a new column to show the signature type. This is either XML-DSig (for ODF/OOXML) or PDF.
I’ve added an integration test in the existing
CppunitTest_xmlsecurity_signing to have coverage for the small new code that
sfx2/ in case of PDF files. But fortunately
because all other code in
xmlsecurity/ was new, I could do unit testing in
CppunitTest_xmlsecurity_pdfsigning for the rest of the features.
Needless to say that invoking the PDF tokenizer + signature creator/verifier directly is much quicker than loading a full PDF file into Draw, just to see the signature status. ;-)
If you want to try these out yourself, get a
daily build and play with it! This
work is part of both
libreoffice-5-3, so those builds are of
interest. Happy testing! :-)
Today I gave a Getting involved with LibreOffice Online and Android session at DevTalks Jr, Bucharest. The event had two tracks in parallel, with about 200 developers attending in total.
Some photos I took after the event are available.
Thanks to the organizers and sponsors for the great event! :-)
LibreOffice 5.3 will add one more vector-based format that can be inserted as an image into documents: PDF. First, thanks to PMG who made this work possible. On the user interface you can now select PDF files when you choose e.g. Writer’s Insert → Image option:
The first page of the PDF document will be shown, which is handy if the PDF file is basically used as a vector image format.
Similarly to the SVG feature, the original vector image is stored in the document, but when saving to ODF, a replacement PNG file is also generated to be backwards compatible with older ODF readers. The image context menu → Save menu item allows extracting your original PDF data from the image, too:
And that’s it, as long as you save your document in ODF, your PDF-as-an-image will be kept without losing any data. As usual, you can try this right now with a 5.3 daily build. :-)
However, if you’re interested in how this is implemented, keep reading…
The PDF image in the document model is really similar to how SVG is handled,
Graphic::getSvgData(), there is now a
This new member function exposes the original PDF data, otherwise the Graphic
is just a metafile.
ReplacementGraphicURL property of the image at an UNO level now exposes
the generated metafile for PDF images. This is implemented for both Draw and
Writer images, and is used by the ODF export filter.
Graphic instance is rendered, the layout knows nothing about the
PDF data attached to the object, only parses the generated metafile. This way
the display of the PDF image works out of the box.
First I’ve implemented a PDF import-as-graphic filter, then the export equivalent of it. As you can see, the PDF import-as-graphic filter isn’t too complicated: it completely reuses the existing "import PDF into Draw" filter and simply copies the first page of the resulting document model as a metafile.
Second, once the graphic filters were working, I’ve also
the ODF import to recognize PDF data — the export side needed no explicit
work, once the
ReplacementGraphicURL bits were in place.
As mentioned above, the Draw and the Writer image implementation is separate,
so first I’ve added tests for ODT files in the
CppunitTest_sw_odfexport, and then
cover ODP files (and other ODF formats). Second, the PDF part of the graphic
swapout/in code has a dedicated test in
the UI’s "Save original PDF" feature has a new
Oh, and if you intend to test this manually in a self-created build, make sure
--disable-pdfimport, otherwise this feature can’t work. ;-)