
DOC support in mso-dumper

Estimated read time: 2 minutes

mso-dumper is a project that creates a more or less human-readable dump from binary files. Kohei Yoshida initially developed it to dump XLS, then Thorsten Behrens added support for PPT files, and last November I started to add DOC support.

You may ask: why is that useful? My answer is that I spend quite some time on the import/export filters of LibreOffice Writer, and to improve or fix such filters, some knowledge of both the file format in question and Writer internals is needed. Regarding the file format knowledge, I find it much easier to read the specification once and implement a simple dumper based on it than to read the specification again and again while trying to understand what’s going on inside a binary file using a hex editor.

To my knowledge, such a dumper for the DOC format (in particular the WW8 version of it) did not exist previously. WW8Dumper was the closest match, but that was far from complete and I found extending mso-dumper easier.

To stress-test the parser, I used get-bugzilla-attachments-by-mimetype to get all DOC attachments from the Freedesktop bugzilla, and in the last few days I fixed the remaining crashes (actually this is why I write this post now ;-) ). If you want to try it out, you can do so by:

git clone git://anongit.freedesktop.org/libreoffice/contrib/mso-dumper
cd mso-dumper
./doc-dump.py /path/to/doc/file.doc

The idea is that the dumper should not crash on any input: either it should give you a usable result, or, in case some unhandled structure is reached, it should print a <todo> XML tag. Other than that, patches are of course welcome. That said, Maxime de Roucy already contributed a patch to the DOC part of mso-dumper, thanks! :-)
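To illustrate the "never crash, print a <todo> instead" idea, here is a minimal sketch. It is not mso-dumper's code, and the record layout (a 1-byte type, a 2-byte little-endian length, then the payload) is made up for the example; it is not the DOC format.

```python
import struct
from xml.sax.saxutils import quoteattr

# Hypothetical record types, for illustration only.
HANDLERS = {
    0x01: lambda payload: "<text value=%s/>" % quoteattr(payload.decode("latin-1")),
    0x02: lambda payload: '<number value="%d"/>' % struct.unpack("<I", payload)[0],
}


def dump(data):
    """Return a list of XML-like lines describing data; never raise on bad input."""
    out = ["<stream>"]
    pos = 0
    while pos < len(data):
        if pos + 3 > len(data):
            # Not enough bytes left for a record header.
            out.append('<todo what="truncated header at offset %d"/>' % pos)
            break
        tag, size = struct.unpack_from("<BH", data, pos)
        payload = data[pos + 3:pos + 3 + size]
        pos += 3 + size
        handler = HANDLERS.get(tag)
        if handler is None:
            out.append('<todo what="unhandled record 0x%02x"/>' % tag)
        elif len(payload) != size:
            out.append('<todo what="truncated record 0x%02x"/>' % tag)
        else:
            try:
                out.append(handler(payload))
            except struct.error:
                # Known record type, but the payload has an unexpected size.
                out.append('<todo what="malformed record 0x%02x"/>' % tag)
    out.append("</stream>")
    return out
```

Feeding it arbitrary bytes, for example `print("\n".join(dump(open(path, "rb").read())))`, should always produce output; anything the sketch does not understand shows up as a `<todo>` line rather than a traceback.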


Hackweek 9

Estimated read time: 6 minutes

Last week was Hackweek at SUSE. Below is a quick summary of the experiments I did during that time.

lcov

I did some experiments with using lcov on the LibreOffice codebase. The goal is to have a quick iteration, so you can see the current coverage of a file or a directory, select a method that is not yet tested, add a test for it, and "test" the test by checking if the coverage indeed got improved. As a first step, I tried this out on the Writer RTF import:

cd writerfilter
touch source/rtftok/*
make -sr -j8 gb_GCOV=YES <1>
cd ../sw; make -sr -j8 CppunitTest_sw_rtfexport CppunitTest_sw_rtfimport <2>
lcov --directory workdir/unxlngx6/CxxObject/writerfilter/source/rtftok/ --capture --output-file libreoffice.info <3>
genhtml -o coverage libreoffice.info <4>
  1. rebuild selected files with lcov options

  2. run the tests

  3. extract coverage information to a single .info file

  4. generate some nice HTML output from the .info file

Note
lcov had problems with gcc-4.7; a fully updated openSUSE 12.2 or 12.3 is known to work.

There is a script available to make the above a bit more automated.

The speed of the above depends on the amount of code needing a rebuild + the number of tests, but it should not take more than a minute.

E.g. I noticed the bookmark import code isn’t tested, added a test for it, and that indeed improved the line coverage of rtfdocumentimpl.cxx: 84.1% → 85.0%.
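The before/after percentages can also be read straight from the .info tracefile instead of the HTML report. The following is a rough sketch, assuming the usual lcov tracefile records (`SF:` starts a source file, `DA:<line>,<count>` gives one line's execution count, `end_of_record` closes the file); the function names are mine, not part of lcov.

```python
from collections import defaultdict


def line_counts(info_text):
    """Collect {source file: {line number: execution count}} from a tracefile."""
    counts = defaultdict(dict)
    current = None
    for raw in info_text.splitlines():
        line = raw.strip()
        if line.startswith("SF:"):
            current = line[3:]
        elif line == "end_of_record":
            current = None
        elif line.startswith("DA:") and current is not None:
            fields = line[3:].split(",")
            number, count = int(fields[0]), int(fields[1])
            # The same file may appear in several records; sum the counts.
            counts[current][number] = counts[current].get(number, 0) + count
    return counts


def line_coverage(info_text):
    """Return {source file: percentage of instrumented lines that were executed}."""
    result = {}
    for path, lines in line_counts(info_text).items():
        hit = sum(1 for count in lines.values() if count > 0)
        result[path] = 100.0 * hit / len(lines)
    return result
```

Running `line_coverage()` on the tracefile captured before and after adding a test, and comparing the value for the file in question, gives the same kind of "did the coverage go up?" answer as the report.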

A next area I wanted to test is the Writer RTF export. Let’s pick something in rtfattributeoutput.cxx… StartURL() is not tested, so a hyperlink testcase should help. Indeed it did: 50.2% → 52.0%.

Last but not least, thanks to Norbert Thiebaud, who added gb_GCOV to gbuild.

gdb pretty-printers

Then I experimented with improving our Writer gdb Python pretty-printers. One annoying shortcoming was the lack of handling uno::Reference<text::XTextRange>. Imagine one searches for a bug related to table import for DOCX or RTF. One idea is to check the arguments of the convertToTable() method call. The first argument is a 2D array of XTextRange pairs, that describe what will be the input for cell contents. So if you want to check the first cell, you do something like this:

(gdb) b DomainMapperTableHandler.cxx:798
(gdb) r
(gdb) print (*m_pTableSeq)[0][0]
$1 = uno::Sequence of length 2 = {uno::Reference to (XInterface) 0x1a73648, uno::Reference to (XInterface) 0x1a77f68}
(gdb) print (*m_pTableSeq)[0][0][0]
$2 = uno::Reference to (XInterface) 0x1a73648
(gdb) print (*m_pTableSeq)[0][0][1]
$3 = uno::Reference to (XInterface) 0x1a77f68

Not that helpful. Here is how one could work around it:

(gdb) print (*m_pTableSeq)[0][0][0]._pInterface->m_pImpl->m_pMark->m_pPos1
$4 = boost::scoped_ptr SwPosition (node 10, offset 0)
(gdb) print (*m_pTableSeq)[0][0][1]._pInterface->m_pImpl->m_pMark->m_pPos1
$5 = boost::scoped_ptr SwPosition (node 10, offset 20)

But this is not something anyone will remember. After adding a few new pretty-printers, now it’s like this:

(gdb) print (*m_pTableSeq)[0][0]
$1 = uno::Sequence of length 2 = {uno::Reference to (SwXTextRange *) 0x1a72b98, uno::Reference to (SwXTextRange *) 0x1a773b8}
(gdb) print *(*m_pTableSeq)[0][0][0]._pInterface
$2 = (SwXTextRange) SwXTextRange sw::UnoImplPtr SwXTextRange::Impl = {mark = sw::mark::IMark = {pos1 = boost::scoped_ptr SwPosition (node 10, offset 0), pos2 = empty boost::scoped_ptr}}
(gdb) print *(*m_pTableSeq)[0][0][1]._pInterface
$3 = (SwXTextRange) SwXTextRange sw::UnoImplPtr SwXTextRange::Impl = {mark = sw::mark::IMark = {pos1 = boost::scoped_ptr SwPosition (node 10, offset 20), pos2 = empty boost::scoped_ptr}}

Technically, it would be possible to make print (*m_pTableSeq)[0][0][0] work as well, but for a larger class without a pretty-printer that would result in multiple pages of output. Anyway, _pInterface is the same for all UNO objects, so it is not too hard to remember.
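For readers who have not written a gdb pretty-printer before, here is a minimal sketch of the general shape. It is not the LibreOffice printer code: the `Position` struct and its `node` and `offset` members are hypothetical, chosen only to keep the example short.

```python
try:
    import gdb
    import gdb.printing
except ImportError:
    gdb = None  # not running inside gdb, e.g. when testing the printer class


class PositionPrinter:
    """Printer for a hypothetical struct Position { long node; int offset; }."""

    def __init__(self, value):
        self.value = value

    def to_string(self):
        # Inside gdb, value is a gdb.Value and members are looked up by name.
        return "Position (node %s, offset %s)" % (self.value["node"], self.value["offset"])


def build_printers():
    """Group the printers in a collection that matches on the type name."""
    printers = gdb.printing.RegexpCollectionPrettyPrinter("demo")
    printers.add_printer("Position", "^Position$", PositionPrinter)
    return printers


if gdb is not None:
    gdb.printing.register_pretty_printer(gdb.current_objfile(), build_printers())
```

Saved as, say, printers.py (a made-up name), it can be loaded with `source printers.py` at the gdb prompt; after that, printing a `Position` value gives the one-line summary instead of the raw members.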

Another improvement is the XTextCursor pretty-printer. Example usage: debugging of the commented text range ODF import. Before:

(gdb) b txtfldi.cxx:559
(gdb) print *rHlp.GetCursor()._pInterface->m_pImpl->pRegisteredIn->m_pMark
$1 = SwPosition (node 9, offset 4)

After the new pretty-printers one doesn’t have to type that much:

(gdb) print *rHlp.GetCursor()._pInterface
$1 = (SwXTextCursor)
    SwXTextCursor sw::UnoImplPtr SwXTextCursor::Impl = {registeredIn = SwModify = {point = SwPosition (node 9, offset 4), mark = SwPosition (node 9, offset 4), next = 0x1a28b88, prev = 0x1a28b88}}

RTF filter text frame rework

Finally, I experimented with reworking the textframe code in the RTF filter. In short, the motivation is to bring the RTF filter in sync with the OOXML one, which can nicely import and export text box gradients. To get there, there are 3 different problems to solve, plus testing:

  1. The RTF import filter currently imports rectangle and textbox shapes as drawinglayer rectangles, even if they have some text inside. Just like the OOXML import filter does, we should import these shapes as Writer textframes, as long as they contain some text.

  2. The RTF export writes Writer textframes as old-style Word frames, not as text box shapes. This should be changed, as the old syntax doesn’t support gradients, and in general both the DOC and DOCX export filters already export new-style Word frames, so there is no reason why the RTF filter would not do the same.

  3. Once all the above is done, add support for gradients in the RTF filter, in a similar way OOXML filters were already improved to handle gradients.

  4. Once this all is done, add new testcases to cover the new code.

First I hacked on #1. Sadly, Writer textframes and drawinglayer rectangles don’t share exactly the same UNO API: for example, drawinglayer has a TextWritingMode and a Name property, while Writer textframes have a WritingMode property instead and additionally implement the XNamed UNO interface.

Then I switched to #3. There I managed to reuse our existing VML import to do the hard work: the RTF tokenizer reads the RTF shape properties, then constructs the same VML model that is normally built from v:fill and v:shadow XML elements inside DOCX files, and finally the VML import maps Word’s gradient concept to the Writer gradient concept.

At the end of the week I also hacked on #2 and #4 — and while I did so, I noticed two more interesting details of Word’s new-style RTF textframe markup:

  • The bad news: Writer supports having different top/left/bottom/right borders, but RTF still only supports the concept of a single line around the textframe.

  • The good news: old-style RTF frames didn’t support different left/right or top/bottom external margins, but Writer does — so now using the new syntax, this is exported properly.

git

Unrelated to the above, I fixed an annoying git bug: when one tried to cherry-pick multiple commits at the same time and copy&paste went wrong, the "unrecognized" arguments were just silently ignored. Now one gets an error instead.

docs.libreoffice.org

In parallel to the above, Thorsten was kind enough to explain how to update docs.libreoffice.org. The new output is generated using doxygen 1.8 and contains a bit more eye-candy. E.g. notice the new foldable subsections here. ;-)


LibreOffice Writer now supports graphic bullets in its DOCX/RTF filters

Estimated read time: 1 minute

If you ever tried to use graphical bullets in Writer (Format → Bullets and Numbering → Graphics), you may have noticed that only the ODF filter can load and save such a numbering. This is now improved a lot. Motivated by seeing that this is now handled in the binary DOC filter, I added support for it to the DOCX and RTF import and export filters as well. If you want to play with this feature, core.git also contains a DOCX and an RTF sample.


git-review

Estimated read time: 2 minutes

LibreOffice started to use Gerrit for code review, and while occasional contributors can submit patches manually, it’s handy to use a dedicated tool if one does many reviews. In core.git, we have logerrit, but that’s not advised for regular reviewers either; git-review is recommended instead.

So I looked into git-review. The good news is that it’s packaged already for most distributions, e.g. a simple

zypper in python-git-review

on openSUSE installs it.

I wanted to use this tool for two tasks:

  • Submitting changes to Gerrit: git review -R could do that. -R prevents automatic rebase, so a test build won’t fail because your patch is based on an already broken commit. The other good thing is that you don’t have to remember where to submit: both the master and libreoffice-4-0 branches contain a .gitreview file that contains the necessary server / branch information.

  • Cherry-picking changes from Gerrit: I found no option for this. A cherry-pick command is generated on the web interface, but it’s more complicated than a simple <some command> <number of the change>. So I submitted this change to git-review itself; the next release will be able to do git review -x <number of the change>.
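Part of what makes the generated cherry-pick command more than "a number" is that Gerrit publishes each patch set under a ref of the form refs/changes/<last two digits of the change>/<change>/<patch set>. The sketch below only builds the two git invocations from those numbers. It is not how git review -x is implemented, and the function names and the remote name passed in are assumptions for the example.

```python
def change_ref(change, patchset):
    """Return the Gerrit ref for one patch set of a change."""
    if change < 1 or patchset < 1:
        raise ValueError("change and patch set numbers start at 1")
    # Gerrit shards the refs by the last two digits of the change number.
    return "refs/changes/%02d/%d/%d" % (change % 100, change, patchset)


def cherry_pick_commands(remote, change, patchset):
    """Build the two git invocations as argument lists, without running them."""
    return [
        ["git", "fetch", remote, change_ref(change, patchset)],
        ["git", "cherry-pick", "FETCH_HEAD"],
    ]
```

The argument lists can be passed to `subprocess.check_call()` one after the other; note that the patch set number still has to come from somewhere, which is one more reason a dedicated option in the tool is nicer.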

Probably the browser interface is still the best to comment (especially inline comment) and approve changes, though David even submitted a proof of concept patch for that as well.

Finally, let me just clear up two myths:

  • If you use Google for OpenID login, you can have multiple OpenID accounts associated with your Gerrit login, so it’s not a problem (at first I thought it was) if you use one email for Gerrit and another one for accessing other Google services.

  • Somewhere I read that the stock LibreOffice hooks conflict with git-review: nope, git-review didn’t touch the hooks, you can use the tool without corrupting them in any way.


LibreOffice Writer now supports gradients in text frame backgrounds

Estimated read time: 1 minute

When you create a rectangle or text frame in Writer, you have two choices. You can use the draw toolbar to create a drawinglayer rectangle, and you can also insert a text frame. The drawinglayer shapes are shared between the LibreOffice applications, and already supported having not only a bitmap or a color but a gradient or a hatch as a background. The benefit of Writer text frames is that they can contain anything a normal Writer document can — think of columns, tables, etc. These features are not supported by drawinglayer rectangles.

So till now you had to decide what to pick, but it wasn’t possible to have both. LibreOffice 4.1 makes this situation better. Now it’s possible to have gradient backgrounds in Writer text frames as well:

The nice thing is that this feature was already supported by ODF, just not by Writer, so no such paperwork was needed this time. Also the OOXML filters are updated. As I already stated in this comment, the binary DOC and RTF filters are not yet touched regarding this feature — though I already looked into the RTF one, and have some idea what rework is needed there first.


FOSDEM 2013

Estimated read time: 1 minute

We spent the last weekend in Brussels, at FOSDEM 2013. Besides attending great talks, I most enjoyed meeting people I hadn’t met in person before, in no particular order:

Also fixed fdo#48440, fdo#58646 and fdo#59419 during less-interesting talks. ;-)

Additionally, during the last day we had time for some sightseeing; some pictures are here. Slides of other LibreOffice talks are also available.


lcov

Estimated read time: 2 minutes

There are multiple strategies for adding testcases to code that sort of works, but has no or too few tests. One approach (that works quite well in LibreOffice, for example) is to just add tests for new code; there the test is "good" if it passes, but fails if you revert the corresponding real change.

Another approach to avoid duplicated tests is to use a tool like lcov, that can perform line or function coverage analysis for you, so a test is "good" if it increases the coverage. I wanted to look into this latter approach for LibreOffice, but I decided it’s more fun to try this out for a smaller project first. That’s when adding testcases for BitlBee’s Skype plugin came into my mind.

The problem there is that manual testing typically includes multiple online Skype clients and an IRC client as well, and such tests are extremely unreliable. So I thought: if I’m able to mock both the interactive IRC and Skype clients, then it’ll be easy to test the C Skype plugin itself, even for very special scenarios (like changing a groupchat topic in the middle of inviting somebody to a groupchat or similar).

So here is what the result looks like:

 skyped mock file   +--------+         +---------+   pyexpect mock file
------------------> | skyped | <-----> | bitlbee | <--------------------
                    +--------+   TCP   +---------+

For skyped, the exact traffic is recorded and played back later; for BitlBee, only the outgoing traffic is exact, while for the incoming traffic pyexpect allows just patterns (to tolerate uninteresting changes). Once the framework was available, it was quite easy to add testcases: I already have 70%+ coverage, and I think approaching 100% function coverage is realistic. :-)
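The "exact outgoing, pattern-matched incoming" idea can be sketched in a few lines. This is not the actual mock framework and not the format of its mock files: the script representation (a list of direction/text pairs) and the `replay` function are assumptions made for the example.

```python
import re


def replay(script, send, receive):
    """Run a recorded conversation against a peer.

    script is a list of (direction, text) pairs: ">" entries are sent
    verbatim, "<" entries are regular expressions that the next received
    line must match in full. Returns the index of the first mismatch,
    or None if the whole script was replayed successfully.
    """
    for index, (direction, text) in enumerate(script):
        if direction == ">":
            send(text)
        elif direction == "<":
            line = receive()
            if line is None or re.fullmatch(text, line) is None:
                return index
        else:
            raise ValueError("unknown direction %r" % direction)
    return None
```

Here `send` and `receive` are plain callables, so in a real setup they would wrap a socket or a child process, while a test can pass in `list.append` and an iterator over canned replies. Using patterns only on the receiving side means that a timestamp or counter in a reply does not break the test.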

What was also interesting is that it turned out the latest upstream lcov release is not compatible with gcc-4.7, but the necessary patches are now integrated, and the next upstream release will work out of the box.

The BitlBee mock files can be found here. Given that there are now instructions to do similar analysis for LibreOffice as well, I hope to look into increasing test coverage for the classes I maintain as well.


mdadm upgrade

Estimated read time: 2 minutes

Even though I spend little of my free time with sysadmin stuff these days, this came up recently. A few years ago I hit an issue where mdadm created metadata that was too new for the installed kernel to handle, so I remembered to use --metadata 0.90 when creating a new array. Additionally, I preferred using cfdisk for partitioning.

It turns out this caused quite some grief when it came to grub2. I wrote about this earlier, but that was about the theory, in a VM; this is about the practice. In practice, gparted turned out to be too risky, and I chose the following approach to repartition the hard drives (so there is enough space for grub2) and upgrade the mdadm metadata.

First, I broke the mirror by removing one leg of the RAID1 array:

mdadm --manage /dev/md126 --fail /dev/sdd1
mdadm --manage /dev/md126 --remove /dev/sdd1

Then I created a new array (with a single leg) with the new metadata and formatted it:

fdisk /dev/sdd
mdadm --create /dev/md125 --metadata=1.0 --level=1 --assume-clean --raid-devices=2 missing /dev/sdd1
mkfs.ext4 /dev/md125

Finally I copied over the live system:

mkdir /mnt/md125
mount /dev/md125 /mnt/md125
rsync --delete -avxP / /mnt/md125
umount /mnt/md125

The rest was easy: I booted a livecd to do the rsync once again (taking a few minutes only), and once the system was running from the new array, added the leg of the old array to the new one as well — and that’s it.


Recent contributions

Estimated read time: 1 minute


Zero RTF Regressions?

Estimated read time: 1 minute

I think the first attempt to track LibreOffice RTF Writer regressions (bugs not present in some earlier versions) was in this mail. That started with 14 bugs, and of course while I fixed a few, new ones were added as well. I guess this is mostly due to testing work, since new fixes are usually covered by unit tests, so re-introducing the same problems nowadays is a bit more work.

I remember I was down to one regression a few months ago, but we still had performance problems, which got solved a few weeks ago, so I had the idea that I want to go down to zero during the holidays. It seems today I finally managed to do so: bugs tagged as rtf_filter and regression are gone, thanks to everyone who helped! :-)

For reference, here are the queries: RTF regressions, fixed RTF regressions, Writer regressions.

Now that the list is empty, feel free to tag more bugs as rtf_filter from the long Writer list when needed.

Update: the list is now empty again, as of 2014-11-24, for the 4.4 release. ;-)

© Miklos Vajna. Built using Pelican. Theme by Giulio Fidente on github.