1. Abstract

For the ooxml Word import filter, a new filter framework is developed that has better import quality as the old separate filters, and also shares quite some code. Having the RTF import use that framework would be hugely useful; both binary doc and rtf are very similar in structure, and could have tokenizers plugged into the filter framework (and thus being handled by the same high-level filter code). The doc tokenizer is already worked on, so a RTF tokenizer would be a an awesome feature - and this summer I would like to work on this.

2. Content

2.1. Detailed Description

2.1.1. Full description

LibreOffice Writer’s import functionality basically has two forms. The old style ones are C++ filters are subcalled from the Reader class (for example: RtfReader for RTF, AsciiReader for ASCII, HTMLReader for HTML). The new style ones are implemented as an UNO component (for example: the writerfilter/source/ooxml directory for DOCX). There are currently two general problems with the RTF import filter:

  • It is an MS Word format, but it does not use any shared code from the filter framework.

  • It has several shortcomings, almost none of the new export features listed in sw/source/filter/ww8/README-rtf.txt are handled.

Porting RtfReader to use the new filter framework has been started (see writerfilter/source/rtftok/), but is not yet done. This summer I would like to finish it.

Additionally - as time permits, after porting is done - I want to pick up features from RTF specification which are not yet supported by LibreOffice’s RTF import filter and add support for them.

2.1.2. Benefits

When the project is completed, users who use the RTF import filter can have a better one: from a technical view it will use a more modern API, from a practical view, it’ll be improved in general.

2.1.3. Motivation

I got the idea of working on this because when I did the RTF export filter, I had to use Microsoft Office to test the result. Also, whenever a user saves a document in RTF, then opens it, all the benefits of the new exporter goes away, as the importer does not understand those new features. This gave me a feeling that improving RTF support would be a good idea in general.

2.1.4. Implementation design

Regarding implementation, I want to take the DOCX import filter as an example (without over-complicating it!) and implement the new RTF filter in a similar way - but of course heavily based on the current RtfReader implementation where it makes sense.

I don’t know exact details, because I haven’t (yet) read the started RTF tokenizer nor the relevant part (writerfilter) of the LibreOffice API, but I expect to read and understand the code from the sw/source/filter/rtf directory and I want to place the new RTF code under writerfilter/source/rtftok, where the DOCX one already is.

I’m aware that it would be possible to use XSL transformations to create a new RTF filter as well, but not ignoring performance aspects and my (quite weak) XSLT knowledge, I decided to create a C++ filter.

2.2. Implementation timeline

2.2.1. Milestones

  • Update the development environment to latest master on which I will base my work, decide where do I publish my code, work out other infrastructure details. (Till 2011.05.30.)

  • Understand how the already existing DOCX filter work: the tokenizer and the domain mapper. (Till 2011.06.06.)

  • Understand how the current RTF import filter works. (Till 2011.06.13.)

  • Decide how do I test my code, possibly write testcases and/or work out a mechanism to compare the new filter to the old one. The more automated way, the better. Partly to ensure I do not break anything, and partly to provide an objective method to measure my progress. Find out what features are missing from the domain mapper to be able to port the RTF filter without introducing regressions. (Till 2011.06.20.)

  • Do the actual porting: completing the tokenizer and extending the domain mapper to provide the features needed for the new RTF import. At first round I want to reuse code from RtfReader where possible and reach a point where the new filter is as much good as the old one was. (Till 2011.07.04.)

  • Review what was reached, if there are differences which are decided to be good ones, document them. Decide what to work on next, etc. (Till 2011.07.11.)

  • Fix bugs, fine-tune, update the code based on suggestions from other LibreOffice developers, implement features from the specification if time permits. (Till 2011.08.08.)

So I would like the first version of the new RTF filter ready by the time of mid-term evaluation, then I can work out the minor problems, write documentation and if time permits - implement new features from the specification in the second half of the summer. I used concrete dates so that I can be checked easily if I’m on track, but of course I may finish with a given part a bit earlier or I may have a little delay.

2.2.2. Exams, holidays, etc.

I’m a student in Hungary, and here the exam period is between 2011.05.23 and 2011.06.20. Sadly this has a big overlap with the Summer of Code, which starts on 2011.05.23. I’m aware of this and last year I managed to get over this as well. I plan to work on my project 40 hours a week, which permits to spend entire days on my project, then work nothing a few days before the exam. I don’t know the exact date, but probably after the mid-term evaluation I’ll have a week of holiday. I think once I returned I can work more productive on my project with a fresh mind, compared to going nowhere during the whole summer.

2.2.3. Communication

My experience is that IRC is handy for quick questions, but sometimes email is better to communicate in an asynchronous way. I also like writing each day about what did I do in a blog-like form. If this is OK for my mentor, then I would like to use these for this project as well.

2.2.4. Invested time before, during and after SoC

I’m doing trivial git-related (or other areas where I’m familiar with the codebase) small fixes before the SoC starts, just to get used to the new build system, etc. I plan to work 40 hours a week during the SoC. (Well, basically. I do not use a timer, the average is about something like this.) I don’t know the future, I hope I can do some contribution after SoC as well.

2.2.5. Future work

I don’t know anything about it so far, but I’m sure there will be new ideas during the implementation of the project. I’m sure I’ll at least document these ideas once the project is finished, so I or others can work on them after the end of SoC.

2.3. Relevant knowledge

2.3.1. Experience of LibreOffice

As every average Linux user - I’m using LibreOffice on a daily basis, to read/write/convert any non-ascii document (well, except PDF). I use Writer / Calc / Impress whenever the document is not long/technical enough that using Latex worths considering.

2.3.2. Experience in the project specific area

I expect that I’ll need to write C++ code during the project. I contributed to various projects written in that language, most notably I added a new RTF exporter to Go-oo last summer, which was implemented in about 8000 lines of C++ code.

2.3.3. Used toolchains and platforms

I’m a Linux guy, I used OS X for a long time (but not currently), mainly for testing purposes as well. I do not really use Windows, but I have one installed in a virtual machine (can be useful when I want to see what MSO exports in RTF to feed it to the RTF importer :-) ). I’m familiar with gcc and make, I used autotools for other projects previously as well. I’m familiar with git as well, but I’m quite new to LibreOffice’s new build system (though right now writerfilter uses dmake, it seems).

2.3.4. Recent interaction with the LibreOffice community

I’m hanging around on #libreoffice-dev, and I’m subscribed to the developer mailing list.

2.4. About Me

2.4.1. Where do I work/study and my interests

Education: Completing an M.Sc. degree in Computer Science (to be finished this year, December - final exams in January) at Budapest University of Technology and Economics. I work for SZTAKI (http://www.sztaki.hu/?en, part time - 1 day in a week, since 2004). I have a page with a few minor projects: http://vmiklos.hu/projects/

A have a page with links to patches I contributed to other FOSS projects: http://vmiklos.hu/portfolio/ Mostly minor code and/or documentation fixes. I contributed a few more complex patches to the pacman package manager, the bitlbee IM gateway, Git, SWIG and OpenOffice.org (these are C and C++ projects). I’m generally a bash, C/C++ and Python guy. I know some perl/Java/XSL as well.

2.4.3. IRC name

vmiklos

3. My easy programming task