Reverse engineering the Quark Xpress file format

by Frans Faase

In the period from February 2001 to May 2002, I spent many hours reverse engineering the binary file format used by "http://www.quark.com/" Quark Xpress, a widely used DTP program. I have decided to make my results public in the form of a source distribution under the GNU General Public License, with the explicit amendment that any additional discoveries about the Quark Xpress file formats made with the use of this program are also made public under the GNU General Public License.

If you have downloaded the files published on this web page,
and are actively using them, I would be very happy to know this.

Although the program can read all the files I needed to read, it is by no means complete and may well crash on any other file. The biggest limitation is that the program can only read files produced by some earlier Mac versions of Quark Xpress. Files saved by the Windows version use a different byte order for integers. (On February 22, 2001, I released a very first version of the program, which was able to read some Windows files.)
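The byte-order difference can be handled by assembling integers byte by byte instead of casting raw memory. The following is a minimal sketch, not code from the distribution: the names `ByteOrder`, `read_u16` and `read_u32` are hypothetical, and it assumes that Mac files store the most significant byte first and Windows files the least significant byte first.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names; not taken from the distributed sources.
enum class ByteOrder { BigEndian, LittleEndian };

// Assemble a 16-bit unsigned integer from two bytes in the given order.
inline std::uint16_t read_u16(const unsigned char* p, ByteOrder order)
{
    assert(p != nullptr);
    if (order == ByteOrder::BigEndian)
        return static_cast<std::uint16_t>((p[0] << 8) | p[1]);
    return static_cast<std::uint16_t>((p[1] << 8) | p[0]);
}

// Assemble a 32-bit unsigned integer from four bytes in the given order.
inline std::uint32_t read_u32(const unsigned char* p, ByteOrder order)
{
    assert(p != nullptr);
    const std::uint32_t b0 = p[0], b1 = p[1], b2 = p[2], b3 = p[3];
    if (order == ByteOrder::BigEndian)
        return (b0 << 24) | (b1 << 16) | (b2 << 8) | b3;
    return (b3 << 24) | (b2 << 16) | (b1 << 8) | b0;
}
```

Because the result depends only on the order passed in, and not on the machine running the program, the same reader could in principle serve both kinds of files.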

At the moment I have only very limited time available for supporting anyone continuing the reverse engineering of the Quark Xpress formats. Please do not ask me any questions about the code: if you are not able to read the code as it has been provided, you will very likely not be able to reverse engineer the binary file format any further. (Read: Requirements.) If you want to continue working on the Windows file formats, please read the last section on this page.

For professional conversions of Quark Xpress to XML, I refer you to the following resources:

"http://www.dclab.com/QuarktoXML.html" Data Conversion Laboratorium.
"http://www.pcipage.com/" iCPS.

The sources

You can download the sources in a single zip file from here. The sources compile with the Cygnus gcc compiler in the Cygnus Unix environment under Windows. To build the program, simply compile the file scan.cpp, as it includes all the other sources.

Please note that the files CQXDoc.cpp and CDatabase.cpp are generated by the cls2cpp program from the files CQXDoc.cls and CDatabase.cls. Please do not edit these .cpp files, but regenerate them from the .cls files. You could use the following shell script for building the program:

#!/bin/sh
make cls2cpp
cls2cpp CDatabase
cls2cpp CQXDoc
gcc -g -Wall scan.cpp -o scan.exe

Of course, you could also write a small makefile for the job. I did not bother, as it would only save the half second it takes to run cls2cpp each time.

Below, a short description of the files found in the source distribution is given.

The file `scan.cpp`

The main file in the source distribution is the file scan.cpp. This file includes all the other files. No header files have been used. With present-day computers, it is often much faster to simply include all the sources in a single file than to compile all the C++ files into separate object files and then link them together. Even for larger projects, where most of the time is spent reading large numbers of include files, this can be a much faster approach than the traditional way of compiling and linking.

The file `stddef.c`

Just a collection of handy functions and macros that I often use in my C/C++ programs.

The files `CBuf.cpp` and `CReadBuf.cpp`

These files implement a number of classes to read data from a buffer. The class CBuf implements the buffer, and the classes CReadBuf and CReadButWithBlocks implement procedures to read various kinds of values from a CBuf buffer.
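To illustrate the split between a buffer and a reader working on it, here is a minimal sketch. It does not reproduce the interfaces of CBuf or CReadBuf: the names `SketchBuf` and `SketchReader` and all their members are hypothetical, and the big-endian order in `read_word` is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for a buffer class: it only owns the bytes.
struct SketchBuf {
    std::vector<unsigned char> bytes;
};

// Hypothetical stand-in for a reader: a cursor that walks over a SketchBuf.
class SketchReader {
public:
    explicit SketchReader(const SketchBuf& buf) : buf_(buf), pos_(0) {}

    std::size_t pos() const { return pos_; }
    bool at_end() const { return pos_ == buf_.bytes.size(); }

    // Move the cursor, for instance to re-read a block from its start.
    void seek(std::size_t pos)
    {
        if (pos > buf_.bytes.size())
            throw std::out_of_range("seek past end of buffer");
        pos_ = pos;
    }

    std::uint8_t read_byte()
    {
        need(1);
        return buf_.bytes[pos_++];
    }

    // Read a 16-bit value, most significant byte first (assumed order).
    std::uint16_t read_word()
    {
        need(2);
        const std::uint16_t value = static_cast<std::uint16_t>(
            (buf_.bytes[pos_] << 8) | buf_.bytes[pos_ + 1]);
        pos_ += 2;
        return value;
    }

private:
    // Throw instead of reading beyond the end of the buffer.
    void need(std::size_t count) const
    {
        assert(pos_ <= buf_.bytes.size());
        if (count > buf_.bytes.size() - pos_)
            throw std::out_of_range("read past end of buffer");
    }

    const SketchBuf& buf_;
    std::size_t pos_;
};
```

Keeping the cursor in the reader rather than in the buffer allows several readers to walk the same bytes independently, which is convenient when trying out alternative interpretations of a block.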

The files `MMFile.cpp` and `MMFileDummy.cpp`

The file MMFile.cpp implements a persistent store (database) making use of a Memory Mapped File. The file MMFileDummy.cpp implements a replacement for MMFile.cpp which is not persistent. The scan.cpp provided in the distribution uses the non-persistent implementation. If you want to use the persistent implementation, you might want to change the filename used in the open method and increase the size of the store. The program may crash in case of an overflow.
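The overflow risk comes from the store having a fixed size. The sketch below shows one way a fixed-size store could report an overflow instead of crashing. It is not the code of MMFile.cpp or MMFileDummy.cpp: the class `SketchStore` and its members are hypothetical, and the bytes live in ordinary memory, as in a non-persistent variant. A persistent variant would back the same range of bytes with a memory mapped file.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <stdexcept>
#include <vector>

// Hypothetical fixed-capacity store that hands out space front to back.
class SketchStore {
public:
    explicit SketchStore(std::size_t capacity) : bytes_(capacity), used_(0) {}

    // Reserve `size` bytes and return their offset within the store.
    // Offsets, unlike raw pointers, remain meaningful if the same bytes
    // are later mapped at a different address.
    std::size_t allocate(std::size_t size)
    {
        assert(used_ <= bytes_.size());
        if (size > bytes_.size() - used_)
            throw std::bad_alloc();  // overflow: the store is full
        const std::size_t offset = used_;
        used_ += size;
        return offset;
    }

    // Translate an offset handed out by allocate() into a pointer.
    unsigned char* at(std::size_t offset)
    {
        if (offset >= used_)
            throw std::out_of_range("offset outside allocated part of store");
        return bytes_.data() + offset;
    }

    std::size_t used() const { return used_; }
    std::size_t capacity() const { return bytes_.size(); }

private:
    std::vector<unsigned char> bytes_;
    std::size_t used_;
};
```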

The files `CQXDoc.cls` (and `CQXDoc.cpp`)

This defines the classes for storing the logical structure of a Quark Xpress document, including many of its style definitions.

It also contains the class CTextAccessor, which is an accessor to formatted text from a text fragment with all its formatting instructions. For an example of how to use it, see the file DumpQXDoc.cpp.

It also contains the class CTextOnFramesAccessor, which can be used to walk over the whole text of a book. There are no examples of its use in the code distribution, but you should be able to figure out how to use it by yourself. It also contains some elementary parsing methods.

The files `CDatabase.cls` (and `CDatabase.cpp`)

This defines a few classes for organizing some Quark Xpress files into books and maintaining a collection of books.

The file `scanQXDoc.cpp`

This contains the actual scanner. It makes heavy use of some tricky defines. The idea is that the code describes the grammar, but in case of an error, the parsing jumps back to a certain point and repeats the parsing, this time dumping information. This makes it easier to figure out what went wrong. The system does not always work perfectly.
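The idea of parsing quietly and repeating the parse with dumping after an error can be sketched as follows. This sketch uses exceptions and an optional output stream rather than defines, so it is not the mechanism used in scanQXDoc.cpp. The record layout (a tag byte 0x51, a length byte, then the payload) and all names are invented for the illustration and do not describe the Quark Xpress format.

```cpp
#include <cassert>
#include <cstddef>
#include <ostream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

struct ParseError : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Parse one invented record starting at `pos` and return the position
// just after it. When `dump` is not null, every field read is logged.
std::size_t parse_record(const std::vector<unsigned char>& data,
                         std::size_t pos, std::ostream* dump)
{
    if (pos > data.size() || data.size() - pos < 2)
        throw ParseError("truncated header");
    if (dump) *dump << "pos " << pos << ": tag "
                    << static_cast<int>(data[pos]) << "\n";
    if (data[pos] != 0x51)
        throw ParseError("unexpected tag");
    const std::size_t length = data[pos + 1];
    if (dump) *dump << "pos " << pos + 1 << ": length " << length << "\n";
    if (length > data.size() - pos - 2)
        throw ParseError("payload runs past end of data");
    return pos + 2 + length;
}

// Parse without output first. On an error, jump back to `restart` and
// parse again with dumping switched on. Returns an empty string on
// success, otherwise the dump produced by the second pass.
std::string parse_with_replay(const std::vector<unsigned char>& data,
                              std::size_t restart)
{
    try {
        parse_record(data, restart, nullptr);
        return std::string();
    } catch (const ParseError&) {
        // Fall through to the replay below.
    }
    std::ostringstream dump;
    bool replay_failed = false;
    try {
        parse_record(data, restart, &dump);
    } catch (const ParseError& error) {
        replay_failed = true;
        dump << "error: " << error.what() << "\n";
    }
    assert(replay_failed);  // same bytes, so the replay must fail as well
    return dump.str();
}
```

The benefit is that well-formed input produces no output at all, while a failure yields a trace of exactly the fields that led up to the error.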

The file `FrameGeom.cpp`

This file contains some code for determining the natural reading order of the frames. It also deals with nested frames. The algorithm used is probably not perfect, but it served my purpose well. After the main routine has been called, all frames found in the documents of a "book" are linked through first_frame_reading_order and next_reading_order.
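As an illustration of linking frames into a reading order, the sketch below sorts frames from top to bottom and then from left to right, and chains them through a next_reading_order pointer. This naive heuristic is not the algorithm of FrameGeom.cpp: it ignores nested frames and multi-column layouts, and the struct `SketchFrame`, its geometry fields and the function `link_reading_order` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical frame: only a top-left corner and the link to the next frame.
struct SketchFrame {
    int left;
    int top;
    SketchFrame* next_reading_order;
};

// Link all frames top to bottom, then left to right. Returns the first
// frame in reading order, or nullptr when there are no frames.
SketchFrame* link_reading_order(std::vector<SketchFrame>& frames)
{
    std::vector<SketchFrame*> order;
    for (SketchFrame& frame : frames)
        order.push_back(&frame);
    assert(order.size() == frames.size());

    std::stable_sort(order.begin(), order.end(),
                     [](const SketchFrame* a, const SketchFrame* b) {
                         if (a->top != b->top) return a->top < b->top;
                         return a->left < b->left;
                     });

    // Chain each frame to its successor; the last one points nowhere.
    for (std::size_t i = 0; i < order.size(); ++i)
        order[i]->next_reading_order =
            (i + 1 < order.size()) ? order[i + 1] : nullptr;

    return order.empty() ? nullptr : order.front();
}
```

The returned pointer plays the role that first_frame_reading_order has in the text above; following next_reading_order from there visits every frame once.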

The file `DumpQXDoc.cpp`

This file contains some routines to dump the information to a file, either as plain text or as HTML, but it could be modified to dump it in any format you want. This is more of an example than a working piece of code. A lot of the intelligence is in the class CTextAccessor from the file CQXDoc.cls.

Latest version for Windows file formats

For those who want to continue the work on reverse engineering the Windows file formats, I hereby also give access to the latest version of the program that can read some files produced by Quark Xpress 4.1 for Windows. I have not been able to date the version. It is definitely later than May 22, 2001. I think it is from earlier this year, as it makes use of an early implementation of the class CBuf. Actually, this version was produced on October 12, 2002, when I made some final modifications to make it generate an XML file that can be viewed with IE!

I am not very proud of this program, because the code contains a lot of rubbish. Please do not look at it if you are not an expert programmer. At some points it might cause more confusion than it helps. (I am afraid it does contain some amount of dead code.) When run, it produces a lot of debugging output on stdout, which I usually redirect to a file. A file with the extension .xml will be generated if the program does not crash, which I am afraid is very likely if you feed it an arbitrary Quark Xpress 4.1 document.

If you want to contribute to the reverse engineering of the Quark Xpress file formats, do not develop this program further, but rather make modifications to the latest source base. I am not willing to publish any modifications to the qq.cpp program.
