Reverse engineering the Quark Xpress file format
by Frans Faase
In the period from February 2001 until May 2002, I spent
many hours reverse
engineering the Quark Xpress binary file format as
used by Quark Xpress (http://www.quark.com/),
a widely used DTP program.
I have decided to make my results public
in the form of a source distribution under
the GNU General Public License,
with the explicit amendment that any additional
discoveries about the Quark Xpress file formats that are
made with the use of this program are also made public under
the GNU General Public License.
If you have downloaded the files published on this
web page, and are actively using them, I would be very
happy to know this.
Although the program can read all the files I needed to read,
it is by no means complete, and it may well crash on any other
file. The biggest limitation is that the program can only read
files produced by some earlier Mac versions of Quark Xpress.
Files saved by the Windows version use a different byte order
for the integers. (On February 22,
2001, I already released a very first version of
the program, which was able to read
some Windows files.)
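To illustrate what the byte order difference means in practice, here is a minimal sketch. It is not code from the distribution. It assumes that the Mac files store integers with the most significant byte first and the Windows files with the least significant byte first; the names ByteOrder, read_u16 and read_u32 are made up for this example.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical enum (not from the sources): the byte order used in a file.
enum class ByteOrder { BigEndian, LittleEndian };

// Read an unsigned 16-bit integer from the two bytes at p.
inline std::uint16_t read_u16(const unsigned char* p, ByteOrder order)
{
    assert(p != nullptr);
    return order == ByteOrder::BigEndian
        ? static_cast<std::uint16_t>((p[0] << 8) | p[1])
        : static_cast<std::uint16_t>((p[1] << 8) | p[0]);
}

// Read an unsigned 32-bit integer from the four bytes at p.
inline std::uint32_t read_u32(const unsigned char* p, ByteOrder order)
{
    assert(p != nullptr);
    const std::uint32_t b0 = p[0], b1 = p[1], b2 = p[2], b3 = p[3];
    return order == ByteOrder::BigEndian
        ? (b0 << 24) | (b1 << 16) | (b2 << 8) | b3
        : (b3 << 24) | (b2 << 16) | (b1 << 8) | b0;
}
```

Reading byte by byte like this gives the same result regardless of the byte order of the machine the program itself runs on, so a reader would only need to know which kind of file it is looking at.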
At the moment I have only very limited time available
for supporting anyone continuing the reverse engineering
of the Quark Xpress formats. Please do not ask me any
questions about the code, because if you are not able to
read the code as it has been provided, you very likely will
not be able to reverse engineer the binary file format any
further. (Read: Requirements.)
If you want to continue working on the Windows file formats,
please read the last section on the page.
For professional conversions of Quark Xpress to
XML, I refer you to the following resources:
- Data Conversion Laboratory (http://www.dclab.com/QuarktoXML.html).
- iCPS (http://www.pcipage.com/).
You can download the sources in a single zip
file from here. The
sources compile with the Cygnus gcc compiler
in the Cygnus Unix environment under Windows.
To build the program, simply compile the file
scan.cpp, as it includes all the other
sources.
Please note that the files CQXDoc.cpp
and CDatabase.cpp are generated by the
cls2cpp program
from the files CQXDoc.cls and
CDatabase.cls. Please do not
edit these .cpp files, but generate
them from the .cls files. You could use
the following shell script for building the
program:
#!/bin/sh
make cls2cpp
cls2cpp CDatabase
cls2cpp CQXDoc
gcc -g -Wall scan.cpp -o scan.exe
Of course, you could also write a small makefile
for doing the job. I did not take the effort, as it would only save
the half second it takes to run the cls2cpp program each time.
Below is a short description of the files found in the source
distribution.
The file scan.cpp
The main file in the source distribution is the file
scan.cpp. This file includes all the other
files. No header files have been used. With current-day
computers, it is often much faster to simply include all
the sources into a single file than to compile all the
C++ files into separate object files and then link
them together. Also for larger projects, where most of
the time is spent on reading large numbers of include files,
this could be a much faster approach than the traditional
way of compiling and linking.
The file stddef.c
Just a collection of handy functions and macros that I often
use in my C/C++ programs.
These files implement a number of classes to read data
from a buffer. The class CBuf implements the buffer, and
the classes CReadBuf and CReadButWithBlocks implement
procedures to read various kinds of values from a CBuf
buffer.
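The split between a buffer class and a reader class could look roughly like the sketch below. This is not the actual interface of CBuf or CReadBuf. The names ByteBuffer and BufferReader are hypothetical, and the kinds of values shown (single bytes, 16-bit integers with the most significant byte first, and strings prefixed by a length byte) are assumptions chosen for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a buffer class: it only holds the raw bytes.
struct ByteBuffer {
    std::vector<unsigned char> bytes;
};

// Hypothetical stand-in for a reader class: a cursor over a ByteBuffer
// that refuses to read past the end of the data.
class BufferReader {
public:
    explicit BufferReader(const ByteBuffer& buf) : buf_(buf), pos_(0) {}

    std::size_t position() const { return pos_; }
    bool at_end() const { return pos_ == buf_.bytes.size(); }

    std::uint8_t read_u8()
    {
        need(1);
        return buf_.bytes[pos_++];
    }

    // 16-bit integer, most significant byte first (an assumption).
    std::uint16_t read_u16()
    {
        need(2);
        const unsigned hi = buf_.bytes[pos_];
        const unsigned lo = buf_.bytes[pos_ + 1];
        pos_ += 2;
        return static_cast<std::uint16_t>((hi << 8) | lo);
    }

    // String stored as one length byte followed by that many characters.
    std::string read_length_prefixed_string()
    {
        const std::size_t len = read_u8();
        need(len);
        std::string s(buf_.bytes.begin() + pos_, buf_.bytes.begin() + pos_ + len);
        pos_ += len;
        return s;
    }

private:
    // Throw if fewer than n bytes remain after the cursor.
    void need(std::size_t n) const
    {
        assert(pos_ <= buf_.bytes.size());
        if (n > buf_.bytes.size() - pos_)
            throw std::out_of_range("read past end of buffer");
    }

    const ByteBuffer& buf_;
    std::size_t pos_;
};
```

Keeping the cursor and the bounds check in one place means that a scanner built on top of such a reader fails with a clear error on a truncated file instead of reading arbitrary memory.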
The files MMFile.cpp and MMFileDummy.cpp
The file MMFile.cpp implements a
persistent store
(database) making use of a Memory Mapped File.
The file MMFileDummy.cpp implements a replacement
for MMFile.cpp which is not persistent. The scan.cpp
provided in the distribution uses the non-persistent implementation.
If you want to use the persistent implementation, you might
want to change the filename used in the open method,
and increase the size of the store. The program may crash in case
of an overflow.
The files CQXDoc.cls (and CQXDoc.cpp)
This defines the classes for storing the logical structure
of Quark Xpress documents, including many of their style
definitions.
It also contains the class CTextAccessor which is
an accessor to formatted text from a text fragment with all
its formatting instructions. For an example of how to use it,
see the file DumpQXDoc.cpp.
It also contains the class CTextOnFramesAccessor
which could be used to walk over the whole text of a book.
There are no examples of its use given in the code distribution,
but you should be able to figure out how to use it by yourself.
It also contains some elementary parsing methods.
The files CDatabase.cls (and CDatabase.cpp)
This defines a few classes for organizing some Quark Xpress
files into books and maintaining a collection of books.
The file scanQXDoc.cpp
This contains the actual scanner. It makes heavy
use of some tricky defines. The idea is that the code
describes the grammar, but in case of an error, the
parsing jumps back to a certain point and repeats the
parsing, but now with dumping information. This makes
it easier to figure out what went wrong. The system
does not always work perfectly.
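The idea of parsing silently first and repeating the parse with dumping after an error can be sketched as follows. This is not the code from scanQXDoc.cpp and it does not use defines. The class name Scanner and the tiny record grammar (a tag byte 0x51, a count byte, and that many item bytes) are invented purely for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

class Scanner {
public:
    explicit Scanner(const std::vector<unsigned char>& data)
        : data_(data), pos_(0), dumping_(false) {}

    // Invented grammar: tag 0x51, then a count, then `count` item bytes.
    bool parse_record()
    {
        unsigned tag = 0, count = 0;
        if (!byte("tag", tag) || tag != 0x51) return fail("expected tag 0x51");
        if (!byte("count", count)) return fail("missing count");
        for (unsigned i = 0; i < count; ++i) {
            unsigned item = 0;
            if (!byte("item", item)) return fail("payload truncated");
        }
        return true;
    }

    // Parse silently; on failure jump back and parse again while dumping.
    bool parse_with_retry()
    {
        const std::size_t start = pos_;
        if (parse_record()) return true;
        pos_ = start;
        dumping_ = true;
        const bool ok = parse_record();  // same input, so it fails again
        assert(!ok);
        (void)ok;
        dumping_ = false;
        pos_ = start;
        return false;
    }

    std::string trace() const { return trace_.str(); }

private:
    // Read one byte; when dumping, record its offset, name and value.
    bool byte(const char* name, unsigned& out)
    {
        if (pos_ >= data_.size()) return false;
        out = data_[pos_];
        if (dumping_) trace_ << pos_ << ": " << name << " = " << out << "\n";
        ++pos_;
        return true;
    }

    bool fail(const char* why)
    {
        if (dumping_) trace_ << pos_ << ": error: " << why << "\n";
        return false;
    }

    const std::vector<unsigned char>& data_;
    std::size_t pos_;
    bool dumping_;
    std::ostringstream trace_;
};
```

The advantage of this arrangement is that a successful parse produces no output at all, while a failed parse leaves a trace of every value read between the restart point and the error.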
The file FrameGeom.cpp
This file contains some code for determining the natural
reading order of the frames. It also deals with nested
frames. The algorithm used is probably not perfect, but it
served my purpose well. After the main routine has been
called, all frames found in the documents of a "book" are
linked through first_frame_reading_order and
next_reading_order.
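As a rough illustration of linking frames in reading order, the sketch below sorts frames from top to bottom and then from left to right, and chains them through a next_reading_order pointer. This is a much simpler rule than whatever FrameGeom.cpp does: it ignores nested frames and columns, and the Frame struct and the function link_reading_order are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical frame: a bounding box plus the link to the next frame.
struct Frame {
    double left, top, right, bottom;
    Frame* next_reading_order;
};

// Link all frames in a simple top-to-bottom, left-to-right order and
// return the first frame in that order (nullptr if there are no frames).
inline Frame* link_reading_order(std::vector<Frame>& frames)
{
    std::vector<Frame*> order;
    for (Frame& f : frames) {
        assert(f.left <= f.right && f.top <= f.bottom);
        order.push_back(&f);
    }
    std::stable_sort(order.begin(), order.end(),
                     [](const Frame* a, const Frame* b) {
                         if (a->top != b->top) return a->top < b->top;
                         return a->left < b->left;
                     });
    for (std::size_t i = 0; i < order.size(); ++i)
        order[i]->next_reading_order =
            (i + 1 < order.size()) ? order[i + 1] : nullptr;
    return order.empty() ? nullptr : order.front();
}
```

The returned pointer plays the role of first_frame_reading_order: starting from it and following next_reading_order visits every frame exactly once.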
The file DumpQXDoc.cpp
This file contains some routines to dump the information
to a file as either plain text or HTML, but it could be modified
to dump it in any format you want. This is more an example
than a working piece of code. A lot of the intelligence is
in the class CTextAccessor from the file
CQXDoc.cls.
For those who want to continue the work on reverse engineering
the Windows file formats, I hereby also give access to the
latest version of the program which can
read some files produced by Quark Xpress 4.1 for Windows. I have
not been able to date the version. It is definitely later than
May 22, 2001. I think it is from
earlier this year, as it makes use of an early implementation
of the class CBuf.
Actually, this version was produced on October 12, 2002, when I
made some last modifications to make it generate an XML file that
can be viewed with IE!
I am not very proud of this program, because the code contains
a lot of rubbish. Please do not look at it if you are not an
expert programmer. At some points it might even cause more
confusion than it helps. (I am afraid it does contain some
amount of dead code.) When run, it produces a lot of debugging
output on stdout. I usually redirect this to a file.
A file with the extension .xml will be generated if
the program does not crash, which I am afraid is very likely
if you feed it an arbitrary Quark Xpress 4.1 document.
If you want to contribute to the reverse engineering of the Quark
Xpress file formats, do not develop this program further, but rather
make modifications to the latest source base.
I am not willing to publish any modifications to the qq.cpp
program.