Cracking the DWG R13 file format
This page gives an account of an attempt to
crack the DWG R13
file format, which is the AutoCAD native drawing format.
I already cracked most of the R12 format,
and heard that the
R13 format is very different, so I decided to have a look
myself, and see whether I can make something out of it.
All software referenced in this page is copyrighted by
Frans Faase
Introduction
The most important thing for cracking a file format is having the
right tools. I prefer to use C programs which I wrote myself for
finding the internal structure of a binary file. And, of course,
you need a nice Hex viewer, such as the one of Norton Commander
(or Midnight Commander, if you
are working with Linux, as I am doing right now).
It is also important to have as much information as possible
available. AutoCAD does have an alternative format for storing
files, the DXF format, which is (almost) completely specified.
This really makes my job easier. I have got myself three DWG files
with their DXF counterparts, to start with.
This time I have decided to write a program that analyzes a
DWG R13 file by making use of the contents of the DXF file. I also
have decided to use this as an example of how binary file formats
can be decoded (by means of keeping an account, which is what you
are reading now).
Let's see how far we will get.
Dec 23, 1995: Making a DXF low level reading program
I have to do this job again, although I did it in the past, but
sadly that code is now owned by someone else. A good chance to
try it in a new way. I am going to make a new set of low-level
file reading procedures based on the ideas in
bfc.c.
I wrote the files:
- common.h and common.c:
some common definitions and procedures.
- fio.h and fio.c:
low-level file reading procedures.
- rd_dxf.h and rd_dxf.c:
low-level DXF reading procedures for binary DXF files.
- scan.c: the main program, for the moment, which is almost
empty.
I learned some interesting things about DXF R13, which appears
to be rather different from DXF R12. My documentation does not give
all the details needed to read the format. So, now I find myself
decoding the DXF R13 format.
The first thing that I discovered was that R13 no longer uses
a single byte for the group code (with 255 as an escape value for
larger values), as R12 does, but a word.
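The difference between the two encodings can be sketched as follows. This is not the code from rd_dxf.c: the function name read_group_code is hypothetical, and the sketch assumes that words are stored little-endian and that the R12 escape byte 255 is followed by a two-byte value.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one group code from a binary DXF buffer.
 * r13 == 0: R12 style, one byte, where 255 escapes to a following
 *           16-bit value (assumed layout).
 * r13 != 0: R13 style, always a 16-bit value.
 * Words are assumed to be little-endian.
 * Returns the number of bytes consumed, or 0 if the buffer is too short. */
size_t read_group_code(const uint8_t *buf, size_t len, int r13, int *code)
{
    if (r13) {
        if (len < 2)
            return 0;
        *code = buf[0] | (buf[1] << 8);
        return 2;
    }
    if (len < 1)
        return 0;
    if (buf[0] != 255) {
        *code = buf[0];
        return 1;
    }
    if (len < 3)
        return 0;
    *code = buf[1] | (buf[2] << 8);
    return 3;
}
```

Under these assumptions the byte pair 13h 01h decodes to 275 in R13 mode, while a single byte 10 decodes to group code 10 in R12 mode.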
From some of the new group codes, I guessed the type, based on the
general pattern that ?00 to ?09 are strings and ?70 to ?78 are integers,
and most others are doubles. But then I got stuck with group code 340,
which does not seem to be an integer or a double.
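The guessing rule described above can be written down as a small function. This is only a sketch of the heuristic as stated in the text; the names group_type and guess_group_type are made up for illustration.

```c
/* Value types the heuristic distinguishes (hypothetical names). */
typedef enum { GROUP_STRING, GROUP_INTEGER, GROUP_DOUBLE } group_type;

/* Guess the value type from the last two decimal digits of the group
 * code: ?00-?09 are strings, ?70-?78 are integers, the rest doubles. */
group_type guess_group_type(int code)
{
    int low = code % 100;

    if (low <= 9)
        return GROUP_STRING;
    if (low >= 70 && low <= 78)
        return GROUP_INTEGER;
    return GROUP_DOUBLE;
}
```

For group code 340 this rule answers "double", which is exactly the case where the pattern breaks down.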
Enough for today.
Dec 24, 1995: The 310 to 369 group codes in DXF R13
The Customization Guide, on page 543, gives the following information:
- 310-319: Arbitrary binary chunks
- 320-329: Arbitrary binary handles
- 330-339: Soft pointer handle (specifies pointer to other object in drawing)
- 340-349: Hard pointer handle (specifies pointer to other object in drawing)
- 350-359: Soft owner handle (specifies ownership to other object in drawing)
- 360-369: Hard owner handle (specifies ownership to other object in drawing)
Note that the 170-178 and the 270-278 group codes are omitted in this
table; they appear in the group code summary for DIMSTYLE,
see pages 561 to 563!
I decided to change the rd_dxf.c program such that it gives
the offset of the current group code, so that I can look it up
with my hex viewer. I have to print it in hex, maybe the group code
also. The group code 340 is followed by the string `10\0',
and then 13h 01h, which
looks like a group code (256 + 16 + 3 = 275). Let's assume that the
group codes 320 to 369 are stored as strings.
Funny, the group codes 280 to 289 (8-bit integer values) appear to be
stored in two bytes. Okay, now I can correctly scan the three files that
I am using as a start. I also added some argument scanning to the main
program. My files are now:
- common.h.1 and
common.c.1
- fio.h.1 and
fio.c.1
- rd_dxf.h.1 and
rd_dxf.c.1
- scan.c.1
These, and all other source files can be found
here.
Collecting interesting strings and doubles
The next step is to collect those strings and doubles from the DXF file,
which probably also occur in the DWG file. For each of these we will
store the positions (= element number) where they occurred in the
DXF file.
I took all the strings with group codes 1 to 9 (inclusive), but it
turns out that there are many with 9 that are not interesting. Let's
leave those out.
I made the scanning part, and decided to print the strings and
doubles that are not found in the DWG file, after we have scanned them.
This resulted in the file scan.c.2.
It appeared that most interesting strings are found, but that many
doubles are not found, which surprises me.
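Looking for a double from the DXF file in the DWG file amounts to searching for its 8-byte image. A minimal sketch, not taken from scan.c: the name find_double is hypothetical, and the code assumes the file stores doubles in the same byte order as the machine running the search.

```c
#include <stddef.h>
#include <string.h>

/* Return the offset of the first byte-aligned occurrence of the 8-byte
 * in-memory image of `value` in buf, or -1 if it does not occur.
 * Assumes the file uses the host's byte order for doubles. */
long find_double(const unsigned char *buf, size_t len, double value)
{
    unsigned char pat[sizeof(double)];
    size_t i;

    memcpy(pat, &value, sizeof pat);
    for (i = 0; i + sizeof pat <= len; i++)
        if (memcmp(buf + i, pat, sizeof pat) == 0)
            return (long)i;
    return -1;
}
```

A search like this only finds values that start on a byte boundary, so it says nothing about data that is stored in some other way.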
Comparing the output of the program for two DWG files showed
that some long word values differ. These are probably again the
pointers that tell where certain parts of the file start, just as
in the DWG R12 format.
Dec 25, 1995: Finding some pointers
I added an extra field to each output line generated by scan.c
which tells the starting position of this line in the file. Then
I analyzed the difference between the output of two of the
drawings. I found possible pointers at: 13, 30, 35, 44, 48, 53 and 95.
Let's check what these are pointing at, and assume that these are
borders of certain sections. Now search for doubles and strings
that are found in the DXF file, and see whether these can give us
some clue about the meaning of certain sections.
The X-coordinate of $EXTMAX is at 335. At 249 there is a
$PEXTMIN value. The section starting at the pointer at 95
seems to have something to do with dim-styles.
Not much progress any more. Let's see if we can find more doubles
in the DWG file. (Maybe some transformations have been performed, which
make the values differ from those in the DXF file.)
That does not give us much either. I feel stuck now.
What have they done: some kind of encoding (compression) or bit-shifting?
At the end of the file, there is a rather repetitive part. There are
pointers in there to one and the same location, which might be the
start of the entities, as some of the tables seem to be located
more towards the end of the file, as appears from the strings I found.
The DWG R12 format also had some pointers at the end, but most were
in the first part.
Bit-shifting was the right idea! Now I find almost all the doubles,
and those that I do not find seem to be missing for a logical
reason (time-stamps and angles in degrees). This is a great
step, as it reveals a lot of the structure. A good point to stop for
today.
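A search that allows for bit-shifting has to try every bit position instead of every byte position. The following is a sketch of that idea, not the code in fio.c or scan.c. The names are hypothetical, and two things are assumed: bits are taken most significant first within each byte, and the file uses the host's byte order for doubles.

```c
#include <stddef.h>
#include <string.h>

/* Read 8 bits starting at absolute bit position bitpos, most significant
 * bit of each byte first (an assumed bit order). The caller must ensure
 * that bitpos + 8 <= 8 * len. */
static unsigned char byte_at_bit(const unsigned char *buf, size_t len,
                                 size_t bitpos)
{
    size_t i = bitpos / 8;
    unsigned shift = (unsigned)(bitpos % 8);
    unsigned hi = buf[i];
    unsigned lo = (shift != 0 && i + 1 < len) ? buf[i + 1] : 0;

    return (unsigned char)(((hi << shift) | (lo >> (8 - shift))) & 0xFF);
}

/* Return the first bit position at which the 8-byte image of `value`
 * occurs in buf, or -1 if it occurs at no bit offset. */
long find_double_bits(const unsigned char *buf, size_t len, double value)
{
    unsigned char pat[sizeof(double)];
    size_t bit, last, k;

    memcpy(pat, &value, sizeof pat);
    if (len < sizeof pat)
        return -1;
    last = 8 * (len - sizeof pat);
    for (bit = 0; bit <= last; bit++) {
        k = 0;
        while (k < sizeof pat &&
               byte_at_bit(buf, len, bit + 8 * k) == pat[k])
            k++;
        if (k == sizeof pat)
            return (long)bit;
    }
    return -1;
}
```

The returned value divided by 8 gives the byte offset and the remainder gives the bit offset within that byte.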
For the following files there are newer versions:
fio.h.2,
fio.c.2, and
scan.c.3.
Dec 27, 1995: Looking at the entities (objects)
There is a small bug in scan.c: for a double, I
skipped the wrong number of bytes, namely str_len.
(Maybe I should conclude that most bugs I make are caused
by copying some code from somewhere else, in order to quickly
write something, and then forgetting to modify it correctly.)
For a LINE entity we have:
- 10-group (starting at 3-bit offset)
- 2-bit value
- 20-group (starting at 5-bit offset)
- 4-bit value
- 11-group (starting at 1-bit offset)
- 2-bit value
- 21-group (starting at 3-bit offset)
- 20 bytes to the next 10-group of the second line.
In total 1 + 20 + 4 * 8 = 53 (35h) bytes. Maybe this occurs as a
length byte, just as in DWG R12. The closest is 2f.
I decided that it was better to compare an empty DWG file with one
containing only a single entity, to find the borders. I got
myself some more DWG files, each containing a single LINE
entity with all kinds of different things set (color, line-type,
thickness and elevation).
(After some hours of comparing the output files:)
It seems that once an entity is added to an empty drawing, a viewport
entity is created.
The first position at which things start to differ is 4020.
I still feel that I am not making much progress.
Dec 28, 1995: Looking at the layers
After yesterday's futile attempts to do something with the entities,
I decided that I should start looking at the layers for a change.
It seems that many strings start with a byte indicating their length.
After checking the positions of the (suspected) pointers (at 13, 30 ..),
I have to conclude that they might not be pointers, or at least
not from the beginning of the file.
I found some more information about the structure of the layer table:
Each record starts with a word specifying the length of the data,
followed by the data, which appears to be followed by a word CRC.
This appears to be a repeated pattern! Is this another key to how
the format is stored? Yes, it seems a rather consistent pattern.
It makes me happy to find such a structure, after some days of hopelessly
looking around, and finding nothing substantial. A good moment to
quit for the time being.
Dec 29, 1995: Continuing the search
I adapted scan.c in such a way that it checks the structure
I have found above. I let it start with the first position that I
have found. It appears to work, and the position at which it fails
is equal to the value given at position 44. This means that it
is probably a pointer. Maybe, one of the other pointers is giving
the start position of the records.
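Checking such a chain of records comes down to hopping from one length word to the next. A minimal sketch under stated assumptions, not the actual scan.c: the name walk_records is hypothetical, words are assumed to be little-endian, the CRC is only skipped (its algorithm is not known here), and a zero length or a record running past the end of the buffer is taken as the point of failure.

```c
#include <stddef.h>

/* Walk records laid out as: 16-bit length, `length` data bytes,
 * 16-bit CRC. Little-endian words are an assumption; the CRC is
 * skipped, not verified. Returns the offset at which the chain stops
 * fitting and stores the number of records walked in *count. */
size_t walk_records(const unsigned char *buf, size_t len, size_t start,
                    size_t *count)
{
    size_t pos = start;

    *count = 0;
    while (pos + 2 <= len) {
        size_t n = buf[pos] | ((size_t)buf[pos + 1] << 8);

        if (n == 0 || pos + 2 + n + 2 > len)
            break;
        pos += 2 + n + 2;   /* length word + data + CRC word */
        (*count)++;
    }
    return pos;
}
```

The returned offset is the kind of value that can then be compared against the suspected pointers in the file header.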
(And then I did something stupid, and overwrote this file with an
older version. Luckily, I succeeded in recovering everything by
running more on /dev/hda3 and searching for some strings.)
Jan 2, 1996: Continuing the search
I checked the beginning of the record chain, and concluded that it
is not determined by one of the pointers that I found so far. I decided
to determine it, based on the number 0x71d01767 that was found in all
files just before (10 bytes) the start of the record chain.
I decided to change the string matching based on the fact that all
strings start with a byte indicating the length.
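Matching on the length byte as well as on the characters makes false hits much less likely. A small sketch of such a matcher follows. The name find_prefixed_string is hypothetical, and the code assumes the length byte sits directly in front of the characters and that the string starts on a byte boundary.

```c
#include <stddef.h>
#include <string.h>

/* Return the offset of the length byte of the first occurrence of s
 * stored as <length byte><characters>, or -1 if there is none.
 * Only strings of 1 to 255 characters can be represented this way. */
long find_prefixed_string(const unsigned char *buf, size_t len,
                          const char *s)
{
    size_t n = strlen(s);
    size_t i;

    if (n == 0 || n > 255)
        return -1;
    for (i = 0; i + 1 + n <= len; i++)
        if (buf[i] == n && memcmp(buf + i + 1, s, n) == 0)
            return (long)i;
    return -1;
}
```

In the test data used for this sketch, an occurrence of the characters without the right length byte in front is skipped, which is the whole point of the stricter match.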
The first byte of the record seems to indicate the kind of the record.
The fourth to seventh bytes look like a long word. No, it seems that the
first two bytes contain the kind.
I have found the following:
- 40 40 - TEXT
- 41 00 - END
- 41 40 - BEGIN
- 41 80 - SEQEND
- 41 C0 - INSERT
- 42 80 - VERTEX
- 43 C0 - POLYLINE
- 44 80 - CIRCLE
- 44 C0 - LINE
- 46 C0 - POINT
- 4C 40 - VIEWPORT
- 4C C0 - LAYER
- 4D 40 - STYLE
- 4E 40 - LTYPE
- 50 40 - VPORT
- 50 C0 - APPID
- 51 40 - DIMSTYLE
From this it seems that the first two bits are always 01, and that the next
8 bits contain the kind. All these results can be found in the new version,
scan.c.4.
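Extracting those 8 bits takes the lower six bits of the first byte and the upper two bits of the second. The sketch below shows this with the byte pairs from the list above. It is an illustration, not the code in scan.c, and the names record_kind and kind_name are hypothetical.

```c
#include <stddef.h>

/* Extract the record kind from the first two bytes of a record: the top
 * two bits must be 01 and the following 8 bits hold the kind.
 * Returns -1 if the prefix is not 01. */
int record_kind(unsigned char b0, unsigned char b1)
{
    if ((b0 >> 6) != 1)
        return -1;
    return ((b0 & 0x3F) << 2) | (b1 >> 6);
}

/* The byte pairs observed so far, with the names guessed for them. */
static const struct { unsigned char b0, b1; const char *name; } kinds[] = {
    {0x40, 0x40, "TEXT"},     {0x41, 0x00, "END"},
    {0x41, 0x40, "BEGIN"},    {0x41, 0x80, "SEQEND"},
    {0x41, 0xC0, "INSERT"},   {0x42, 0x80, "VERTEX"},
    {0x43, 0xC0, "POLYLINE"}, {0x44, 0x80, "CIRCLE"},
    {0x44, 0xC0, "LINE"},     {0x46, 0xC0, "POINT"},
    {0x4C, 0x40, "VIEWPORT"}, {0x4C, 0xC0, "LAYER"},
    {0x4D, 0x40, "STYLE"},    {0x4E, 0x40, "LTYPE"},
    {0x50, 0x40, "VPORT"},    {0x50, 0xC0, "APPID"},
    {0x51, 0x40, "DIMSTYLE"}
};

/* Look up the name for the kind encoded in b0/b1; NULL if unknown. */
const char *kind_name(unsigned char b0, unsigned char b1)
{
    int kind = record_kind(b0, b1);
    size_t i;

    if (kind < 0)
        return NULL;
    for (i = 0; i < sizeof kinds / sizeof kinds[0]; i++)
        if (record_kind(kinds[i].b0, kinds[i].b1) == kind)
            return kinds[i].name;
    return NULL;
}
```

With this extraction the pair 40 40 gives kind 1 and the pair 44 C0 gives kind 19. The lower six bits of the second byte are ignored, so they are free to belong to whatever field comes next.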
Jan 9, 1996: mail from Jason Osgood
Today I received an email from Jason, telling me that the R13 format will
change again for R13c4. He also suggested that I should contact
Robert McNeel. Jason wrote:
He says his firm will reverse engineer the format and update their DWG routines
once the format becomes stable. Robert is very much in favor of having the
DWG format in the public domain. (The money he charges for their library of
routines is to cover technical support and the like.)
Of course, I sent an email to Robert McNeel.
Jan 17, 1996
It has been some time since I last did anything. Reini suggested
that they may use some kind of default C++ routines to save and
load the records.
I built the above table into the program.
Jan 18, 1996: checking an idea about AcDbRecord pointers
Yes, my conclusion is right. The kind byte is followed by a length
byte, followed by a number of bytes containing the group 5 (or 105) string
just before the AcDb??Record found in a group 100 string. This
resulted in the following version for
scan.c.5.
Jan 19-21, 1996: developing the software
I have started to rewrite the software, for the purpose of analysing
the records. I have created two new files
s_dxf.h.1 and
s_dxf.c.1, which contain routines
for reading all the records stored in the DXF file. The records are
recognized by the use of a handle (a 5 or 105 group). Now I still
have to adapt scan.c such that it can analyse each record
with the information found in the DXF file about the record.
Jan 22, 1996
I continued working on the above, and wrote a function to analyse one
record. But when I checked it, some things seemed to be missing. So, it
contains some bugs.
Jan 23, 1996
Got an email from Robert McNeel.
He told me that the R13c4 format will only change in that ACIS 1.5
is replaced by ACIS 1.6. They will only reverse engineer the format
by the end of the year. He also told me that
Cyco International (in Holland) has
reverse engineered the R13 format. I immediately sent out an email
to Vincent Everts to ask about this.
Jan 24, 1996
I modified rd_dxf.h and rd_dxf.c such that the
groups that are read can be listed to a file, with some
options determining the printing of the position in the file, and
how the group code is printed. This used to be a debugging option.
Jan 25, 1996: finishing a new version
I finished a new version. Maybe I should call this
version 0.1. This version supports the following options:
- -e: extensive output
- -np: don't print position in DWG-file.
- -dxf<m>: output DXF-file, m = 1 (position) + 2 (code) + 4
(code hexadecimal).
- -k<n>: only output records of kind n.
- -all: print also data outside records.
- -check_used: print information about whether doubles and strings in
the DXF file were found in the DWG file.
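The -dxf<m> option packs three switches into one number. How such an argument could be parsed is sketched below. This is not the argument scanning from scan.c, and the names parse_dxf_option, DXF_POSITION, DXF_CODE and DXF_CODE_HEX are made up for the example.

```c
#include <stdlib.h>
#include <string.h>

/* Bit values of m in "-dxf<m>", as listed above (hypothetical names). */
enum { DXF_POSITION = 1, DXF_CODE = 2, DXF_CODE_HEX = 4 };

/* Parse an argument of the form "-dxf<m>". Returns m (0..7), or -1 if
 * the argument has another form or m is out of range. */
int parse_dxf_option(const char *arg)
{
    char *end;
    long m;

    if (strncmp(arg, "-dxf", 4) != 0)
        return -1;
    m = strtol(arg + 4, &end, 10);
    if (end == arg + 4 || *end != '\0' || m < 0 || m > 7)
        return -1;
    return (int)m;
}
```

For example, -dxf5 would select the position and the hexadecimal code, since 5 = 1 + 4; each switch is then tested with a bitwise AND against the returned value.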
The files of this version are:
- common.h.2 and
common.c.1
- fio.h.3 and
fio.c.2
- rd_dxf.h.2 and
rd_dxf.c.2
- scan.c.6
Jan 27, 1996: comparing the results for LINE
I adapted scan.c a little, making the compact output even more
compact. I found the following sequences:
8B A0 20 00 05 EC 0b d10 00b d20 1000b d11 00b d21 AA 70 40 CC 14 4C 00000b
8B A0 20 00 15 6C 0b d10 00b d20 1000b d11 00b d21 AA 4C 14 43 D0 10 00000b
8B A0 20 00 15 6C 0b d10 00b d20 1000b d11 00b d21 AA 4C 14 43 D0 10 00000b
8B A0 20 00 15 6C 0b d10 00b d20 1000b d11 00b d21 AA 4C 14 43 D0 10 00000b
8B A0 20 00 15 6C 0b d10 00b d20 1000b d11 00b d21 AA 4C 14 43 D0 10 00001b
8B A0 20 00 15 6C 0b d10 00b d20 1000b d11 00b d21 AA 4C 14 43 D0 28 40 11000b
8B A0 20 00 15 6C 0b d10 00b d20 1000b d11 00b d21 AA 4C 14 43 D0 28 41 00000b
8B A0 20 00 15 6C 0b d10 00b d20 1000b d11 00b d21 AA 4C 14 43 D0 28 41 00000b
8B A0 20 00 15 6C 0b d10 00b d20 1000b d11 00b d21 AA 4C 14 43 F0 40 D0 00000b
8B A0 20 00 15 6C 0b d10 00b d20 1000b d11 00b d21 AA 4C 14 4C 70 41 18 00000b
8B A0 20 00 15 6C 0b d10 00b d20 1000b d11 00b d21 AA 4C 14 4C 70 41 18 00000b
8B A0 20 00 15 6C 0b d10 00b d20 1000b d11 00b d21 AA 4C 14 4C A0 28 40 10000b
8B A0 20 00 15 EC 0b d10 00b d20 1000b d11 00b d21 AA 4C 14 4C 00110b
8B A0 20 00 15 EC 0b d10 00b d20 1000b d11 00b d21 AA 4C 14 4C 10110b
8C A0 20 00 15 20 AC 0b d10 00b d20 1000b d11 00b d21 AA 4C 14 43 D0 10 00000b
8C A0 20 00 14 20 AC 0b d10 00b d20 1000b d11 00b d21 AA 4C 14 43 D4 45 90 10 00000b
84 A0 40 00 14 20 AC 0b d10 00b d20 00b d30 00b d11 00b d21 00b d30 00b d39 A4 C1 44 3D 44 59 01 00 0b
94 A0 20 00 14 20 AC 0b d10 00b d20 1000b d11 00b d21 1000b d39 A4 C1 44 3D 44 59 01 00 0b
The code d10 stands for double with group code 10.
There is only one line starting with `9'. It is unclear what this means.
Then we have 'B':1011, 'C':1100, and '4':0100 as the second hex digit. My
attempts to figure out what these mean have failed. It almost looks as if
the last 3 bits are some kind of counter, saying how many special elements
there are.
(And this is how far I got, and ever will get!)
Last versions of the files are:
- common.h and
common.c
- fio.h and
fio.c
- rd_dxf.h and
rd_dxf.c
- s_dxf.h and
s_dxf.c
- scan.c
As these are working versions, I cannot give any guarantee that
they are correct, not even that they can be compiled without errors.