Cracking the DWG R13 file format

This page gives an account of an attempt to crack the DWG R13 file format, which is the AutoCAD native drawing format. I already cracked most of the R12 format, and heard that the R13 format is very different, so I decided to have a look myself, and see whether I can make something out of it.

All software referenced on this page is copyrighted by Frans Faase.

Introduction

The most important thing for cracking a file format is having the right tools. I prefer to use C programs which I wrote myself for finding the internal structure of a binary file. And, of course, you need a good hex viewer, such as the one in Norton Commander (or Midnight Commander, if you are working with Linux, as I am doing right now).

It is also important to have as much information available as possible. AutoCAD does have an alternative format for storing files, the DXF format, which is (almost) completely specified. This really makes my job easier. I have got myself three DWG files with their DXF counterparts, to start with.

This time I have decided to write a program that analyzes a DWG R13 file by making use of the contents of the DXF file. I also have decided to use this as an example of how binary file formats can be decoded (by means of keeping an account, which is what you are reading now).

Let's see how far we will get.

Dec 23, 1995: Making a DXF low level reading program

I have to do this job again, although I did it in the past, but sadly that code is now owned by someone else. A good chance to try it in a new way. I am going to make a new set of low-level file reading procedures based on the ideas in bfc.c.

I wrote the files:

common.h and common.c: some common definitions and procedures.
fio.h and fio.c: low-level file reading procedures.
rd_dxf.h and rd_dxf.c: low-level DXF reading procedures for binary DXF files.
scan.c: the main program, for the moment, which is almost empty.

I learned some interesting things about DXF R13, which appears to be rather different from DXF R12. My documentation does not give all the details needed to read the format. So, now I find myself decoding the DXF R13 format.

The first thing that I discovered was that R13 no longer uses a single byte for the group code (with 255 as an escape value for larger values), as R12 did, but a word.
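As a minimal sketch (not the actual code from rd_dxf.c), reading a group code in both styles could look like the function below. The name read_group_code is hypothetical, and two details are assumptions: that the word is stored low byte first, and that in the R12 style the escape byte 255 is followed by such a word.

```c
#include <stddef.h>

/* Decode one group code from a binary DXF buffer.
 * r13 != 0: the code is a 16-bit word (low byte first, an assumption).
 * r13 == 0: the code is one byte; 255 escapes to a following word
 *           (the width of that escaped value is an assumption).
 * Returns the number of bytes consumed, or 0 if the buffer is too short. */
size_t read_group_code(const unsigned char *buf, size_t len, int r13, int *code)
{
    if (r13) {
        if (len < 2)
            return 0;
        *code = buf[0] | (buf[1] << 8);
        return 2;
    }
    if (len < 1)
        return 0;
    if (buf[0] != 255) {
        *code = buf[0];
        return 1;
    }
    if (len < 3)
        return 0;
    *code = buf[1] | (buf[2] << 8);
    return 3;
}
```

Returning the number of bytes consumed lets the caller advance its file position without knowing which of the two layouts was used.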

From some of the new group codes, I guessed the type, based on the general pattern that ?00 to ?09 are strings, ?70 to ?78 are integers, and most others are doubles. But then I got stuck with group code 340, which does not seem to be an integer nor a double. Enough for today.

Dec 24, 1995: The 310 to 369 group codes in DXF R13

The Customization Guide, on page 543 gives the following information:

310-319: Arbitrary binary chunks
320-329: Arbitrary binary handles
330-339: Soft pointer handle (specifies pointer to other object in drawing)
340-349: Hard pointer handle (specifies pointer to other object in drawing)
350-359: Soft owner handle (specifies ownership to other object in drawing)
360-369: Hard owner handle (specifies ownership to other object in drawing)

Note that the 170-178 and the 270-278 group codes are omitted from this table; they appear in the group code summary for DIMSTYLE, see pages 561 to 563!

I decided to change the rd_dxf.c program so that it gives the offset of the current group code, so that I can look it up with my hex viewer. I have to print it in hex, and maybe the group code also. The group code 340 is followed by the string `10\0', and then 13h 01h, which looks like a group code (256 + 16 + 3 = 275). Let's assume that the group codes 320 to 369 are stored as strings.
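The working guesses so far can be written down as a small classification function. This is only a sketch of the pattern described in this account, not the official DXF type table; the names group_type and guess_group_type are hypothetical.

```c
/* The three storage types that have been guessed so far. */
typedef enum { GT_STRING, GT_INT, GT_DOUBLE } group_type;

/* Guess how the value of a group is stored, from its group code.
 * These rules are working assumptions, not taken from a specification. */
group_type guess_group_type(int code)
{
    int low = code % 100;          /* the last two decimal digits */

    if (code >= 320 && code <= 369)
        return GT_STRING;          /* handles, assumed to be stored as strings */
    if (low <= 9)
        return GT_STRING;          /* ?00 to ?09 */
    if (low >= 70 && low <= 78)
        return GT_INT;             /* ?70 to ?78 */
    return GT_DOUBLE;              /* most others */
}
```

With these rules the word 13h 01h (275) is classified as an integer, and 340 as a string.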

Funny, the group codes 280 to 289 (8-bit integer values) appear to be stored in two bytes. Okay, now I can correctly scan the three files that I am using as a start. I also added some argument scanning to the main program. My files are now:

common.h.1 and common.c.1
fio.h.1 and fio.c.1
rd_dxf.h.1 and rd_dxf.c.1
scan.c.1

These, and all other source files can be found here.

Collecting interesting strings and doubles

The next step is to collect those strings and doubles from the DXF file which probably also occur in the DWG file. For each of these we will store the positions (= element number) where they occurred in the DXF file.

I took all the strings with group codes 1 to 9 (inclusive), but it turns out that there are many with 9 that are not interesting. Let's leave those out.

I made the scanning part, and decided to print the strings and doubles that are not found in the DWG file, after we have scanned them. This resulted in the file scan.c.2. It appeared that most interesting strings are found, but that many doubles are not found, which surprises me.

Comparing the output of the program for two DWG files showed that some long word values differ. These are probably again the pointers that tell where certain parts of the file start, just as in the DWG R12 format.

Dec 25, 1995: Finding some pointers

I added an extra field to each output line generated by scan.c, which tells the starting position of this line in the file. Then I analyzed the differences between the output for two of the drawings. I found possible pointers at: 13, 30, 35, 44, 48, 53 and 95. Let's check what these are pointing at, and assume that these are borders of certain sections. Now search for doubles and strings that are found in the DXF file, and see whether these can give us some clue about the meaning of certain sections.

The X-coordinate of $EXTMAX is at 335. At 249 there is a $PEXTMIN value. The section starting with the pointer at 95 seems to have something to do with dim-styles.

Not much progress any more. Let's see if we can find more doubles in the DWG file. (Maybe some transformations have been performed, which make the values differ from those in the DXF file.) That does not give us much either. I feel stuck now. What have they done: some kind of encoding (compression) or bit-shifting?

At the end of the file, there is a rather repetitive part. There are pointers in there to one and the same location, which might be the start of the entities, as some of the tables seem to be located more towards the end of the file, judging from the strings I found. The DWG R12 format also had some pointers at the end, but most were in the first part.

Bit-shifting was the right idea! Now I find almost all the doubles, and those that I do not find seem to have a logical reason for being missing (time-stamps and angles in degrees). This is a great step, as it reveals a lot of the structure. A good point to stop for today.
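The idea of searching for a double at every bit offset, instead of only at byte boundaries, can be sketched as follows. This is an illustration and not the code from scan.c; the names byte_at_bit and find_shifted_double are hypothetical. It assumes that bits are numbered from the most significant bit of each byte, and that the file stores doubles in the same byte order as the machine running the program.

```c
#include <stddef.h>
#include <string.h>

/* Read the 8 bits that start bitpos bits into buf (most significant bit first). */
static unsigned char byte_at_bit(const unsigned char *buf, size_t bitpos)
{
    size_t byte = bitpos / 8;
    unsigned shift = (unsigned)(bitpos % 8);

    if (shift == 0)
        return buf[byte];
    return (unsigned char)((buf[byte] << shift) | (buf[byte + 1] >> (8 - shift)));
}

/* Search buf for the 8 bytes of value, trying every bit offset.
 * Returns the bit position of the first match, or -1 if there is none. */
long find_shifted_double(const unsigned char *buf, size_t len, double value)
{
    unsigned char want[sizeof(double)];
    size_t bitpos, i;

    memcpy(want, &value, sizeof want);   /* host byte order, assumed to match the file */
    if (len < sizeof want)
        return -1;
    for (bitpos = 0; bitpos + 8 * sizeof want <= 8 * len; bitpos++) {
        for (i = 0; i < sizeof want; i++)
            if (byte_at_bit(buf, bitpos + 8 * i) != want[i])
                break;
        if (i == sizeof want)
            return (long)bitpos;
    }
    return -1;
}
```

The returned value divided by 8 gives the byte position and the remainder gives the bit offset within that byte.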

For the following files there are newer versions: fio.h.2, fio.c.2, and scan.c.3.

Dec 27, 1995: Looking at the entities (objects)

There is a small bug in scan.c: for a double, I skipped the wrong number of bytes, namely str_len. (Maybe I should conclude that most bugs I make are caused by copying some code from somewhere else in order to quickly write something, and then forgetting to modify it correctly.) For a LINE entity we have:

10-group (starting at 3-bit offset)
2-bit value
20-group (starting at 5-bit offset)
4-bit value
11-group (starting at 1-bit offset)
2-bit value
21-group (starting at 3-bit offset)
20 bytes to the next 10-group of the second line.

In total 1 + 20 + 4 * 8 = 53 (35h) bytes. Maybe this occurs as a length byte, just as in DWG R12. The closest is 2f.

I decided that it was better to compare an empty DWG file with one containing only a single entity, to find out the borders. I got myself some more DWG files, each containing a single line entity with all kinds of different things set (color, line-type, thickness and elevation).

(After some hours of comparing the output files:) It seems that once an entity is added to an empty drawing, a viewport entity is created. The first position at which things start to differ is 4020. I still feel that I am not making much progress.

Dec 28, 1995: Looking at the layers

After yesterday's futile attempts to do something with the entities, I decided that I should start looking at the layers for a change. It seems that many strings start with a byte indicating their length.

After checking the positions of the (suspected) pointers (at 13, 30 ..), I have to conclude that they might not be pointers, or at least not from the beginning of the file.

I found some more information about the structure of the layer table: each record starts with a word specifying the length of the data, followed by the data, which appears to be followed by a CRC word. This appears to be a repeated pattern! Is this another key to how the format is stored? Yes, it seems a rather consistent pattern. It makes me happy to find such a structure, after some days of hopelessly looking around and finding nothing substantial. A good moment to quit for the time being.

Dec 29, 1995: Continuing the search

I adapted scan.c in such a way that it checks the structure I found above. I let it start at the first position that I found. It appears to work, and the position at which it fails is equal to the value given at position 44. This means that it is probably a pointer. Maybe one of the other pointers gives the start position of the records.
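Following a chain of records with this layout (length word, data, CRC word) can be sketched as below. This is not the code from scan.c; the name walk_records is hypothetical. It assumes that the length word is stored low byte first and counts only the data bytes, and it treats a zero length as the end of the chain. The CRC word is skipped and not verified, because I do not know how it is calculated.

```c
#include <stddef.h>

/* Follow records laid out as: length word, data bytes, CRC word.
 * Starts at offset start and returns the offset at which the chain
 * can no longer be followed. *count receives the number of records. */
size_t walk_records(const unsigned char *buf, size_t len, size_t start, size_t *count)
{
    size_t pos = start;

    *count = 0;
    while (pos + 2 <= len) {
        size_t data_len = buf[pos] | ((size_t)buf[pos + 1] << 8);

        /* Stop on an empty record or one that would run past the end. */
        if (data_len == 0 || pos + 2 + data_len + 2 > len)
            break;
        pos += 2 + data_len + 2;   /* length word + data + CRC word */
        (*count)++;
    }
    return pos;
}
```

The returned offset is the position that can be compared against the suspected pointer values.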

Jan 2, 1996: Continuing the search

I checked the beginning of the record chain, and concluded that it is not determined by one of the pointers that I found so far. I decided to determine it, based on the number 0x71d01767 that was found in all files just before (10 bytes) the start of the record chain.
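Locating the start of the record chain from this number can be sketched as follows. The name find_chain_start is hypothetical. Two things are assumptions here: that the number is stored low byte first (as the bytes 67 17 d0 71), and that the 10 bytes are counted from the first byte of the number.

```c
#include <stddef.h>

/* Find the marker 0x71d01767 in buf (stored low byte first, an assumption)
 * and return the offset 10 bytes past the start of the marker,
 * or (size_t)-1 if the marker is missing or too close to the end. */
size_t find_chain_start(const unsigned char *buf, size_t len)
{
    static const unsigned char marker[4] = { 0x67, 0x17, 0xd0, 0x71 };
    size_t i;

    for (i = 0; i + 4 <= len; i++) {
        if (buf[i] == marker[0] && buf[i + 1] == marker[1] &&
            buf[i + 2] == marker[2] && buf[i + 3] == marker[3])
            return (i + 10 <= len) ? i + 10 : (size_t)-1;
    }
    return (size_t)-1;
}
```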

I decided to change the string matching, based on the fact that all strings start with a byte indicating the length.
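Matching a string together with its length byte gives far fewer false hits than matching the characters alone. A minimal sketch, assuming that the length byte counts the characters without any terminator (the name find_prefixed_string is hypothetical):

```c
#include <stddef.h>
#include <string.h>

/* Return the offset of the first place where a length byte is followed
 * by exactly the characters of s, or -1 if there is no such place. */
long find_prefixed_string(const unsigned char *buf, size_t len, const char *s)
{
    size_t n = strlen(s);
    size_t i;

    if (n == 0 || n > 255)          /* must fit in a single length byte */
        return -1;
    for (i = 0; i + 1 + n <= len; i++)
        if (buf[i] == n && memcmp(buf + i + 1, s, n) == 0)
            return (long)i;
    return -1;
}
```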

The first byte of the record seems to indicate the kind of the record. The fourth to seventh bytes look like a long word. No, it seems that the first two bytes contain the kind. I have found the following:

40 40 - TEXT
41 00 - END
41 40 - BEGIN
41 80 - SEQEND
41 C0 - INSERT
42 80 - VERTEX
43 C0 - POLYLINE
44 80 - CIRCLE
44 C0 - LINE
46 C0 - POINT
4C 40 - VIEWPORT
4C C0 - LAYER
4D 40 - STYLE
4E 40 - LTYPE
50 40 - VPORT
50 C0 - APPID
51 40 - DIMSTYLE

From this it seems that the first two bits are always 01, and that the next 8 bits contain the kind. All these results can be found in a new version, scan.c.4.
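Extracting the kind from the first two bytes, under this reading of the table, can be sketched as follows (the name record_kind is hypothetical, and the resulting numbers are only what this bit layout yields, not values from any documentation):

```c
/* Extract the record kind from the first two bytes of a record:
 * two marker bits (expected to be 01) followed by an 8-bit kind.
 * Returns -1 if the marker bits are not 01. */
int record_kind(unsigned char b0, unsigned char b1)
{
    if ((b0 >> 6) != 1)
        return -1;
    /* Low 6 bits of the first byte, then the top 2 bits of the second. */
    return ((b0 & 0x3f) << 2) | (b1 >> 6);
}
```

For example, the bytes 44 C0 (LINE) give kind 19, 44 80 (CIRCLE) gives 18, and 40 40 (TEXT) gives 1. In all entries of the table the remaining six bits of the second byte are zero.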

Jan 9, 1996: Mail from Jason Osgood

Today I received an email from Jason, telling me that the R13 format will change again for R13c4. He also suggested that I should contact Robert McNeel. Jason wrote:

He says his firm will reverse engineer the format and update their DWG routines once the format becomes stable. Robert is very much in favor of having the DWG format in the public domain. (The money he charges for their library of routines is to cover technical support and the like.)

Of course, I sent an email to Robert McNeel.

Jan 17, 1996

It has been some time since I did anything. Reini suggested that they may use some kind of default C++ routines to save and load the records.

I built the above table into the program.

Jan 18, 1996: Checking an idea about AcDbRecord pointers

Yes, my conclusion is right. The kind byte is followed by a length byte, followed by a number of bytes containing the group 5 (or 105) string that comes just before the AcDb??Record found in a group 100 string. This resulted in the following version: scan.c.5.

Jan 19-21, 1996: Developing the software

I have started to rewrite the software for the purpose of analysing the records. I have created two new files, s_dxf.h.1 and s_dxf.c.1, which contain routines for reading all the records stored in the DXF file. The records are recognized by the use of a handle (a 5 or 105 group). Now I still have to adapt scan.c such that it can analyse each record with the information found in the DXF file about the record.

Jan 22, 1996

I continued working on the above, and wrote a function to analyse one record. But when I checked it, some things seemed to be missing. So it contains some bugs.

Jan 23, 1996

Got an email from Robert McNeel. He told me that the R13c4 format will only change in that ACIS 1.5 is replaced by ACIS 1.6. They will only reverse engineer the format by the end of the year. He also told me that Cyco International (in Holland) has reverse engineered the R13 format. I immediately sent out an email to Vincent Everts to ask about this.

Jan 24, 1996

I modified rd_dxf.h and rd_dxf.c such that the groups that are read can be listed to a file, with some options determining whether the position in the file is printed and how the group code is printed. This used to be a debugging option.

Jan 25, 1996: Finishing a new version

I finished a new version. Maybe I should call this version 0.1. This version supports the following options:

-e: extensive output
-np: don't print position in DWG-file.
-dxf<m>: output DXF-file, m = 1 (position) + 2 (code) + 4 (code hexadecimal).
-k<n>: only output records of kind n.
-all: print also data outside records.
-check_used: print information about whether doubles and strings in the DXF file were found in the DWG file.

The files of this version are:

common.h.2 and common.c.1
fio.h.3 and fio.c.2
rd_dxf.h.2 and rd_dxf.c.2
scan.c.6

Jan 27, 1996: Comparing the results for LINE

I adapted scan.c a little, to make the compact output even more compact. I found the following sequences:

8B A0 20 00 05 EC 0b d10 00b d20 1000b d11 00b d21 AA 70 40 CC 14 4C 00000b
8B A0 20 00 15 6C 0b d10 00b d20 1000b d11 00b d21 AA 4C 14 43 D0 10 00000b
8B A0 20 00 15 6C 0b d10 00b d20 1000b d11 00b d21 AA 4C 14 43 D0 10 00000b
8B A0 20 00 15 6C 0b d10 00b d20 1000b d11 00b d21 AA 4C 14 43 D0 10 00000b
8B A0 20 00 15 6C 0b d10 00b d20 1000b d11 00b d21 AA 4C 14 43 D0 10 00001b
8B A0 20 00 15 6C 0b d10 00b d20 1000b d11 00b d21 AA 4C 14 43 D0 28 40 11000b
8B A0 20 00 15 6C 0b d10 00b d20 1000b d11 00b d21 AA 4C 14 43 D0 28 41 00000b
8B A0 20 00 15 6C 0b d10 00b d20 1000b d11 00b d21 AA 4C 14 43 D0 28 41 00000b
8B A0 20 00 15 6C 0b d10 00b d20 1000b d11 00b d21 AA 4C 14 43 F0 40 D0 00000b
8B A0 20 00 15 6C 0b d10 00b d20 1000b d11 00b d21 AA 4C 14 4C 70 41 18 00000b
8B A0 20 00 15 6C 0b d10 00b d20 1000b d11 00b d21 AA 4C 14 4C 70 41 18 00000b
8B A0 20 00 15 6C 0b d10 00b d20 1000b d11 00b d21 AA 4C 14 4C A0 28 40 10000b
8B A0 20 00 15 EC 0b d10 00b d20 1000b d11 00b d21 AA 4C 14 4C 00110b
8B A0 20 00 15 EC 0b d10 00b d20 1000b d11 00b d21 AA 4C 14 4C 10110b
8C A0 20 00 15 20 AC 0b d10 00b d20 1000b d11 00b d21 AA 4C 14 43 D0 10 00000b
8C A0 20 00 14 20 AC 0b d10 00b d20 1000b d11 00b d21 AA 4C 14 43 D4 45 90 10 00000b
84 A0 40 00 14 20 AC 0b d10 00b d20 00b d30 00b d11 00b d21 00b d30 00b d39 A4 C1 44 3D 44 59 01 00 0b
94 A0 20 00 14 20 AC 0b d10 00b d20 1000b d11 00b d21 1000b d39 A4 C1 44 3D 44 59 01 00 0b

The code d10 stands for a double with group code 10. There is only one line starting with `9'. It is unclear what this means. Then we have 'B':1011, 'C':1100, and '4':0100 as the second hex digit. Attempts to figure out what these mean have failed. It almost looks like the last 3 bits are some kind of counter, saying how many special elements there are.

(And this is how far I got, and ever will get!)

Last versions of the files are:

common.h and common.c
fio.h and fio.c
rd_dxf.h and rd_dxf.c
s_dxf.h and s_dxf.c
scan.c

As these are working versions, I cannot give any guarantee that they are correct, nor even that they compile without errors.
