How to crack a Binary File Format

By Frans Faase

Introduction

I often receive request from people to help them with cracking a particular binary file format. It seems that many software vendors are not willing to document the format of the binary files that they are using. Maybe there are good reasons for this. Personally, I feel that users have the right to access the information that is stored in the document that they manipulate with a program that they are legally allowed to use. That is why I do not consider reverse engineering of a binary file format as a illegal act (opposite to what the use of the verb "cracking" is suggesting).

The most likely reason why people ask me for help is because I already have reversed engineerd some binary file formats. So far, I have worked on:

Requirements

Although, cracking a binary file format can be very rewarding, like a form of mental mountain climbing, there are also some basic skills that are absolutely neccessary, if you every want to succeed. These are (in order of importance): Cracking a file format is not easy. It can drive you crazy. Try to avoid it if you can. And, of course, you first search the Internet for at least one day seeing what is already known about the format, or to find out if someone already offers some conversion service or tool to a format that you can read. One place to look at, is "http://www.wotsit.org/" Wotsit, a site dedicated to file formats. Also, I think you should have read the stories of attempts to crack a binary file format, and studied some programs that read binary files. (Follow the links mentioned in the introduction to start with.)

Besides this, you also need:

What is the fun?

Indeed, cracking a format is like climbing a mountain, when you reach the top it gives you a great reward. But it is also like entering the mind of someone else. With every step you come further, you know something more about the program being used to read and write the file. And by this, you also know more about the mind of the person (or persons) who designed that program. However, sometimes this person becomes your enemy, when attempts have been made to prevent total reverse engineering of the format. Yes, those people designing a format sometimes are given the assignment to make reverse engineering difficult. There are many ways to do it. Custom made compression and encrytion are two techniques. Luckily, both these can reduce performance of loading and saving. But there are also small tricks you can use, which make it a lot harder.

Programming style

Although, cracking a binary file format may tempt you to write instant code, you should realize that it is a long and complicated task, which often requires the code to undergo major revisions. Code revisions are much easier if your code is clean and well structured. (The principles of eXtreme Programming do also apply to this kind of programming work.)

Some practical remarks:

How to work

Where to start

The first thing you need to do is to understand the global structure of the binary files. Look at the sample files with a hex-viewer, or write a small hex-dump program, which dumps each next 32 bytes on a separate line, generating an empty line after each 256 bytes. (Alternatively, you can also dump 256 bytes on each line.) You can use one of the following statements to print a single byte:
printf("%02X", byte);
printf("%02X ", byte);
printf("%c%02X", (byte > ' ' && byte < 127)?byte:' ', byte);
printf((byte > ' ' && byte < 127)?" %c":"%02X", byte);
Take care that byte is defined as a unsigned char, otherwis you may get confusing effects. (If you don't understand the above code fragments, you clearly do not match the requirements, so, don't ask me to explain it to you.) Once the global structure has been discovered, you can start writing a C program, which simply dumps the file along this structure. A lot the work, just consists of searching for structure in the hex-dumps and editing them in long lines representing the various records on a single line, in order to discover the structure and determine the meaning of the various parts.

Block structure

Some binary file formats use a block structure where each chunk of data is stored in a number of equal sized blocks. A block size of 256 is often used. Examples of these are MS Word 5.0 and Quark Xpress. Whenever a block structure is used, data structures are often referenced by the block number in which the data start.

Number representation

The most common used manner to represent numbers is to interpret a sequence of bytes as a binary number. Numbers can also be stored as strings, where each character is represented one digit. (These are relatively easy to recognize in a binary file.) Because only a limit set of characters is needed to represent all kinds of numbers (including fractions and exponents), two characters can be put in a single byte. In the past there were even processors that had instructions to perform arithematic operations on such numbers. (The two parts of the byte were often refered to as nibbles.) Nowadays this format is not in use any more (but it might still occur in some binary file format, as these sometimes survive their original platform).

Negative integers

For simple positive numbers interpreting a sequence of bits as a binary number is the rather obvious choice. But for numbers that can be both positive and negative this is not. One interpretation is to consider one of the bits (the first usually) as the sign bit, and the rest of the bits as a binary number. A problem with this interpretation is that there are two representations for zero. Another solution is to interpret a sequence of bits as a binary number, and to substract a large value from it, in case the number is larger than some number. There are two options here, with respect to which numbers is substracted with a certain amouth of available bits. These are called 1's compliment and 2's compliment number representation. Nowadays, the 2's compliment method is most often used. This means that for a byte (8 bits) the number 256 is subtracted, when the number is larger than 127. That means that the unsigned number 128 represent the signed number -128, and the unsigned number 255 represent the signed number -1. (With the 1's compliment number representation, the number 255 is substracted. Again, this introduces two representations for zero, as the unsigned number 255 becomes the signed number 0.) Note that with the 2' compliment the first bit can be seen as the sign bit. The only disadvantage the 2' compliment has is that there is always a negative number that does not have a positive counterpart, e.g., for the byte there is a representation for -128 but not for 128.

An important advantage of the 2-compliments notation is that the arithmetic operations for the signed and unsigned representation are exactly the same. The only time the difference become obvious is when a number is cast to a larger representation form. For example the hexadecimal byte value 0xFF needs to be changed to 0xFFFF, when the signed number -1 is cast from a byte to a word, and has to be changed to 0x00FF, when the unsigned number 255 is cast from a byte to a word. In a binary file format, it can occur that a unsigned byte value is stored as a word with the signed expansion applied. As a result of this it can happen that 127 is stored hexadecimal as 0x007F, where 128 is stored hexadecimal as 0xFF80.

little and big endian

Processors can be divide into little and big endian. This determines the order of the bytes in for short and long integer values. little endian processors start with the least significant byte. Big endian processors start with the most significant byte. The reason why little endian processors start with the least significant by first, is because it gives an advantage for 8 bits processors when having to perform adding and substraction operations. Although this plays no role in current 32 bit and 64 bit processors, some processors are still little endian because of their origin. The Pentium processor is a little endian. The PowerPC (used in Macs) is big endian. This difference is especially important for exchanging binary data between different processors. One should be aware that some applications running on a big endian processor, may use the little endian byte order, because the application was originally developed to run on a little endian processor. Sometimes, long values (32 bits) are stored as two big endian words, with the least significant word first.

floating point formats

comingsoon.gif

Reading files

When reading a binary file, one often has to access the file in a random manner, or at least often look ahead from the current reading position. Direct random access of a binary file is slow. To avoid this, it is best (if size permits this) to store the whole file into a single byte array. (For an example see my CBuf class, which I used for reading Quark Xpress files.) If size does not permit it, you could try the Memory Mapped file option. For an example of this under Windows see the class Mmfile in the "http://www.suneido.com/"Suneido sources that uses the functions CreateFileMapping and MapViewOfFile.

The difference trategy

One way of cracking a binary file format is to make many, preferable small, files with the application that contain small changes compared to each other. For example, you could start to write a document with the application and each time save a document under a different name.

You can use a Binary File Compare utility to compare the files and discover how certain changes are reflected in the binary file, which gives you information about the where the information is stored.

...

comingsoon.gif

(Background image was taken from "http://mathcs.wcsu.ctstateu.edu/joelw/Arch/" here)


My life as a hacker