How to crack a Binary File Format
By Frans Faase
Introduction
I often receive request from people to help them
with cracking a particular binary file format.
It seems that many software vendors are not willing
to document the format of the binary files that they
are using. Maybe there are good reasons for this.
Personally, I feel that users have the right to
access the information that is stored in the document
that they manipulate with a program that they are
legally allowed to use. That is why I do not consider
reverse engineering of a binary file format as a
illegal act (opposite to what the use of the verb
"cracking" is suggesting).
The most likely reason why people ask me for help is
because I already have reversed engineerd some
binary file formats. So far, I have worked on:
Although, cracking a binary file format can be very
rewarding, like a form of mental mountain climbing,
there are also some basic skills that are absolutely
neccessary, if you every want to succeed. These are
(in order of importance):
- A good reason why you want to crack the format at all.
- A lot of time.
- A lot of preseverence.
- A lot of creativity.
- Willingness to do repeative tasks for hours and hours.
- Good understanding of how numbers and strings
are represented in binary manner.
- Excellent command of the C programming language.
Cracking a file format is not easy. It can drive you crazy.
Try to avoid it if you can. And, of course, you first
search the Internet for at least one day seeing what is
already known about the format, or to find out if someone
already offers some conversion service or tool to a format
that you can read. One place to look at, is
"http://www.wotsit.org/" Wotsit, a site dedicated
to file formats.
Also, I think you should have read the stories of attempts
to crack a binary file format, and studied some programs
that read binary files. (Follow the links mentioned in the
introduction to start with.)
Besides this, you also need:
- a computer (quite obvious),
- a simple text editor capable of handling long lines,
- (optional) a hex-viewer or editor,
- a calculator with hexadecimal-to-decimal conversion and
vice-versa functionality,
- a C compiler,
- enough sample files about which you know what data
they contain, and
- (optional) the application which uses the files.
What is the fun?
Indeed, cracking a format is like climbing a
mountain, when you reach the top it gives you
a great reward. But it is also like entering
the mind of someone else. With every step you
come further, you know something more about the
program being used to read and write the file.
And by this, you also know more about the mind
of the person (or persons) who designed that
program. However, sometimes this person becomes
your enemy, when attempts have been made to prevent
total reverse engineering of the format. Yes,
those people designing a format sometimes are
given the assignment to make reverse engineering
difficult. There are many ways to do it. Custom
made compression and encrytion are two techniques.
Luckily, both these can reduce performance of
loading and saving. But there are also small
tricks you can use, which make it a lot harder.
Programming style
Although, cracking a binary file format may tempt
you to write instant code, you should realize that
it is a long and complicated task, which often
requires the code to undergo major revisions. Code
revisions are much easier if your code is clean and
well structured. (The principles of
eXtreme Programming do also
apply to this kind of programming work.)
Some practical remarks:
- Document your code (either through comments
or, better, by giving meaningful names to
variables and procedures).
- Make regular back-ups. Although this should
be a standerd procedure, with cracking a file
format you will see yourself more often going
back to an earlier version of the program
whenever you get stuck in a certain area.
- Build into your program switches (simple
Boolean variables) which allow you to dump
any intermediate data during any stage of
the parsing. You will find that you often
need to go back one step to make a step
forward.
How to work
Where to start
The first thing you need to do is to understand the
global structure of the binary files. Look at the
sample files with a hex-viewer, or write a small
hex-dump program, which dumps each next 32 bytes
on a separate line, generating an empty line
after each 256 bytes. (Alternatively, you can also
dump 256 bytes on each line.)
You can use one of the following statements to
print a single byte:
printf("%02X", byte);
printf("%02X ", byte);
printf("%c%02X", (byte > ' ' && byte < 127)?byte:' ', byte);
printf((byte > ' ' && byte < 127)?" %c":"%02X", byte);
Take care that byte is defined as a unsigned char,
otherwis you may get confusing effects. (If you don't
understand the above code fragments, you clearly do
not match the requirements, so,
don't ask me to explain it to you.)
Once the global structure has been discovered, you
can start writing a C program, which simply dumps
the file along this structure.
A lot the work, just consists of searching for
structure in the hex-dumps and editing them in long
lines representing the various records on a single line,
in order to discover the structure and determine the
meaning of the various parts.
Block structure
Some binary file formats use a block structure
where each chunk of data is stored in a number
of equal sized blocks. A block size of 256 is
often used. Examples of these are MS Word 5.0 and
Quark Xpress. Whenever a block structure is
used, data structures are often referenced by the
block number in which the data start.
Number representation
The most common used manner to represent numbers
is to interpret a sequence of bytes as a binary
number. Numbers can also be stored as strings,
where each character is represented one digit.
(These are relatively easy to recognize in a
binary file.) Because only a limit set of characters
is needed to represent all kinds of numbers (including
fractions and exponents), two characters can be
put in a single byte. In the past there were even
processors that had instructions to perform
arithematic operations on such numbers. (The two
parts of the byte were often refered to as nibbles.)
Nowadays this format is not in use any more (but
it might still occur in some binary file format,
as these sometimes survive their original platform).
Negative integers
For simple positive numbers interpreting a sequence of
bits as a binary number is the rather obvious choice.
But for numbers that can be both positive and negative
this is not. One interpretation is to consider one of the
bits (the first usually) as the sign bit, and the rest of
the bits as a binary number. A problem with this
interpretation is that there are two representations for
zero. Another solution is to interpret a sequence of
bits as a binary number, and to substract a large
value from it, in case the number is larger than some
number. There are two options here, with respect to
which numbers is substracted with a certain amouth of
available bits. These are called 1's compliment and 2's compliment
number representation. Nowadays, the 2's compliment method
is most often used. This means that for a byte (8 bits)
the number 256 is subtracted, when the number is larger
than 127. That means that the unsigned number 128
represent the signed number -128, and the unsigned number
255 represent the signed number -1. (With the 1's compliment
number representation, the number 255 is substracted.
Again, this introduces two representations for zero, as the
unsigned number 255 becomes the signed number 0.)
Note that with the 2' compliment the first bit can be
seen as the sign bit. The only disadvantage the 2' compliment
has is that there is always a negative number that does not
have a positive counterpart, e.g., for the byte there is
a representation for -128 but not for 128.
An important advantage of the 2-compliments notation is
that the arithmetic operations for the signed and unsigned
representation are exactly the same. The only time the
difference become obvious is when a number is cast to a
larger representation form. For example the hexadecimal
byte value 0xFF needs to be changed to 0xFFFF, when the
signed number -1 is cast from a byte to a word, and has
to be changed to 0x00FF, when the unsigned number 255 is
cast from a byte to a word.
In a binary file format, it can occur that a unsigned byte
value is stored as a word with the signed expansion applied.
As a result of this it can happen that 127 is stored
hexadecimal as 0x007F, where 128 is stored hexadecimal
as 0xFF80.
little and big endian
Processors can be divide into little and big endian.
This determines the order of the bytes in for short
and long integer values. little endian processors
start with the least significant byte. Big endian
processors start with the most significant byte.
The reason why little endian processors start with
the least significant by first, is because it gives
an advantage for 8 bits processors when having to
perform adding and substraction operations. Although
this plays no role in current 32 bit and 64 bit
processors, some processors are still little endian
because of their origin. The Pentium processor is
a little endian. The PowerPC (used in Macs) is
big endian. This difference is especially important
for exchanging binary data between different processors.
One should be aware that some applications running on
a big endian processor, may use the little endian
byte order, because the application was originally
developed to run on a little endian processor.
Sometimes, long values (32 bits) are stored as two
big endian words, with the least significant word
first.
floating point formats
Reading files
When reading a binary file, one often has to access
the file in a random manner, or at least often look
ahead from the current reading position.
Direct random access of a binary file is slow. To
avoid this, it is best (if size permits this) to
store the whole file into a single byte array.
(For an example see my CBuf class, which
I used for reading Quark Xpress
files.)
If size does not permit it, you could try the
Memory Mapped file option. For an example of
this under Windows see the class Mmfile
in the "http://www.suneido.com/"Suneido
sources that uses the functions CreateFileMapping
and MapViewOfFile.
The difference trategy
One way of cracking a binary file format is to make
many, preferable small, files with the application
that contain small changes compared to each other.
For example, you could start to write a document with
the application and each time save a document under
a different name.
You can use a Binary File Compare
utility to compare the files and discover how certain
changes are reflected in the binary file, which gives
you information about the where the information is
stored.
...
(Background image was taken from
"http://mathcs.wcsu.ctstateu.edu/joelw/Arch/"
here)
My life as a hacker