Compression (Part I)
- The Huffman Algorithm -
by Lord Julus / 29A
January 2000


Foreword

Hello people, and welcome to the first breath of the much talked about Y2K... It looks to me like everything worked out just fine, so I decided to move my butt and start writing something again, after quite a while.

You might consider this article a little off-topic for the virus writing field, but things are not at all like that. The rise of Win32 programming and the few ways of stealthing code within the portable executable generated a quest for inserting code into the PE without modifying the size of the file. This was done by many authors in many ways, like overwriting a useless section or hiding in the slack space which remains after rounding to the file alignment. Others, however, chose to use a compression algorithm to shrink a portion of the file and insert their code into the space gained. This was done for example by JackyQwerty in the Redemption virus with his JQCoding algorithm, or by Vecna in Babylonia. The examples are probably many, but this was just to show you the link...

Now, in the world there exist many compression algorithms, some better than others. A few of them have risen to being recognized by everybody as the best ones. Let me name a few: the Lempel-Ziv algorithms (also known as LZ), the Shannon-Fano algorithm, the Run-Length Encoding algorithm, the Huffman algorithm. It is very hard to say which is the best because it depends very much on the raw data it works on. However, as a starter I decided to focus on the Huffman algorithm. Unfortunately it is not quite the best of them, as it gives lower compression ratios than most of the LZ algorithms. But still, it is a very interesting algorithm and it deserves all attention.

So, in this article I will present the basic theory behind the Huffman algorithm with examples, and then show you how I implemented this method in one of my latest releases: Lord Julus' Huffman Compression Engine (LJHCE32).

Theory

The Huffman algorithm is based on the following idea: code the characters (which are normally coded on 8 bits) on fewer or more bits, so that on average the total number of bits will be smaller. So, this algorithm, unlike others, doesn't work on strings but on single characters, and this is why its results are not very good.

The first idea anyone would have is to encode each of the characters that appear in the input buffer with the least bits possible. Let's imagine such an encoding:

	A = 1
	B = 0
	C = 01
	D = 10
	E = 11, etc...

However, this is a non-decodable codification, because an encoding like 111101 can be decoded as AAAAC, or EEC, or EAABA, etc... So, this shows that a special method must be used to obtain a good, decodable code. Actually the real problem does not come from the way the codes look, but from the fact that they are not uniquely decodable when read from one side to the other; you would need a mark set between them so that at decompression you can tell which code is which. Of course, using a separator between the codes is out of the question. Another method must be found.

Here is the first example that comes into my mind of unique codes:

	A = 1
	B = 01
	C = 001
	D = 0001

If you want to encode the string ABCDAABC you get:

	10100100011101001

If you read bit by bit you can remove the bits as they match a combination in the definition table above... Of course, this is just a simple example, but it shows one of the important properties of the Huffman codes: they have a unique form when read from left to right. The idea is to take the encoded stream bit by bit and match it against the codes in the definition table. Once you have matched one code you can be absolutely sure it's the right one, because no code is the prefix of another.

Of course, this is only the first part of the problem: how to get unique codes. The other part of the problem is: which code should each character get?

Here we come to the second idea of the Huffman algorithm, and that is the percentage of appearance. Let us consider the following string:

	AABCEDAAABCDEAABCCCCDAA

Let's see how many times each character appears in the string:

	A - 9
	B - 3
	C - 6
	D - 3
	E - 2
    ----------
    Total - 23

Now, let's transform each appearance count into a percentage relative to the total, following the formula:

	appareance * 100
	----------------
	     total


	A -  39.13%
	B -  13.04%
	C -  26.09%
	D -  13.04%
	E -   8.70%
    ---------------
    Total - 100.00%

The next step is to order our characters descending by the percentage of appearance. After this we move on with the following idea: we add the two smallest values and obtain a composite position with a percentage equal to their sum. We rearrange the positions in descending order and keep going until we reach 100%:

 .------------------------------------------------------------------------.
 |  Phase 1   | Phase 2     | Phase 3      | Phase 4       | Phase 5      |
 |------------+-------------+--------------+---------------+--------------|
 | A - 39.13% | A  - 39.13% | A   - 39.13% | BCDE - 60.87% | ABCDE - 100% |
 | C - 26.09% | C  - 26.09% | BDE - 34.78% | A    - 39.13% |--------------'
 | B - 13.04% | DE - 21.74% | C   - 26.09% |---------------'
 | D - 13.04% | B  - 13.04% |--------------'
 | E -  8.70% |-------------'
 '------------'

Now we are well into the algorithm and almost at the end. All we need to do is to associate some bits with the characters, and this is done by taking from each phase the last two values and giving a 0 to the smaller one and a 1 to the larger one (not counting the last phase, which holds the 100%):

 .-----------------------------------------------.
 |  D -  1  |  DE - 1  |  BDE - 1  |  BCDE -  1  |
 |  E -  0  |  B  - 0  |  C   - 0  |  A    -  0  |
 '-----------------------------------------------'

And now, let's read off the codes. Start from the last phase and move towards the first; wherever a group containing the character appears, append the bit assigned to that group to the character's code. We get:

		A = 0
		B = 110
		C = 10
		D = 1111
		E = 1110

Ok, that's it!! We got our huffman codes!!
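To make the whole procedure concrete, here is a minimal C sketch (illustrative only - the actual engine is written in assembly, as you will see later) that reproduces the example: it repeatedly merges the two smallest groups, gives a 0 to the smaller one and a 1 to the larger one, and collects the bits so that bits from later phases end up on the left. Ties are broken towards the later entry so that the output matches the codes above; any other tie-breaking would still give a valid code.

        /* huffcodes.c - illustrative sketch of the code-building procedure */
        #include <stdio.h>

        #define NSYM 5

        int main(void)
        {
            const char sym[NSYM]     = { 'A', 'B', 'C', 'D', 'E' };
            double     gweight[NSYM] = { 39.13, 13.04, 26.09, 13.04, 8.70 };

            unsigned short code[NSYM] = { 0 }; /* bits collected so far            */
            unsigned short len[NSYM]  = { 0 }; /* how many of those bits are valid */
            int group[NSYM];                   /* group each symbol belongs to     */
            int ngroups = NSYM;

            for (int i = 0; i < NSYM; i++)
                group[i] = i;                  /* every symbol starts alone        */

            while (ngroups > 1) {
                /* find the two groups with the smallest percentages */
                int a = 0, b = -1;
                for (int g = 1; g < ngroups; g++)
                    if (gweight[g] <= gweight[a]) a = g;
                for (int g = 0; g < ngroups; g++) {
                    if (g == a) continue;
                    if (b < 0 || gweight[g] <= gweight[b]) b = g;
                }
                /* smaller group gets a 0, larger gets a 1; each new bit lands
                   to the left of the bits assigned in the earlier phases      */
                for (int i = 0; i < NSYM; i++) {
                    if (group[i] == a)      len[i]++;                       /* a 0 bit */
                    else if (group[i] == b) { code[i] |= 1 << len[i]; len[i]++; }
                }
                /* merge group a into group b, then reuse slot a for the last group */
                gweight[b] += gweight[a];
                for (int i = 0; i < NSYM; i++)
                    if (group[i] == a) group[i] = b;
                ngroups--;
                if (a != ngroups) {
                    gweight[a] = gweight[ngroups];
                    for (int i = 0; i < NSYM; i++)
                        if (group[i] == ngroups) group[i] = a;
                }
            }

            /* prints A = 0, B = 110, C = 10, D = 1111, E = 1110 */
            for (int i = 0; i < NSYM; i++) {
                printf("%c = ", sym[i]);
                for (int bit = len[i] - 1; bit >= 0; bit--)
                    putchar(((code[i] >> bit) & 1) ? '1' : '0');
                putchar('\n');
            }
            return 0;
        }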

After all this is done we just need to take the input buffer and, instead of each character, store its code. Let's use the above codification and compress the string: AABCEDAAABCDEAABCCCCDAA

	00110101110111100011010111111100011010101010111100

Dividing it into bytes we get:

	00110101 = 35h
	11011110 = DEh
	00110101 = 35h
	11111100 = FCh
	01101010 = 6Ah
	10101111 = AFh
	00______ = 00h

So we compressed a 23 byte long string into a 7 byte long string - the output is roughly 30% of the input size, a reduction of about 70%.

This looks a little bit too easy, right? Yep... And I will explain why in a second... You noticed above how we encoded each character with a code, and did you notice how their sizes were variable? Then, how can we decode the compressed string? Answer: we *must* save the character definitions into the compressed output buffer. This is one of the biggest problems... how to store the definition table without affecting the output too much? Anyway, before looking into that, let's see how we can decompress our string:

	00110101110111100011010111111100011010101010111100

We start from left. We say: 0, is it in the definition table? Yes, it's A. We output A and delete the 0. Next bit, 0 again... Next bit: 1. Is it in the definition table? No! Read one more: 11. Is it in the definition table? No, read one more: 110. Is it in the definition table? Yes! It's a B, output the B and go on. Briefly:

	00110101110111100011010111111100011010101010111100
	AAB  C E   D   AAAB  C D   E   AAB  C C C C D	AA
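The same walk can be sketched in C (again just an illustration, not the LJHCE32 code): grow a bit pattern from the packed bytes and compare it against the code table, emitting a character on every match. Note that the decoder needs the original size (we will store it in a header later) to know where the padding bits at the end of the last byte begin.

        /* hufdecode.c - illustrative sketch of the decoding walk above */
        #include <stdio.h>

        int main(void)
        {
            /* the codes from the example, read over 'len' bits, leftmost bit first */
            const char     sym[5]  = { 'A', 'B', 'C', 'D', 'E' };
            const unsigned code[5] = { 0x0, 0x6, 0x2, 0xF, 0xE };  /* 0 110 10 1111 1110 */
            const unsigned len[5]  = { 1, 3, 2, 4, 4 };

            /* the compressed buffer from the example: 35h DEh 35h FCh 6Ah AFh 00h */
            const unsigned char buf[7] = { 0x35, 0xDE, 0x35, 0xFC, 0x6A, 0xAF, 0x00 };
            const unsigned orig_size   = 23;  /* tells us when to stop (the padding!) */

            unsigned cur = 0, curlen = 0, produced = 0;
            for (unsigned bitpos = 0; produced < orig_size; bitpos++) {
                unsigned bit = (buf[bitpos / 8] >> (7 - bitpos % 8)) & 1;
                cur = (cur << 1) | bit;       /* one more bit, appended on the right */
                curlen++;
                for (int i = 0; i < 5; i++)
                    if (curlen == len[i] && cur == code[i]) {
                        putchar(sym[i]);      /* matched a code: output the char */
                        cur = 0; curlen = 0;
                        produced++;
                        break;
                    }
            }
            putchar('\n');                    /* prints AABCEDAAABCDEAABCCCCDAA */
            return 0;
        }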

Easy decompression!! Now let's focus a little on the data structures we need to use in order to implement such an algorithm.

Data structures

Ok, before going to the data structures themselves, let's try and see what we really need to store. First of all, we are working on an alphabet made up of at most 256 characters. For each character we must store its code (1 byte) and its percentage of appearance (which we will discuss later - 1 double word). Also, note that after each phase the composite codes grow until the last step, where they form the total alphabet... This would make us need a huge matrix of 256*256*9 = 589824 bytes!!!! Of course that is totally bad!

Let's see the structures used:

	element struc
		value dd ?
		start dd ?
	element ends
	element_size equ size element

	list struc
	     char db ?
	     next dd ?
	list ends
	list_size equ size list

The first structure has two fields: 'value' represents the percentage of appearance and 'start' is a pointer to an element of the second structure. The second structure holds the character definition and a pointer to the next character. Defining them goes like this:

       chars label byte
	I=0
	REPT 256
	   db I 	  ; character ascii code
	   dd 0 	  ; null pointer in the beginning
	   I=I+1
	ENDM

To define the huffman table we simply write:

        huffman element 256 dup(<0>)

The size of the data so far is then 256*13 = 3328 bytes. Quite a decrease, right? Ok, the above structures will be used to generate the huffman tree. We need one more structure to store the actual huffman codes:

	huffman_code struc
		     code     dw ?
		     huff_len dw 0
	huffman_code ends
	huffman_size equ size huffman_code

        codes huffman_code 256 dup(<0>)

As you can see, the huffman code is stored on a maximum of 16 bits. The next value represents the length of the code. This adds another 256*4 = 1024 bytes, making our data total 4352 bytes.

For the beginning we will have the data set as follows:

	huffman table	       chars table
        value:0, start:------> char:0  , next=0
        value:0, start:------> char:1  , next=0
	...
        value:0, start:------> char:255, next=0

This being initialized, we read the input buffer from beginning to end and increment the value corresponding to each character. After doing so, we transform the number of appearances into a percentage. This is the reason why we need to store these values on double words: when calculating the percentage we must keep all the decimals, so we need to use the FPU:

	fild dword ptr [value]	  ; load the value
	fild dword ptr [_100]	  ; load 100
	fmul			  ; multiply
	fild dword ptr [length]   ; load length
	fdiv			  ; divide
	fstp dword ptr [value]	  ; and store the floating number
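For reference, the same counting and scaling step written as a C sketch (illustrative only; the buffer and length parameters are just assumptions made for the example):

        /* one pass over the input to count appearances, then appearance*100/total */
        #include <stddef.h>

        void count_and_scale(const unsigned char *buf, size_t length, float value[256])
        {
            unsigned long count[256] = { 0 };

            for (size_t i = 0; i < length; i++)   /* first pass over the input buffer */
                count[buf[i]]++;
            for (int c = 0; c < 256; c++)         /* appearance * 100 / total         */
                value[c] = (float)count[c] * 100.0f / (float)length;
        }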

After doing so, we sort the first matrix descending. Let's consider our example above:

	A -  39.13%
	B -  13.04%
	C -  26.09%
	D -  13.04%
	E -   8.70%


	huffman table		   chars table
        value:39.13, start:------> char:'A', next=0
        value:26.09, start:------> char:'C', next=0
        value:13.04, start:------> char:'B', next=0
        value:13.04, start:------> char:'D', next=0
        value: 8.70, start:------> char:'E', next=0

So, we have each pointer pointing to the right character in the other matrix (note that you don't need to rearrange the second matrix, as you work on pointers to its elements; you only rearranged the pointers in the first matrix).

Then, this is how the first phase looks:

	huffman table		   chars table
        value:39.13, start:------> char:'A', next=0
        value:26.09, start:------> char:'C', next=0
        value:21.74, start:------> char:'D', next:---->char:'E',next=0
        value:13.04, start:------> char:'B', next=0

So, we added the values of the last two elements, reordered the first matrix and made the 'next' pointer of the first of them point to the second element, basically creating a chain within the second list. We go on with the next phase:

 huffman table		    chars table
 value:39.13, start:------> char:'A', next=0
 value:34.78, start:------> char:'D', next->char:'E',next->char:'B',next=0
 value:26.09, start:------> char:'C', next=0

And the final phase:

 huffman table		    chars table
 value:60.87  start:------> char:'D', next->
                                      char:'E', next->
                                                char:'B', next->
							  char:'C',next=0
 value:39.13, start:------> char:'A', next=0

As we have gone this far, I hope you got the picture: we have two parallel arrays, one holding the percentage and a pointer to the first character. Each character is either alone or the end of a list, and thus its pointer is null, or it is followed by other characters, and thus its pointer points to the next character. At every phase you simply have to locate the last two items in the first list and then follow the chain of characters until you reach a null pointer. For the higher percentage you must set a 1 and for the lower one you must set a 0. The setting occurs in the third matrix defined, the one holding the huffman_code structures. For each character you set the specific bit and increase the length. Beware that the bits must be added towards the left part of the code.

Ok, this was the hard part: generating the huffman tree and creating the codes. From here on it goes very simply: just read each character from the input buffer and output the corresponding huffman code to the output buffer. This is done easily using an algorithm that concatenates variable-length bit strings.
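As an illustration, here is a small bit-packing sketch in C (not the LJHCE32 assembly itself): it appends each code to the output buffer, leftmost bit first, and reproduces the seven bytes of the worked example. Whatever is left of the last byte simply stays zero, which is one more reason why the original size has to be stored in the header.

        /* hufpack.c - illustrative sketch of packing variable-length codes */
        #include <stdio.h>

        static unsigned char out[64];
        static unsigned      outbits = 0;            /* bits already written to 'out' */

        static void put_code(unsigned code, unsigned len)
        {
            for (int b = (int)len - 1; b >= 0; b--) {  /* emit the bits left to right */
                if ((code >> b) & 1)
                    out[outbits / 8] |= 0x80 >> (outbits % 8);
                outbits++;
            }
        }

        int main(void)
        {
            /* per-character code table, indexed by ASCII code (only A..E filled in) */
            unsigned code[256] = { 0 }, len[256] = { 0 };
            code['A'] = 0x0; len['A'] = 1;
            code['B'] = 0x6; len['B'] = 3;
            code['C'] = 0x2; len['C'] = 2;
            code['D'] = 0xF; len['D'] = 4;
            code['E'] = 0xE; len['E'] = 4;

            const char *in = "AABCEDAAABCDEAABCCCCDAA";
            for (const char *p = in; *p; p++)
                put_code(code[(unsigned char)*p], len[(unsigned char)*p]);

            /* prints: 35 DE 35 FC 6A AF 00 - seven bytes for a 23 byte input */
            for (unsigned i = 0; i < (outbits + 7) / 8; i++)
                printf("%02X ", out[i]);
            putchar('\n');
            return 0;
        }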

You have probably already noticed a weakness in the Huffman encoding: the need to read the input buffer twice, once to get the percentages and once to actually compress.

Also, please note that in order to be able to decompress you must store at the beginning of the compressed data all the information required. To begin with, you should store:

	       sign	       dw 0  ; a file signature
	       ver	       dw 0  ; the version of the compressor
	       orig_size       dd 0  ; the original file size
	       comp_size       dd 0  ; the compressed file size
	       file_crc        dd 0  ; the crc of the file
	       dic_size        dw 0  ; the dictionary size

After this the dictionary itself should follow. My way was to store each character followed by its code length and then by the code itself. This provides an easy way of reading the dictionary at decompression.
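For illustration, here is how such a dictionary could be written out, sketched in C; the field widths used here (one byte for the length, two bytes for the code) are my own assumption, so check the LJHCE32 source for the real on-disk layout:

        /* one possible dictionary layout: character, code length, then the code */
        #include <stddef.h>

        size_t write_dictionary(unsigned char *dst,
                                const unsigned short code[256],
                                const unsigned short len[256])
        {
            size_t n = 0;
            for (int c = 0; c < 256; c++) {
                if (len[c] == 0)
                    continue;                          /* character never appears  */
                dst[n++] = (unsigned char)c;           /* the character itself     */
                dst[n++] = (unsigned char)len[c];      /* its code length, in bits */
                dst[n++] = (unsigned char)(code[c] & 0xFF);  /* code, low byte     */
                dst[n++] = (unsigned char)(code[c] >> 8);    /* code, high byte    */
            }
            return n;                                  /* bytes written = dic_size */
        }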

These are the basic structures needed to create a Huffman compressor without being forced to use big amounts of memory. This algorithm, however, is far less optimal than the LZ family, where first of all one pass over the input buffer is enough and there is no need to store information like the dictionary. However, this algorithm is most useful when combined with LZ methods (as ARJ or PKZip do).

Furthermore, let's take a look at some of the most important coding parts of a Huffman compression algorithm, as I coded them in my LJHCE32 compression engine.

Coding it

While coding a compression algorithm there are a few goals that go without saying:

  1. Best compression
  2. Good speed in compression
  3. Almost unnoticeable at decompression

Also, the decompression routine should be as small as possible (with the idea of creating a self-extractor).

In order to have a good compression speed, one of the most important parts is the sorting algorithm. In my code I used a bubble sort, which works pretty well.
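For reference, the classic bubble sort over the (value, start) pairs, descending by value, looks like this in C; the structures mirror the assembly ones defined earlier, but the code itself is only an illustrative sketch:

        /* bubble sort of the huffman table, descending by percentage */
        struct list    { unsigned char ch; struct list *next; };
        struct element { float value; struct list *start; };

        void sort_descending(struct element *e, int n)
        {
            for (int pass = 0; pass < n - 1; pass++)
                for (int i = 0; i < n - 1 - pass; i++)
                    if (e[i].value < e[i + 1].value) {   /* out of order: swap them */
                        struct element tmp = e[i];
                        e[i]     = e[i + 1];
                        e[i + 1] = tmp;
                    }
        }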

An almost unnoticeable decompressor is one which manages to locate the codes as quickly as possible. A good routine is needed, though, for checking the input data and matching it against the dictionary. Unfortunately, in my LJHCE32 I used a not so optimized algorithm and most of the time the decompression takes longer than the compression (which is not normal). I will upgrade it in a newer version.

Also, your routines must take care of all errors that might occur and return the proper error code. The compression routine should notice if the output is bigger than the input and act accordingly. The decompression routine should check the integrity of the huffman table and codes, and also check that the CRC of the decompressed data matches the initial one.

For the complete implementation check out the source code for the LJHCE32 compression engine.

Final word

I do consider compression a great tool that helps in virus writing. This is because you get more from one shot: you get encryption, you get more available space, and the decoding is very difficult unless you know exactly the decompression algorithm. I think everybody should consider writing compression engines and using them in their code. As far as I know there already exist a few freeware libs with pretty good compression engines, like ApLib, for example.

If somebody manages to write a compression engine using the huffman algorithm, please send it over to me too... I am always eager for new stuff...

All the Best,

Lord Julus/29A (Jan.2000)