Reversing LZ91 from Commander Keen

I've been in a bit of a rut with reversing recently and I thought I'd go back to something that I reversed years ago, has a little bit of complexity, but is easy enough that I can focus on extracting one part at a time and getting something turned around in a day or two. I'm going to post the disassembly here, you can follow along at home by getting a copy of Commander Keen yourself (shareware, but also available on any good only game store). For those who are interested in the history of LZ91, take a look at Fabrice Bellard's site.

Let's start from the beginning by opening this up in your favorite disassembler:

image.png

We'll go through instruction by instruction so you can follow along at home. Here's lines 1-3:

push    es
push    cs
pop     ds

We've run into our first problem: what is the value of ES when we start a DOS app? According to OSDev.org, DS and ES both point to the Program Segment Prefix (PSP), a 256 byte (0x100) structure that gets loaded at the bottom of memory, followed directly by the program we've just loaded. We don't need to know much else at this point because you'll notice we never pop this back off the stack.

After this, we push CS and pop it into DS. We do this for two reasons:

  1. You can't MOV directly between segments
  2. We want to access some data from the current segment so we need to copy that into DS

So far so good. What's the next big chunk for then?

mov     cx, word_1C66C
mov     si, cx
dec     si
mov     di, si
mov     bx, ds
add     bx, word_1C66A
mov     es, bx
std
rep movsb

We see a bunch of things happening here:

  1. CX gets set (to 0x176)
  2. This value gets decremented and copied into both SI and DI (SI = DI = 0x175)
  3. We load DS into BX, add a value (0x0C14) and then put this into ES (ES = DS + 0x0C14)
  4. We call the STD command and finally do a REP MOVSB

This looks a lot like a memmove command. Normally we copy from the bottom though, like this:

mov cx, 0x176
xor si, si
xor di, di
rep movsb

And at the end both SI and DI point to 0x176, although they last copied a byte to 0x175. Here we're doing this in reverse order. SI and DI start pointing to 0x175, they finish pointing to 0xFFFF, but the final copy is to 0x0000 so we're in the same position.

Why are we doing the copy backwards though? Let's step through manually and it might make sense.

We can assume CS = DS = 0x0000 to make the math easier, and print it like this:

source = 0000:0000; // mem location 0x00000
destination = 0C14:0000; // mem location 0x0C140
count = 0x176;
memmove(destination, source, count);

So we're copying from a low memory location to a higher memory location. The reason we do the copy in reverse is because for a very small file, the memory regions might overlap. We're at the start of the loader so we don't care, but we might override some of the later parts of the loader. Let's try a small example and see what happens:

source = 0x00000
destination = 0x00004
count = 0x10

If we put some junk in memory we can follow this through forwards

// starting case
mem = ABCDEFGHIJKLMNOP0000
// one step
mem = ABCDAFGHIJKLMNOP0000
// two steps
mem = ABCDABGHIJKLMNOP0000
// three steps
mem = ABCDABCHIJKLMNOP0000
// four steps
mem = ABCDABCDIJKLMNOP0000
// five steps
mem = ABCDABCDAJKLMNOP0000

It's broken now! We're now just copying the data that we just copied. If we do this in reverse though:

// starting case
mem = ABCDEFGHIJKLMNOP0000
// one step
mem = ABCDAFGHIJKLMNOP000P
// two steps
mem = ABCDEFGHIJKLMNOP00OP
// three steps
mem = ABCDEFGHIJKLMNOP0NOP
// four steps
mem = ABCDEFGHIJKLMNOPMNOP
// five steps
mem = ABCDEFGHIJKLMNOLMNOP
// ...
// 15 steps
mem = ABCDEBCDEFGHIJKLMNOP
// 16 steps
mem = ABCDABCDEFGHIJKLMNOP

This way we still destroy the data in the middle, but we have a perfect copy at the new destination. It doesn't matter for our analysis why it's done this way, but it's also interesting asking ourselves why things are done a certain way, especially if it relates to code further down the track.

So now we're done with this, what do we have left?

push    bx
mov     ax, 2Bh ; '+'
push    ax
retf

Pushing data before a return is a classic sign of an indirect jump. BX still points to our new segment that we just copied the data to, so we're jumping to ES:002B. We can follow along in our disassembler by just jumping to 002B in the old code because we know it hasn't been overwritten.

Here's where I got to with comments:

image.png

One thing I left out is how we know we're relocating the whole loader... the trick is to look at where our code is. The MZ header loads us at 0C66:000E, and we're copying all of this segment. When we jump to the next function we can see it only goes to 0C66:0176 (including the data on the end). Not the best explanation, I know, and if this doesn't make sense then take a closer look in your own disassembly to see where we got this number from.

Anyway, let's look at function number 2 here:

image.png

This is going to be fun. Let's start from the top and see if we can make sense of it as we go:

mov     bp, word ptr cs:byte_1C660+8
mov     dx, ds

If we tidy this up we can see it loads 0x0C66 into BP, and loads DS into DX. We know that 0x0C66 is our original CS (but not relocated, this is important). We also haven't touched DS since our memmove so we know that this is the CS that we got loaded to (this has been relocated). In other words, BP = 0x0C66, DX = 0x0C66 + relocation_offset.

We then hit the start of a loop:

loc_1C692:
mov     ax, bp
cmp     ax, 1000h
jbe     short loc_1C69C

BP is our original un-relocated CS, and it looks like our loader gets loaded at the end of the packed data. In other words, this could also be the size in paragraphs (0x10 byte chunks, what the segment register operates in) of our packed data. We compare it to 0x1000, and if it's greater, then:

mov     ax, 1000h

So this does seem like a counter that goes in chunks of 0x1000 at a time. The next chunk is where the magic happens:

loc_1C69C:
sub     bp, ax
sub     dx, ax
sub     bx, ax

If BP is the total we were meant to do then this lines up with our theory - we do this in chunks of up to 0x1000 at a time and here we're reducing BP (our counter for total amount of work to do), DX (our original CS), and BX (the segment that we relocated the loader to). We just relocated the loader further up in memory, and it looks like we might be about to do the same with the packed code:

mov     ds, dx
mov     es, bx

The MOVS commands go from DS:SI to ES:DI, so it does appear that we're just moving the packed code up to line up with our loader (now that we've jumpeed up here we can overwrite ourselves)

mov     cl, 3
shl     ax, cl

This seems odd. We're setting CX to AX 8. We would expect it to be AX 0x10 if we were doing MOVSB (since we know that AX is a number of paragraphs to copy, or multiples of 0x10). If we look further down we see a REP MOVSW, and 0x10 bytes is the same as 8 words, so this does seem to line up correctly.

mov     cx, ax
shl     ax, 1
dec     ax
dec     ax
mov     si, ax
mov     di, ax
rep movsw

This is the same pattern we saw before, but we need to decrement AX twice because our offsets are going in 2's because we're moving words, not bytes. This is also why we had to decrement our segments, because we need our offsets to start high and drop all the way down to 0.

or      bp, bp
jnz     short loc_1C692

And this is the bottom of the loop - if we had more than 0x1000 paragraphs to copy then go back and do the next chunk. For Commander Keen we only have 0x0C66 so we only run through this loop once.

Here are my notes so far:

image.png

One thing to keep in mind - none of this relocation matters for us when we write our unpacker because we're just reading from disk.

Anyway, next part:

image.png

We start with this:

cld

This sets the direction to forwards, so our MOVS/LODS/STOS commands move SI/DI forwards instead of backwards.

mov     es, dx
mov     ds, bx
xor     si, si
xor     di, di

We're swapping our segments around, we were copying from low to high, and now we're copying from high to low. Looks like we're going to start overwriting our original packed code with unpacked code as we go along. We've also cleared out SI and DI so we are definitely starting from the beginning.

mov     dx, 10h
lodsw
mov     bp, ax

We've loaded a word into BP and set DX to 0x10. I wonder what this means?

loc_1C6C9:
shr     bp, 1
dec     dx
jnz     short loc_1C6D3

Oh this makes sense - BP holds 16 bits, and we can shift them out one at a time. Here we've shifted a bit out of BP, and then decremented DX. If DX isn't zero then we skip the next bit, but if it DX is zero then we've run out of bits and:

lodsw
mov     bp, ax
mov     dl, 10h

We fill up again and reset DL (we know DH is 0).

loc_1C6D3:
jnb     short loc_1C6D8

This is weird. What does JNB do? It turns out this jumps if CF is 0, and what sets the CF? Normally we expect arithmetic to set/clear all the flags, but it the DEC command doesn't set CF (it sets others though, like OF and ZF). That means we're testing the result of the SHR command earlier - we're checking if the bit we shifted out was 0. If it was 1 (CF = 1 so the JNB fails) then we do this:

movsb
jmp     short loc_1C6C9

And we copy a byte across uncompressed, and loop back to the top.

From what we know about LZ style compression, we normally copy across the first few bytes uncompressed and then start to get the benefits of our compression as we unpack more and more. Because of this, we'd expect the first few bytes to be uncompressed. Let's take a look at what the data shows, right at the start of the file after the header:

image.png

Just as we thought: the first word that gets loaded is 0xFFFF (16x 1 bit), so it copies across 16 bytes uncompressed. Note that we see 15 uncompressed bytes and then 0xFFFF, this is because we refresh our bits the moment they run out (after the 16th bit), we don't wait until after the operation is done.

Here are the notes so far:

image.png

So if we get a 1 bit then we copy a byte across, but what happens if we get a 0 bit?

image.png

image.png

loc_1C6D8:
xor     cx, cx
shr     bp, 1
dec     dx
jnz     short loc_1C6E4
lodsw
mov     bp, ax
mov     dl, 10h
loc_1C6E4:
jb      short loc_1C708

We clear CX and get another bit. We refill if needed (this happens every time so this is the last time I'll show this refill part), and if CF is set then we jump right. We'll handle that next, first off, let's handle the CF=0 case (so when we get 00 bits).

shr     bp, 1
; refill if needed
rcl     cx, 1
shr     bp, 1
; refill if needed
rcl     cx, 1
inc     cx
inc     cx

We set CX to 0 earlier, and the RCL command rotates in bits from the CF, letting us stack these up on the bottom. We basically copy the next 2 bits from BP into CX, then add 2 from here. This set CX to a number from 2-5 based on our bits in BP.

lodsb
mov     bh, 0FFh
mov     bl, al
jmp     loc_1C71B

We then load in another byte, sign extend it to a word (top bit is 1 so it's a negative number)

loc_1C71B:
mov     al, es:[bx+di]
stosb
loop    loc_1C71B

We then load in a byte from our output data, and then put it back on the end, and increment di. We know bx is going to be a negative number, and CX is a number from 2-5 so this code is copying a 2-5 byte chunk from the uncompressed code and sticking it on the front. This is standard for LZ-style compression.

So we now know what the following sets of bits do:

  • 1: copy byte across
  • 00: copy 2-5 bytes from up to 255 bytes back in uncompressed

Let's carry on with that right hand branch we skipped earlier:

image.png

This looks insane. This is the sort of thing that makes you want to just give up and cry. I'll let you cheat on this one and jump straight to my notes:

image.png

We load another 2 bytes and we do some weird arithmetic with it. BX gets 13 of the bits and then gets sign extended, allowing us to look up to 0x2000 bytes back. AH gets 3 bits (from the middle, for some reason), and we check if these are zero. If they aren't, then we skip the jump and get to this:

image.png

We add 2 to our AH and put it in CL. We can add to our list of branches:

  • 1: copy byte across
  • 00: copy 2-5 bytes from up to 255 bytes back in uncompressed
  • 01, next word != XXXXX000XXXXXXXXb, copy 2-9 bytes from up to 8192 bytes back in uncompressed

Now the next branch, what happens when AH is 0?

image.png

We get another byte. If that byte is 0 then we jump to another branch that doesn't loop back, that's our exit code.

  • 1: copy byte across
  • 00: copy 2-5 bytes from up to 255 bytes back in uncompressed
  • 01, next word != XXXXX000XXXXXXXXb, copy 2-9 bytes from up to 8192 bytes back in uncompressed
  • 01, next word == XXXXX000XXXXXXXXb, next byte == 0, break

What about when our next byte is 1? This is a fun one:

image.png

mov     bx, di
and     di, 0Fh
add     di, 2000h

We know DI is our offset pointer for uncompressed data. Here we're saving it in BX, taking just the bottom nybble (masking against 0x0F) and then adding 0x2000.

mov     cl, 4
shr     bx, cl
mov     ax, es
add     ax, bx
sub     ax, 200h
mov     es, ax

And here's the other half, we scale BX down by 4 bits (so it would make a good segment register), add ES, then subtract 0x200, and finally put it all back in ES.

It looks we're resetting DI and bumping ES forward to compensate, this is something we'd do when we're getting to the end of the segment. For example, if we get to 0000:E123, we'd want to make sure we have more room to write to, so we can easily rewrite this as 0E12:0003.

BUT... when we decompress we can look up to 0x2000 bytes backward, so we probably want DI to be at least 0x2000. We can rephrase 0E12:0003 and 0C12:2003, this gives us more room to grow upward, but also lets us reach back to decompress.

mov     bx, si
and     si, 0Fh
shr     bx, cl
mov     ax, ds
add     ax, bx
mov     ds, ax
jmp     loc_1C6C9

This is exactly the same thing but with DS:SI. Note that this doesn't get called all the time, only when the compressed code tells us to. That saves on CPU cycles while only costing a few bytes per 3-4KB of packed data, not bad at all.

  • 1: copy byte across
  • 00: copy 2-5 bytes from up to 255 bytes back in uncompressed
  • 01, next word != XXXXX000XXXXXXXXb, copy 2-9 bytes from up to 8192 bytes back in uncompressed
  • 01, next word == XXXXX000XXXXXXXXb, next byte == 0, break
  • 01, next word == XXXXX000XXXXXXXXb, next byte == 1, rephrase segment/offset pairs

We have one final case and then we're done!

mov     cl, al
inc     cx
jmp     short loc_1C71B

If next byte is 2 or higher then we increment it and use our large BX to look back, letting us copy up to 0x100 bytes:

  • 1: copy byte across
  • 00: copy 2-5 bytes from up to 255 bytes back in uncompressed
  • 01, next word != XXXXX000XXXXXXXXb, copy 2-9 bytes from up to 8192 bytes back in uncompressed
  • 01, next word == XXXXX000XXXXXXXXb, next byte == 0, break
  • 01, next word == XXXXX000XXXXXXXXb, next byte == 1, rephrase segment/offset pairs
  • 01, next word == XXXXX000XXXXXXXXb, next byte == 2, copy 3-256 bytes from up to 8192 bytes back in uncompressed

And we're done with the decompression! I've rewritten this in python as an unpacker, here's how it looks:

while True:
    bit = bitstream.get()
    if bit == 1:
        byte, = input_stream.read(1)
        output_stream.write(bytes([byte]))
    else:
        bit = bitstream.get()
        if bit == 1:
            lowbyte, highbyte = input_stream.read(2)
            copy_distance = 0xE000 | ((highbyte << 5) & 0xFF00) | lowbyte
            copy_distance = convert_unsigned_to_signed(copy_distance, 0x10)
            copy_amount = highbyte & 0x07
            if copy_amount:
                copy_amount += 2
                copy_within_output_stream(output_stream, copy_distance, copy_amount)
            else:
                copy_amount, = input_stream.read(1)
                if copy_amount == 0:
                    break
                elif copy_amount == 1:
                    # segment reshuffle, ignore
                    pass
                else:
                    copy_amount += 1
                    copy_within_output_stream(output_stream, copy_distance, copy_amount)
        else:
            high_bit = bitstream.get()
            low_bit = bitstream.get()
            copy_amount = (high_bit << 1) + low_bit + 2
            copy_distance, = input_stream.read(1)
            copy_distance = convert_unsigned_to_signed(0xFF00 + copy_distance, 0x10)
            copy_within_output_stream(output_stream, copy_distance, copy_amount)

Nice and easy. Let's look at where the break takes us at the end:

image.png

push    cs
pop     ds

We're done with the code, now we're looking at data from this segment

mov     si, 158h

That's pretty weird. DS is in this segment though, so what does DS:158 look like?

image.png

Ahhhh, so we're pulling data out from after the end of the loader. Let's see what we do with it:

pop     bx
add     bx, 10h

We haven't seen the stack in a while, have we? Right at the start we pushed ES and I said two things about ES:

  1. It points to the PSP segment
  2. The PSP is 0x100 bytes long, or in other words, CS is 0x10 bytes greater

So we're reconstructing the original CS that we got loaded into based on the ES that we pushed at the start.

mov     dx, bx
xor     di, di

DX stores old CS? DI holds 0? What could be coming? We have some branches coming up so let's see what happens:

loc_1C769:
lodsb
or      al, al
jz      short loc_1C784

We load a byte and jump if it's zero. Let's see what happens when it isn't zero:

mov     ah, 0
loc_1C770:
add     di, ax

We set DI to 0 at the start, so we're advancing DI by up to 0xFF (from the byte we loaded)

mov     ax, di
and     di, 0Fh

We save DI in AX, and just keep the bottom nybble

mov     cl, 4
shr     ax, cl

We shift AX down, this looks like something we'd add to a segment register

add     dx, ax
mov     es, dx

DX was our old CS from above, so we've added our new offset and store it in ES. In other words, we had DX:DI = CS:0000, we added a byte (let's say 0C66:0028), we stole the high bits from DI and put them into DX (e.g. 0C68:0008), then we put DX into ES.

add     es:[di], bx
jmp     short loc_1C769

BX still points to our original CS, and we're adding it to a value at ES:DI. This looks like what we'd do if we were applying relocations... and an EXE would have a relocation table that would need to be applied once we unpacked the EXE...

So we have branch 1:

  • byte > 0: advance DX:DI and apply relocation

The next branch, when the byte == 0:

loc_1C784:
lodsw
or      ax, ax
jnz     short loc_1C791

We load a word, what happens if it's 0?

add     dx, 0FFFh
mov     es, dx
jmp     short loc_1C769

We advance DX by a lot (0xFFF) and go back to the start.

  • byte > 0: advance DX:DI and apply relocation
  • byte == 0 and next word == 0: advance DX by 0xFFF

What about when the word is >1?

loc_1C791:
cmp     ax, 1
jnz     short loc_1C770

loc_1C770:
...

We've been here before, but we're advancing by a whole word, not just a byte.

  • byte > 0: advance DX:DI and apply relocation
  • byte == 0 and next word == 0: advance DX by 0xFFF
  • byte == 0 and next word > 1: advance DX:DI and apply relocation

And the last case is the end, so we'll hit this once we've applied all the relocations. Before we do this though, let's look at how the python for this works, given we aren't trying apply relocations, but rather we want to rebuild the relocation table for the new EXE we're going to make:

while True:
    first_byte, = input_stream.read(1)
    if first_byte > 0:
        relocation += first_byte
        relocations.append(relocation)
    else:
        lowbyte, highbyte = input_stream.read(2)
        total = (highbyte << 0x08) + lowbyte
        if total == 0:
            relocation += 0xFFF
        elif total == 1:
            break
        else:
            relocation += total
            relocations.append(relocation)

Now we can break apart the final section:

mov     ax, bx

This is the CS that we got loaded to, now stored in AX

mov     di, word ptr unk_1C664
mov     si, word ptr unk_1C666
add     si, ax

We can see further down that these get loaded into SS:SP so we can rename these vars accordingly. We relocate SI (the new SS) by CS to bump it up

add     word ptr unk_1C662, ax

This var looks like a segment register

sub     ax, 10h
mov     ds, ax
mov     es, ax

AX now points to the PSP and we load this into DS and ES

xor     bx, bx

BX = 0...

cli
mov     ss, si
mov     sp, di
sti

Nice to stop interrupts while we're messing with the stack

jmp     dword ptr cs:[bx]

We're jumping to the far pointer at CS:0000. What's there?

image.png

Ahhhhh, so this is the original CS:IP that the packed EXE goes into (which is why we had to relocate the segment stored there). Full notes:

image.png

This was a lot of fun to reverse, fun to write up, and also quite enjoyable putting an unpacker together. You can find the unpacker here, it'll give you a decompressed copy of commander keen and rebuild the EXE header + relocations for you.

Happy reversing dudes