Overview

Problem:

Are you a chicken?  
<host>:<port>  
https://2017.notmalware.ru/90ac48d8e27c81628dd56a6e3926a18cd6a399c3/insanity.tar.bz2  

Unlike my last write up - I did a lot of the work in solving this challenge so this write up will be a bit more detailed. In this challenge we were given a tar.bz2 (that was really an xz compressed file). This file contained an x64 binary (insanity) and 2 shared objects (libpocketsphinx.so.3 and libsphinxbase.so.3).

I started by searching for pocketsphinx to get an idea of what the shared library implements and quickly discovered the github page indicating that it does speech recognition. This tells me that at some point I will most likely need to send an audio file to the service to do something.

model

The first step was getting the program to run more than printing an error Failed to create recognizer, see log for details. Of course there's no log, but looking at the git repo for pocketsphinx and the code

  v4 = cmd_ln_init(
         0LL,
         v3,
         1LL,
         "-hmm",
         "./model/en-us",
         "-lm",
         "./model/en-us.lm.bin",
         "-dict",
         "./model/cmudict-en-us.dict",
         0LL);

shows that I need some sort of model files. Luckily the git repo has the files that worked. Doing a ln -s pocketsphinx/model/en-us/ model fixes the problem and lets the program start.

Audio

The next step was to get some audio to feed into the program and determine the proper format. I searched for some text 2 speech websites and found http://www.text2speech.net that I was able to use to create some audio mp3s (the site has changed format a bit since the competition, previously you could set volume level and voice).

pre-processing

The next step was getting the file into the application. A 4 byte length is read, then that amount of data is read and fed into inflate - so the data is compressed with zlib. The next step is to figure out what a block of simd instructions was doing:

*((_QWORD *)&v63 + 1) = 0x8000800080008000LL;
*(_QWORD *)&v63 = 0x8000800080008000LL;
v15 = _mm_stream_load_si128((__m128i *)&v63);  
do  
{
    v16 = _mm_stream_load_si128(v12);
    v17 = _mm_stream_load_si128(v12 + 1);
    _mm_stream_si128(v13, _mm_xor_si128(_mm_unpacklo_epi8(0LL, v16), v15));
    _mm_stream_si128(v13 + 1, _mm_xor_si128(_mm_unpackhi_epi8(0LL, v16), v15));
    _mm_stream_si128(v13 + 2, _mm_xor_si128(_mm_unpacklo_epi8(0LL, v17), v15));
    _mm_stream_si128(v13 + 3, _mm_xor_si128(_mm_unpackhi_epi8(0LL, v17), v15));
    v13 += 4;
    v12 += 2;
    --v14;
}
while ( v14 );  

To do this, I set a breakpoint before this block of instructions and took a look at the input and then a breakpoint at the end, and compared the input to the output. Looking at the difference, an input of 0x41 0x42 0x43 0x44 was converted to 0x00 0xc1 0x00 0xc2 0x00 0xc3 0x00 0xc4. At first I just assumed looking at the xor value that it was only xoring with 0x8000, but it turns out it's taking an 8 bit value and essentially multiplying it by 256 (turning it into a 16 bit value) and then xoring it with 0x8000.

speech 2 text

The next step was to figure out the correct format for the data. Looking at the tutorial we see that it needs to be a single-channel (monaural), little-endian, unheadered 16-bit signed PCM audio file sampled at 16000 Hz. So the next step was converting it. Luckily linux has command line utilities for just about everything and a quick google search (or dozen) lead to the magic command line of:
ffmpeg -y -i "$in" -f s8 -acodec pcm_s8 -ar 16000 -ac 1 $out.raw to convert an mp3 into a raw file with the correct format. I wish I could say that I immediately knew that I needed to do 8 bit because of the pre-processing done but there was a lot of trial and error in getting the correct format (including sending the sample files in the pocketphoenix repo)

text processing

The next step was to figure out what it was doing with the text processing:

insanity_count = 0;  
input_idx = 0;  
decoded_text = ps_get_hyp(v5, &v61);  
while ( *(_BYTE *)(decoded_text + input_idx) && (unsigned int)(data_idx - 2) <= 1000 )  
{
    if ( !memcmp((const void *)(decoded_text + input_idx), "insanity ", 9uLL) )
    {
        ++insanity_count;
        input_idx += 9;
    }
    else
    {
        if ( memcmp((const void *)(decoded_text + input_idx), "insane", 7uLL) )
            goto reloop;
        v22 = insanity_count;
        input_idx += 7;
        insanity_count = 0;
        opcode_buf[data_idx++] = v22;
    }
}

as we can see it is counting the times the word insanity appears and then when the word insane appears it stores this count into some buffer. So the next step was putting together an audio file of insanity and insane. Luckily with the help of that text2speech website and ffmpeg getting the speech2text for a single word went pretty quick. The next step was putting together a "sentence" that would properly decode.

Luckily silence is a simple as adding a bunch of zeros into the input stream so adding a pause was very simple: we just had to add some zeros between the words. knowing that the sample rate was 16000 hz, adding 4000 zeros would be a 1/4 second pause and that was plenty for the library to recognize distinct words.

much insanity

The next problem came when trying to add more than a few 'insanity's. The library would mess up and change an insanity to 'been to too' or some other nonsense. Looking at the code, the initial buffer read size of compressed data was limited to 64k chunks, but we weren't hitting that - after all, a 4000 zeros compressed really well, we were barely at 16k. So the next step was looking at how the data is decompressed.
Looking at the decompression buffer, it turns out that inflate is given a 64k buffer (aka 4 seconds worth of samples) to decompress into. That buffer is fed to the speech to text; and then inflate is called again, and then the next chunk of decompressed data is it fed to speech2text and so on. So after a bit of thought I reasoned that when the speech2text is fed part of a word in one round and then part the next it doesn't process it correctly.

So first I tried padding out the silence ('\x00') between 'insanity's to make sure that 3 would fit in a single 64k buffer (so ~1/3 a second between words as the word was slightly longer than a second).. This got me further but it would still mess up at around 8 or so.
Then I took a look at the hexdump of the insanity raw file.

00000000  00 ff ff 00 00 ff 00 00  00 00 ff ff 00 00 00 ff  |................|  
00000010  00 00 00 ff 00 ff 00 00  00 ff 00 00 00 00 00 00  |................|  

and I see that it starts with a \x00 ... so I started thinking, i wonder if the zlib is including that first \x00 in with the silence and not extracting the ending silence in a block because that \x00 doesn't fit in the buffer (since I was padding the decompressed data to exactly 64k).

I tried changing the amount of silence and the value used to '\x01' and between words to '\x03' and that worked; I could successfully send a sentence with a whole bunch (i think i tried 80) 'insanity's.

I'm going insane

So now we can decode a single sentence and store a byte into what I eventually called the opcode_buf. The code would continue to read a length and audio data, decode and store a byte until a length of 0 was sent at which point it would go to the emulator portion.

Emulator

The opcode_buf consisted of an array of qwords, with the first 2 initialized; index 0 was set to the address of the opcode_buf (or'd with 0x8000000000000000) and index 1 was set to the last decoded text (or'd with the same value). And then the values stored with insanity/insane-ness were added starting at index 2.

Looking at the operations it was easy to see that when the high bit is set, the value is treated as a pointer or string, and when it's not set it's treated as a numeric value (this also means no negatives with math).

I was quickly able to identify the operations:

  • 0 - exit
  • 1 - push value
  • 2 - add or strcat
  • 3 - subtract (numbers only)
  • 4 - multiply (numbers only)
  • 5 - divide
  • 6 - load
  • 7 - store
  • 8 - jump (if nonzero)
  • 9 - convert numeric to string (single byte only)
  • other - push value = (op_code - 10)

The pc started at index 2 - the first value stored using the speech 2 text. and the data stack started at 1 greater than index of last value stored (a 0 was pushed on in that position). This meant that depending on the operations you could get the pc and data_index to cross.

Every operation that used data values would 'pop' them off, then push the result.

Easier Code

The first thing I noticed is that the jump instruction did not validate that the target was within the 1000 qword opcode buf.. this meant that we could jump to other data... and since almost everything (compressed & uncompressed buffers and opcode_buf) were stored on the stack we could jump to data we sent without having to do the speech2text for everything.

The first step was to figure out how to get data onto the stack without inflate failing (which causes the program to bail) and the speech2text to return something valid (but not necessarily insanity/insane) as when it failed it returned NULL and the program crashed.

I settled on sending the word no followed by the data I wanted to send. When the speech2text decoded no, and it was checked for insanity/insane the program would just go back to the read length part and as long as I sent 0, the extra data would still be in the decompression buffer on the stack.

So now that I knew how to get data onto the stack the next step was getting the emulator to jump to that location. I figured out how far opcode_buf was from the uncompress buffer (not the xor masked buffer.. because I didn't want every other byte to be 0) using gdb. It turned out I needed a jump of over 128k bytes to just get to the start of the decompress buffer, then some to get over the 'no' audio. So I needed to figure out how to get a really large value in the data stack to use as a jump target.

Luckily we have a multiply instruction and doing some root calculations and guessing, we determined that 11*11*12*13 would get us far enough into the decompress buffer to get past the no.

So the instruction sequence to get opcodes to execute in the decompress buffer instead of the opcode buffer (so we can use larger values, and less insanity) turned out to be: (remember we had to send this with the multiple 'insanity's and a single insane audio to get the value we wanted)

  • add (2)
  • mult (4)
  • mult (4)
  • mult (4)
  • mult (4)
  • jmp (8)
  • 1 #truth value for jmp
  • 13
  • 12
  • 11
  • 11

The add instruction was necessary because after finishing reading a 0 was pushed onto the stack as the last data value and multiple would just cause 0's to propagate up.

This sequence would cause the pc to be set to 18883, which was 11976 bytes into the decompressed buffer (1896 bytes larger than the no.raw). Which was perfect.

rip control

Now that I could store large values it was time to figure out how to get execution. We quickly realized that the load and store operations also did not check bounds.

The store operation would load a value at a specific index in opcode_buf, the index and value were pulled from the data stack and as long as the index wasn't negative it would store the value at opcode_buf[index]. This allowed us to have a very simple stack overwrite with whatever we wanted. Given that we knew the relative location of opcode_buf to the return address (-0x41f98 *8 bytes) it was simple to overwrite the return address.
Load worked slightly differently, it actually took a parameter and a data value. The parameter could be either 0 or 1, 0 to load from the opcode buf, and 1 to load from the decoded text. Technically, it used the address stored in index 0 or 1 of the opcode buf. This address had to have the high bit set as if it was a string.
With this load we could read the return address (same offset as before), and then do emulator calculations to figure things out.

drip drip drip - leaking data slowly

Since the entire program was written inside of main, the return address was that of the call to main inside __libc_start_main. We need to leak some addresses in order to determine which libc was in use. Luckily the program printed the top value on the data stack (current item) when the exit instruction was used; if the high bit was set it printed a string, otherwise it printed the value in hex.

Leaking the __libc_start_main return was easy, we just had to load the return address into current data item and exit. However this address wasn't quite enough to get use an exact libc match. In order to leak other addresses finding the GOT would be most helpful. Luckily main calls other things and the return address inside of main is available. And it just so happens there is one call that happens every time before the emulator is run.

Knowing the offset of the bottom of main's stack frame to opcode_buf we can figure out that what index (-29) we need to load (which didn't have negative or positive bounds checks!). Loading this gave us the actual return address pushed by the call to write at 0x1369 in main(). We could then add the offset to the appropriate entry in the got to get the actual address. We couldn't just store this address in offset 0 of opcode_buf because it didn't have the high bit set and that would cause a subsequent load operation to fail (due to checks).

Since setting the high bit with an or operation is the same as adding the same value (since we know the high bit isn't set) we can use the add operations... however we cannot just add 0x800000000000000 as that is a negative number and add cannot be used. instead we first add 0x101 and then add 0x7ffffffffffffeff. The 0x100 is to account for any oddities that could arise with the push_value instruction (which I don't think there are any after looking at the code again during this write up).

The final trick was storing the value in index 0 and then doing another load operation to read the value and exiting.

system()

So now that we knew addresses of a few glibc functions we could look for the bin/sh single instruction gadget in libc. However after talking with some teammates they mentioned that this would not work because sh was busybox and it requires argv[] to be set... so plan 2.

Since we had libc, and knew were system was, and we can find the "/bin/sh" string, all I needed to do was load the address of "/bin/sh" string into rdi and call system. So I found a pop rdi gadget at 0x22b1a in libc, and the bin_sh string at 0x17ccdb, with system() at 0x46640. Now we could use those addresses as well as the address of the call main in __libc_start_main (0x21ec5) to write some emulator instructions to overwrite the stack.

The full chain is a little annoying, but it basically broke down into the following steps:

  • Load main's return address (__libc_start_main)
  • Add the offset for system
  • store the value at main's return rsp + 0x10
  • Load main's return address (__libc_start_main)
  • Add the offset for "/bin/sh"
  • store the value at main's return rsp + 0x8
  • Load main's return address (__libc_start_main)
  • Add the offset for pop rdi gadget
  • store the value at main's return rsp
  • exit instruction

The hardest part for us was finding the correct libc. Even when we found it, for some reason (not sure why) I couldn't run the program locally with that libc. But the address worked and we were able to capture the flag. (sorry I didn't save what it was)

Files:

Other Bugs

I'm just going to list the various other bugs found in the emulator as there were quite a few.

  • The strcat operation set the high bit on the result (no longer an address type)
  • The strcat operation will free a value even if it wasn't necessarily malloc'd
  • The push value would allow pushing of an address

These are probably not useful with a one-shot chance through the emulator.. but ya never know.