Polymorphic Encryption Algorithms

DNA Structure

This article explains all the steps needed to write a C++ program which dynamically generates encryption algorithms in x86 assembly code.

Off-the-shelf encryption algorithms

There is a wide variety of encryption algorithms available. The advantage of these algorithms is that they are heavily studied and their strengths and weaknesses are known. The first type are block cipher algorithms (data is encrypted in blocks of fixed size), including:

There are also stream cipher algorithms (which encrypt data one byte at a time), such as the popular RC4 algorithm. Most encryption algorithms are symmetric, which means that the same key is used for both encrypting and decrypting data. However, there are a number of asymmetric encryption functions, based on public key infrastructure. This group of algorithms includes:

The encryption component of these algorithms uses a public key, while the decryption component requires a private key. As the name implies, public keys can be safely published on the Internet (like the keys in the PGP encryption system). Unlike symmetric encryption algorithms, where a leaked key means anyone can read the message, asymmetric encryption algorithms ensure that only the holder of the private key can decrypt the message. Since the public and private keys are mathematically related, it is theoretically possible to “crack” a public key and obtain the private key, however there are no known practical ways of doing this. In the case of RSA, cracking the public key involves splitting it into its two prime factors (RSA public keys are the product of two very large prime numbers). This factoring process currently takes an enormous time. The RSA company even organized a public challenge to break RSA keys, where the largest prize was $US200,000 for cracking a 2048-bit key.

All the above algorithms are extensively documented, and implementations can easily be found for many programming languages.

Polymorphic engines

Polymorphism has several meanings in the software world. In this case, we are referring to unique, dynamically-generated code (i.e. computer instructions). This technique has its roots in computer viruses. Already in the days of MS-DOS, some computer viruses were encrypted by their creators in order to evade detection by antivirus software. Of course, the viruses needed to contain decryption routines, and these were quickly added to virus signature databases. Soon, virus creators developed decryption algorithms whose code was uniquely generated every time, which allowed viruses to be created which could not be detected using static signatures. The first recorded polymorphic engines date back to the year 1990.

In time, polymorphic algorithms evolved and became ever more sophisticated, in order to make it as difficult as possible for antivirus programs to analyze viruses and run them in an emulated environment. Some well-known algorithms included:

  • KME – Kewl Mutation Engine (by z0mbie and Vecna) – an engine which requires no input data and writes the decrypted code to the stack. Rather than running a decryption algorithm on some encrypted data, the decryption routine is specially constructed to generate the original data every time. This means there is no buffer of encrypted data to be found.
  • MMXE – MultiMedia eXtensions Engine (by Billy Belcebu) – a polymorphic engine which generates MMX code
  • TUAREG – Tameless Unpredictable Anarchic Relentless Encryption Generator (by Mental Driller) – introduces nonlinear data decryption and many innovative methods into the polymorphic engine
  • PPE-II – Prizzy Polymorphic Engine (by Prizzy) – a polymorphic engine which exploits MMX and FPU code, as well as intentional brute-force calculations in its code, so as to slow down operation of emulation strategies used by antivirus software

In most cases, this type of encryption was used in executable file infectors. Due to the level of complexity of polymorphic engines, and the necessity of an in-depth understanding of assembly language for their creation, these days polymorphic engines are rarely used (one exception to this is Virut which, I'd like to point out, is strongly suspected of being Polish in origin!)

However, polymorphic engines have found another use: software protection systems, or exe-protectors, such as PELock, ASProtect, EnigmaProtector, Themida, and Obsidium. They are used to make it more difficult for an application's code to be analyzed by crackers (i.e. people who try to break software protection), and to make it harder for automated unpackers to be created (i.e. tools which remove software protection). If you possess any shareware program which is protected by an exe-protector, there is a 99% chance that it uses a polymorphic encryption engine.

Our own polymorphic engine

In this article, we will cover all the steps necessary to create a simple polymorphic engine, which will serve to encrypt any supplied data, and generate a unique decryption procedure, with the encrypted data embedded within it.

To create this engine, we'll use the C++ language and the AsmJit library, which allows dynamic generation of assembly code. AsmJit makes it possible to create assembly code at the C++ level, as if you were writing it by hand. It is possible to create 32- or 64-bit instructions, and even pseudoinstructions. Thanks to this, the library can be used to target code for 32- and 64-bit environments. In our case we'll stick with 32-bit code.

Listing 1.Pseudocode in C++ for the decryption function that we will generate.

DWORD DecryptionProc(PDWORD lpdwOutput)
{
  // randomly generated encryption key
  DWORD dwDecryptionKey = 0xA39383D;

  // pointer to the encrypted data, which will be located
  // at the end of the decryption function
  PDWORD lpdwInput = reinterpret_cast<PDWORD>(&cEncryptedData);

  // main decryption loop, composed of
  // randomly selected encryption instructions
  for (DWORD i = 0; i < NUM_BLOCKS; i++)
  {
    DWORD dwInputBlock = lpdwInput[i];

    dwInputBlock ^= 0x453BC;
    dwInputBlock += dwDecryptionKey;
     ...

    lpdwOutput[i] = dwInputBlock;
  }

  // return the size of the encrypted block
  // (in bytes)
  return DECRYPTED_DATA_SIZE;

  // the encrypted data, stored at the end
  // of the function
  BYTE cEncryptedData[] = { 0xAB, 0xBA... };
}

Our decryption function will take just one parameter: a pointer to the output buffer where the decrypted data will go. The function will return the size of the decrypted data. Sound simple? Then let's get to work!

Random register selection

Our decryption function will use the standard 32-bit processor registers, and will obtain its one parameter from the stack using the stack frame pointer in the EBP register. The registers used for decryption, for holding a pointer to the encrypted data, and for holding a pointer to the output buffer, will be selected randomly each time the decryption routine is generated.

Listing 2.Random selection of registers for the decryption function.

///////////////////////////////////////////////////////////
//
// random register selection
//
///////////////////////////////////////////////////////////

void CMutagenSPE::RandomizeRegisters()
{
  // set random registers
  AsmJit::GPReg cRegsGeneral[] = { eax, ecx, ebx, edx, esi, edi };

  // shuffle the order of registers in the array
  mixup_array(cRegsGeneral, _countof(cRegsGeneral));

  // the register which will contain
  // a pointer to the encrypted data
  regSrc = cRegsGeneral[0];

  // the register which will contain
  // a pointer to the output buffer
  // (supplied as a function parameter)
  regDst = cRegsGeneral[1];

  // the register which will contain
  // the size of the encrypted data buffer
  regSize = cRegsGeneral[2];

  // the register which will contain
  // the decryption key
  regKey = cRegsGeneral[3];

  // the register which will contain
  // the current data value and on which
  // the decryption instructions will operate
  regData = cRegsGeneral[4];

  // set the register whose values will be
  // preserved across function invocations
  AsmJit::GPReg cRegsSafe[] = { esi, edi, ebx };

  // shuffle the order of the registers in the array
  mixup_array(cRegsSafe, _countof(cRegsSafe));

  regSafe1 = cRegsSafe[0];
  regSafe2 = cRegsSafe[1];
  regSafe3 = cRegsSafe[2];
}

This random choice of registers will contribute to the generated code being different each time.

Prologue of the decryption function

The prologue of the decryption function is simply the first section of the function, which contains some standard elements like setting up the stack frame and preserving registers. The stack frame is a special region reserved on the stack for holding local variables (if the function needs any) as well as holding the parameters supplied to the function. In our case the one parameter will be a pointer to the buffer where we will store the decrypted data.

Structure of the stack frame once the function is invoked
Figure 1.Structure of the stack frame once the function is invoked

Listing 3.Generating the prologue.

///////////////////////////////////////////////////////////
//
// generate the prologue of the decryption function
//
///////////////////////////////////////////////////////////

void CMutagenSPE::GeneratePrologue()
{
  // function prologue
  // first the original value of EBP is saved
  // so we can use EBP to refer to the stack frame
  if (rnd_bin() == 0)
  {
    a.push(ebp);
    a.mov(ebp,esp);
  }
  else
  {
    // equivalent to the instructions
    // push ebp
    // mov ebp,esp
    a.enter(imm(0), imm(0));
  }

  // if our function is called using the stdcall
  // convention, and modifies ESI, EDI, or EBX,
  // they must be saved at the beginning of
  // the function and restored at the end
  a.push(regSafe1);
  a.push(regSafe2);
  a.push(regSafe3);

  // load the pointer to the output buffer
  // into our randomly-selected register regDst
  // (this is the only parameter to the function,
  // passed on the stack)
  a.mov(regDst, dword_ptr(ebp, 0x08 + (4 * 0)));
}

Address of the encrypted data

Our dynamically generated code can be located anywhere in memory and launched from there (assuming the memory region has the appropriate executable flags set). In such cases we cannot refer to parts of the function using absolute memory addresses, because we simply don't know where the function will reside. The encrypted data is located just after the end of the decryption routine, and we don't know its address in advance either. We will need to determine its address at runtime, through the use of relative addressing.

We will employ the delta offset technique for this purpose. It depends on the use of the call assembly instruction, normally used to call a function. We take advantage of the fact that this instruction stores a return address on the stack before jumping to a chosen memory address.

Listing 4.Obtaining the current code address through the use of the delta offset technique.

Decryptor proc near

; preserve the original value of EBP
	push	ebp

; this puts the return address on the stack
	call	delta_offset
delta_offset:

; pop the return address into EBP
; it will be the address of the label 'delta_offset'
	pop	ebp

; load the true address of var_1 by using the
; relative address with reference to delta_offset
	lea	eax,[ebp + (var_1 - delta_offset)]

; verify that our delta offset address is
; correct; return TRUE / FALSE
	cmp	dword ptr[eax],0DEADC0DEh
	sete	al
	movzx   eax,al

; restore the value of EBP
	pop	ebp

	ret

; this data value is placed directly after
; the function
	var_1   dd 0DEADC0DEh

Decryptor endp

The return address generated by call delta_offset will refer to the next instruction, in this case pop ebp. Once we know the memory address of this instruction, we can work out the address of anything else in our function. In our case, we work out the location of our "encrypted" data by adding the number of bytes between the delta_offset: label and the end of the function.

An interesting alternative for calculating a delta offset address involves the use of the FPU instruction fnstenv, which stores the environment state of the FPU. Among other things, this contains the memory location of the most recently executed FPU instruction. Like call, this allows us to obtain the location of an instruction in our function and we can use a similar delta offset calculation to refer to data stored after the function.

Listing 5.Using FPU instructions to calculate a delta offset address.

DeltaOffsetFPUTest proc uses esi

; local buffer for FPU environment
	local	fpEnvironment[32]:byte

; initialize FPU
	finit

; execute any FPU instruction (e.g. fld1
; fldz, fldln2, fldlg2 etc.)
delta_offset:
	fldpi

; save the FPU environment state
	lea	eax,fpEnvironment
	fnstenv byte ptr[eax]

; read the address of the last executed
; FPU instruction – it should point to the
; address of the fldpi instruction, that is,
; the address of the delta_offset label
	mov	esi,dword ptr[eax+12]

; remove 1 register from the FPU stack
; (after fldpi)
	fistp   dword ptr[esp-4]

; get the address of var_1 by using its
; address relative to delta_offset
	lea	eax,[esi + (var_1 - delta_offset)]

; verify that our delta offset address is
; correct; return TRUE / FALSE
	cmp	dword ptr[eax],0ABBAh
	sete	al
	movzx   eax,al

	ret

; value written just after the function body
	var_1   dd 0ABBAh

DeltaOffsetFPUTest endp

This approach to the delta offset calculation is sometimes used in exploits, e.g. in the Metasploit framework. A similar technique is also used in debugger detection, where the fnstenv instruction will reveal that the most recently executed FPU instruction is not part of our program, but refers to some other FPU instruction, which is directly or indirectly executed by the debugger. In this way, it is possible to discover tracing of our code by debuggers including OllyDbg.

Delta offset calculations can rouse the suspicions of antivirus programs, since normal applications do not have a need for this kind of thing. The combination of call and pop r32 can cause the application to be suspected of containing malware. In order to prevent this, we must generate extra code between these two instructions, and utilize a different means of getting values from the stack.

Listing 6.Generating code for relative addressing.

///////////////////////////////////////////////////////////
//
// generate delta offset
//
///////////////////////////////////////////////////////////

void CMutagenSPE::GenerateDeltaOffset()
{
  // generate code which will allow us to
  // obtain a pointer to the encrypted data
  // at the end of the decryption function

  // decryption_function:
  // ...
  // call delta_offset
  // mov eax,1 | xor eax,eax ; \
  // leave                   ;  > unused instructions
  // ret 4                   ; /
  // delta_offset:
  // pop regSrc
  // add regSrc, (encrypted_data-delta_offset +
  // ...          + size of the unused instructions)
  // ret 4
  // db 0CCh, 0CCh...
  // encrypted_data:
  // db 0ABh, 0BBh, 083h...

  // create the delta_offset label
  lblDeltaOffset = a.newLabel();

  // generate 'call delta_offset'
  a.call(lblDeltaOffset);

  sysint_t posUnusedCodeStart = a.getOffset();

  // in order to avoid getting flagged by
  // antivirus software, we avoid the typical
  // delta offset construction, i.e. call + pop,
  // by inserting some unused instructions in
  // between, in our case a sequence that looks
  // like the normal code which returns from
  // a function
  if (rnd_bin() == 0)
  {
    a.mov(eax, imm(1));
  }
  else
  {
    a.xor_(eax,eax);
  }

  a.leave();
  a.ret(1 * sizeof(DWORD));

  // calculate the size of the unused code,
  // i.e. the difference between the current
  // position and the beginning of the
  // unused code
  dwUnusedCodeSize = static_cast<DWORD>(a.getOffset() - posUnusedCodeStart);

  // put the label "delta_offset:" here
  a.bind(lblDeltaOffset);

  posDeltaOffset = a.getOffset();

  // instead of the pop instruction, we will
  // use a different method of reading the
  // stack, to avoid rousing the suspicions of
  // antivirus programs

  //a.pop(regSrc);
  a.mov(regSrc, dword_ptr(esp));
  a.add(esp, imm(sizeof(DWORD)));

  // the address of the label "delta_offset:"
  // will now be in the regSrc register;
  // we need to adjust this by the size of
  // the remainder of the function (which we
  // don't know and will have to update later)
  // for now we use the value 987654321 to
  // ensure AsmJit generates the long form of
  // the "add" instruction
  a.add(regSrc, imm(987654321));

  // save the position of the previous DWORD
  // so that we can later update it to contain
  // the length of the remainder of the function
  posSrcPtr = a.getOffset() - sizeof(DWORD);
}

At the moment, we don't know the size of the remainder of the decryption function, which is why we will need to update this value later, once we have generated all the other instructions of the decryption function.

Encryption

The encryption of the data will occur in 4-byte blocks. The data can be smaller than this (it will be rounded up to a multiple of 4). Pseudoinstructions will be randomly generated for the encryption process, and they will later be used to generate the decryption code.

Listing 7.Encryption.

///////////////////////////////////////////////////////////
//
// generate the encryption keys, encryption instructions,
// and finally encrypt the input data
//
///////////////////////////////////////////////////////////

void CMutagenSPE::EncryptInputBuffer(PBYTE lpInputBuffer, \
                                     DWORD dwInputBuffer, \
                                     DWORD dwMinInstr, \
                                     DWORD dwMaxInstr)
{
  // generate an encryption key
  dwEncryptionKey = rnd_dword();

  // round up the size of the input buffer
  DWORD dwAlignedSize = align_dword(dwInputBuffer);

  // number of blocks to encrypt
  // divide the size of the input data
  // into blocks of 4 bytes (DWORDs)
  dwEncryptedBlocks = dwAlignedSize / sizeof(DWORD);

  PDWORD lpdwInputBuffer = reinterpret_cast<PDWORD>(lpInputBuffer);

  // allocate memory for the output data
  // (the size will be rounded to the
  // block size)
  di_valloc(&diEncryptedData, dwAlignedSize);

  PDWORD lpdwOutputBuffer = reinterpret_cast<PDWORD>(diEncryptedData.lpPtr);

  // randomly select the number of encryption instructions
  dwCryptOpsCount = rnd_range(dwMinInstr, dwMaxInstr);

  // allocate memory for an array which will
  // record information about the sequence of
  // encryption instructions
  di_valloc(&diCryptOps, dwCryptOpsCount * sizeof(SPE_CRYPT_OP));

  // set up a direct pointer to this table
  // in a helper variable
  lpcoCryptOps = reinterpret_cast<P_SPE_CRYPT_OP>(diCryptOps.lpPtr);

  // generate encryption instructions and their type
  for (DWORD i = 0; i < dwCryptOpsCount; i++)
  {
    // will the instruction perform an operation
    // combining regData and regKey?
    lpcoCryptOps[i].bCryptWithReg = rnd_bool();

    // the register we are operating on
    lpcoCryptOps[i].regDst = regData;

    // if the instruction doesn't use the regKey
    // register, generate a random key which
    // will be used in the operation
    if (lpcoCryptOps[i].bCryptWithReg == FALSE)
    {
      lpcoCryptOps[i].dwCryptValue = rnd_dword();
    }
    else
    {
      lpcoCryptOps[i].regSrc = regKey;
    }

    // randomly choose the type of encryption instruction
    lpcoCryptOps[i].cCryptOp = static_cast<BYTE>(rnd_range(SPE_CRYPT_OP_ADD, SPE_CRYPT_OP_NEG));
  }

  // encrypt the input data according to the
  // instructions we have just generated
  for (DWORD i = 0, dwInitialEncryptionKey = dwEncryptionKey; \
       i < dwEncryptedBlocks; i++)
  {
    // take the next block for encryption
    DWORD dwInputBlock = lpdwInputBuffer[i];

    // encryption loop: executes the sequence of
    // encryption instructions on the data block
    for (DWORD j = 0, dwCurrentEncryptionKey; j < dwCryptOpsCount; j++)
    {
      if (lpcoCryptOps[j].bCryptWithReg == FALSE)
      {
        dwCurrentEncryptionKey = lpcoCryptOps[j].dwCryptValue;
      }
      else
      {
        dwCurrentEncryptionKey = dwInitialEncryptionKey;
      }

      // depending on the encryption operation,
      // perform the appropriate modification
      // of the data block
      switch(lpcoCryptOps[j].cCryptOp)
      {
      case SPE_CRYPT_OP_ADD:
        dwInputBlock += dwCurrentEncryptionKey;
        break;
      case SPE_CRYPT_OP_SUB:
        dwInputBlock -= dwCurrentEncryptionKey;
        break;
      case SPE_CRYPT_OP_XOR:
        dwInputBlock ^= dwCurrentEncryptionKey;
        break;
      case SPE_CRYPT_OP_NOT:
        dwInputBlock = ~dwInputBlock;
        break;
      case SPE_CRYPT_OP_NEG:
        dwInputBlock = 0L - dwInputBlock;
        break;
      }
    }

    // store the encrypted block in the buffer
    lpdwOutputBuffer[i] = dwInputBlock;
  }
}

Setting up the decryption keys

Our algorithm will use a randomly generated encryption key, but to prevent the key appearing directly in the code, we will encrypt it with a second, random key. At the start of the decryption function we will load the encrypted key into regKey, and then decrypt it.

Listing 8.Generating initialization code for the regKey register.

///////////////////////////////////////////////////////////
//
// set up the keys which will be used to decrypt the data
//
///////////////////////////////////////////////////////////

void CMutagenSPE::SetupDecryptionKeys()
{
  // set up a decryption key in the regKey
  // register, which will itself be encrypted
  DWORD dwKeyModifier = rnd_dword();

  // randomly generate instructions to set up
  // the decryption key
  switch(rnd_max(2))
  {
  // mov regKey,dwKey - dwMod
  // add regKey,dwMod
  case 0:
    a.mov(regKey, imm(dwEncryptionKey - dwKeyModifier));
    a.add(regKey, imm(dwKeyModifier));
    break;

  // mov regKey,dwKey + dwMod
  // sub regKey,dwMod
  case 1:
    a.mov(regKey, imm(dwEncryptionKey + dwKeyModifier));
    a.sub(regKey, imm(dwKeyModifier));
    break;

  // mov regKey,dwKey ^ dwMod
  // xor regKey,dwMod
  case 2:
    a.mov(regKey, imm(dwEncryptionKey ^ dwKeyModifier));
    a.xor_(regKey, imm(dwKeyModifier));
    break;
  }
}

Decryption

Using the pseudoinstructions we generated earlier and which we used to encrypt the data, we will now generate instructions which perform the encryption process in reverse, and place them in a loop that repeats this process on each input block. Before the loop we will initialize the register we have called regSize with the number of blocks which need to be decrypted.

Listing 9.Generating the code of the decryption loop.

///////////////////////////////////////////////////////////
//
// generate the decryption code (for the main decryption loop)
//
///////////////////////////////////////////////////////////

void CMutagenSPE::GenerateDecryption()
{
  // set up the size of the encrypted data
  // (in blocks)
  a.mov(regSize, imm(dwEncryptedBlocks));

  // create a label for the start of the
  // decryption loop
  Label lblDecryptionLoop = a.newLabel();

  a.bind(lblDecryptionLoop);

  // read the data referred to by the
  // regSrc register
  a.mov(regData, dword_ptr(regSrc));

  // build the decryption code by generating each
  // decryption instruction in turn (reversing the
  // order and the operations that were used for
  // encryption!)
  for (DWORD i = dwCryptOpsCount - 1; i != -1L; i--)
  {
    // encryption was done either with the key
    // in register regKey, or a constant value,
    // so depending on this we need to generate
    // the appropriate decryption instructions
    if (lpcoCryptOps[i].bCryptWithReg == FALSE)
    {
      DWORD dwDecryptionKey = lpcoCryptOps[i].dwCryptValue;

      switch(lpcoCryptOps[i].cCryptOp)
      {
      case SPE_CRYPT_OP_ADD:
        a.sub(lpcoCryptOps[i].regDst, imm(dwDecryptionKey));
        break;
      case SPE_CRYPT_OP_SUB:
        a.add(lpcoCryptOps[i].regDst, imm(dwDecryptionKey));
        break;
      case SPE_CRYPT_OP_XOR:
        a.xor_(lpcoCryptOps[i].regDst, imm(dwDecryptionKey));
        break;
      case SPE_CRYPT_OP_NOT:
        a.not_(lpcoCryptOps[i].regDst);
        break;
      case SPE_CRYPT_OP_NEG:
        a.neg(lpcoCryptOps[i].regDst);
        break;
      }
    }
    else
    {
      switch(lpcoCryptOps[i].cCryptOp)
      {
      case SPE_CRYPT_OP_ADD:
        a.sub(lpcoCryptOps[i].regDst, lpcoCryptOps[i].regSrc);
        break;
      case SPE_CRYPT_OP_SUB:
        a.add(lpcoCryptOps[i].regDst, lpcoCryptOps[i].regSrc);
        break;
      case SPE_CRYPT_OP_XOR:
        a.xor_(lpcoCryptOps[i].regDst, lpcoCryptOps[i].regSrc);
        break;
      case SPE_CRYPT_OP_NOT:
        a.not_(lpcoCryptOps[i].regDst);
        break;
      case SPE_CRYPT_OP_NEG:
        a.neg(lpcoCryptOps[i].regDst);
        break;
      }
    }
  }

  // write the decrypted block to the output
  // buffer
  a.mov(dword_ptr(regDst), regData);

  // update the pointers to the input and ouput
  // buffers to point to the next block
  a.add(regSrc, imm(sizeof(DWORD)));
  a.add(regDst, imm(sizeof(DWORD)));

  // decrement the loop counter (the number of
  // blocks remaining to decrypt)
  a.dec(regSize);

  // check if the loop is finished
  // if not, jump to the start
  a.jne(lblDecryptionLoop);
}

The loop takes successive encrypted blocks of data from the buffer at the end of the function, then carries out all the decryption instructions on each block, and writes the result to the output buffer. After decrypting a data block, the pointers to the encrypted data and the decrypted data buffer are updated, and the loop counter which indicates the number of blocks remaining is decremented. If it has not reached 0, the loop is repeated.

Setting final register values

Our decryption function will return a value of type DWORD (a 32-bit value corresponding to unsigned int), which indicates the size of the decrypted data. This value will be returned in the EAX register.

Listing 10.Setting up the return value of the decryption function (the size of the decrypted data) as well as any other final values we wish to place in registers.

///////////////////////////////////////////////////////////
//
// set up output registers, including the function return value
//
///////////////////////////////////////////////////////////

void CMutagenSPE::SetupOutputRegisters(SPE_OUTPUT_REGS *regOutput, DWORD dwCount)
{
  // if there are no output registers to
  // set up, return
  if ((regOutput == NULL) || (dwCount == 0))
  {
    return;
  }

  // shuffle the order in which the registers
  // will be set up
  mixup_array(regOutput, dwCount);

  // generate instructions to set up the
  // output registers
  // mov r32, imm32
  for (DWORD i = 0; i < dwCount; i++)
  {
    a.mov(regOutput[i].regDst, imm(regOutput[i].dwValue));
  }
}

Our polymorphic engine allows any set of output registers to be defined (that is, we are not limited to setting the value returned in EAX). This makes it possible for the function to output extra values.

Function epilogue

The epilogue, in other words, the last portion of the function, is where the original value of the EBP register is restored, and if necessary, the sensitive registers ESI EDI EBX, whose state must be preserved across function calls according to the stdcall convention.

Listing 11.Generating the epilogue of the decryption function.

///////////////////////////////////////////////////////////
//
// generate epilogue of the decryption function
//
///////////////////////////////////////////////////////////

void CMutagenSPE::GenerateEpilogue(DWORD dwParamCount)
{
  // restore the original values of
  // registers ESI EDI EBX
  a.pop(regSafe3);
  a.pop(regSafe2);
  a.pop(regSafe1);

  // restore the value of EBP
  if (rnd_bin() == 0)
  {
    a.leave();
  }
  else
  {
    // equivalent to "leave"
    a.mov(esp,ebp);
    a.pop(ebp);
  }

  // return to the code which called
  // our function; additionally adjust
  // the stack by the size of the passed
  // parameters (by stdcall convention)
  a.ret(imm(dwParamCount * sizeof(DWORD)));
}

Alignment

Instructions which access memory are faster if the data they read or write is aligned, i.e. divisible by the size of the block of memory being accessed. For instance, our encrypted data is accessed in 32-bit blocks (since we are using 32-bit registers), which means memory addresses should be divisible by 4. This is due to processor cache effects. If memory addresses are correctly aligned, the data can be stored in the fast memory attached to the processor known as L1 cache. Non-aligned addresses will be read from the slower L2 cache or directly from the computer's RAM. For small blocks of data, this does not create a noticeable difference, but for large buffers the performance reduction can be significant and should be taken into account.

Listing 12.Aligning the size of the decryption function to the specified granularity.

///////////////////////////////////////////////////////////
//
// align the size of the decryption function
// to the specified granularity
//
///////////////////////////////////////////////////////////

void CMutagenSPE::AlignDecryptorBody(DWORD dwAlignment)
{
  // take the current size of the code
  DWORD dwCurrentSize = a.getCodeSize();

  // find the number of bytes that would
  // align the size to a multiple of the
  // supplied size (e.g. 4)
  DWORD dwAlignmentSize = align_bytes(dwCurrentSize, dwAlignment) - dwCurrentSize;

  // check if any alignment is required
  if (dwAlignmentSize == 0)
  {
    return;
  }

  // add padding instructions (int3 or nop)
  if (rnd_bin() == 0)
  {
    while (dwAlignmentSize--) a.int3();
  }
  else
  {
    while (dwAlignmentSize--) a.nop();
  }
}

Alignment between functions is normally achieved with the 1-byte instruction nop (no-operation) or int3 (an interrupt for the purpose of trapping into a debugger). Alignment is also performed for the beginning of loops, using instructions which have the same effect as nop but have a longer binary encoding, such as lea eax,[eax*8+eax+00000000]. In our case we won't bother with this level of optimization.

Fixing up delta offset addresses

Having generated the entire code of the decryption function, we can now correct any relative addresses used earlier, which in our case is just the address of the encrypted data.

Listing 13.Correcting the address of the encrypted data in previously generated code.

///////////////////////////////////////////////////////////
//
// correct all instructions making use of
// addressing relative to the delta offset
//
///////////////////////////////////////////////////////////

void CMutagenSPE::UpdateDeltaOffsetAddressing()
{
  DWORD dwAdjustSize = static_cast<DWORD>(a.getOffset() - posDeltaOffset);

  // correct the instruction which sets up
  // a pointer to the encrypted data block
  // at the end of the decryption function
  //
  // this pointer is loaded into the regSrc
  // register, and must be updated by the
  // size of the remainder of the function
  // after the delta_offset label
  a.setDWordAt(posSrcPtr, dwAdjustSize + dwUnusedCodeSize);
}

Appending the encrypted data

Now that the entire code of the function has been generated, we can write out the encrypted data block, so it will follow the code of the function.

Listing 14.Place the encrypted data block after the end of the decryption function.

///////////////////////////////////////////////////////////
//
// append the encrypted data to the end of the code
// of the decryption function
//
///////////////////////////////////////////////////////////

void CMutagenSPE::AppendEncryptedData()
{
  PDWORD lpdwEncryptedData = reinterpret_cast<PDWORD>(diEncryptedData.lpPtr);

  // place the encrypted data buffer
  // at the end of the decryption function
  // (in 4-byte blocks)
  for (DWORD i = 0; i < dwEncryptedBlocks; i++)
  {
    a._emitDWord(lpdwEncryptedData[i]);
  }
}

Our entire polymorphic engine

Now that you have seen the successive elements of the operation of our polymorphic engine, it's time to bring it all together. I have included the class header below, followed by the implementation of the main function which generates the polymorphic code.

Listing 15.Definition of the polymorphic engine class.

#include "mutagen.h"

class CMutagenSPE : private CMutagen
{
public:
  CMutagenSPE(void);
  ~CMutagenSPE(void);

  // the main function which performs the
  // encryption and generates the polymorphic
  // code
  CMutagen::erCodes PolySPE(PBYTE lpInputBuffer, \
                            DWORD dwInputBuffer, \
                            PBYTE *lpOutputBuffer, \
                            PDWORD lpdwOutputSize);

private:

  // a structure describing the values of
  // the output registers
  typedef struct _SPE_OUTPUT_REGS {

    // target register
    AsmJit::GPReg regDst;

    // value to write in this register
    DWORD dwValue;

  } SPE_OUTPUT_REGS, *P_SPE_OUTPUT_REGS;

  // description of an encryption operation
  typedef struct _SPE_CRYPT_OP {

    // TRUE if the operation is performed
    // on two registers; FALSE if it is
    // performed between the target register
    // and the value in dwCryptValue
    BOOL bCryptWithReg;

    AsmJit::GPReg regDst;
    AsmJit::GPReg regSrc;

    // encryption operation
    BYTE cCryptOp;

    // encryption value
    DWORD dwCryptValue;

  } SPE_CRYPT_OP, *P_SPE_CRYPT_OP;

  enum
  {
    SPE_CRYPT_OP_ADD = 0,
    SPE_CRYPT_OP_SUB,
    SPE_CRYPT_OP_XOR,
    SPE_CRYPT_OP_NOT,
    SPE_CRYPT_OP_NEG,
  };

  // buffer with the encryption operations
  DATA_ITEM diCryptOps;

  // pointer to the table of encryption
  // operations
  P_SPE_CRYPT_OP lpcoCryptOps;

  // count of encryption operations
  DWORD dwCryptOpsCount;

  // pointer to the encrypted data block
  DATA_ITEM diEncryptedData;

  // number of blocks of encrypted data
  DWORD dwEncryptedBlocks;

  // encryption key
  DWORD dwEncryptionKey;

  // AsmJit Assembler instance
  Assembler a;

  // the register which will store a pointer
  // to the data which is to be decrypted
  AsmJit::GPReg regSrc;

  // the register which will store a pointer
  // to the output buffer
  AsmJit::GPReg regDst;

  // the register which hold the size of the
  // encrypted data
  AsmJit::GPReg regSize;

  // the register with the encryption key
  AsmJit::GPReg regKey;

  // the register on which the decryption
  // instructions will operate
  AsmJit::GPReg regData;

  // the preserved registers (ESI EDI EBX in random order)
  AsmJit::GPReg regSafe1, regSafe2, regSafe3;

  // the delta_offset label
  Label lblDeltaOffset;

  // the position of the delta offset
  sysint_t posDeltaOffset;

  // the relative address of the encrypted data
  sysint_t posSrcPtr;

  // the size of the unused code between delta
  // offset and the instructions which get that
  // value from the stack
  DWORD dwUnusedCodeSize;

  // helper methods
  void RandomizeRegisters();
  void GeneratePrologue();
  void GenerateDeltaOffset();
  void EncryptInputBuffer(PBYTE lpInputBuffer, \
                          DWORD dwInputBuffer, \
                          DWORD dwMinInstr, \
                          DWORD dwMaxInstr);
  void SetupDecryptionKeys();
  void GenerateDecryption();
  void SetupOutputRegisters(SPE_OUTPUT_REGS *regOutput, \
                            DWORD dwCount);
  void GenerateEpilogue(DWORD dwParamCount);
  void AlignDecryptorBody(DWORD dwAlignment);
  void AppendEncryptedData();
  void UpdateDeltaOffsetAddressing();
};

Listing 16.The main function, which performs the encryption and generates the polymorphic code.

///////////////////////////////////////////////////////////
//
// main function - encrypts data and generates polymorphic
//                 decryptor code
//
///////////////////////////////////////////////////////////

CMutagen::erCodes CMutagenSPE::PolySPE(PBYTE lpInputBuffer, \
                                       DWORD dwInputBuffer, \
                                       PBYTE *lpOutputBuffer, \
                                       PDWORD lpdwOutputSize)
{
  ///////////////////////////////////////////////////////////
  //
  // check input parameters
  //
  ///////////////////////////////////////////////////////////

  if ( (lpInputBuffer == NULL) || (dwInputBuffer == 0) || \
       (lpOutputBuffer == NULL) || (lpdwOutputSize == NULL) )
  {
    return CMutagen::MUTAGEN_ERR_PARAMS;
  }

  // randomly select registers
  RandomizeRegisters();

  ///////////////////////////////////////////////////////////
  //
  // generate polymorphic function code
  //
  ///////////////////////////////////////////////////////////

  // generate function prologue
  GeneratePrologue();

  // set up relative addressing through the delta offset
  // technique
  GenerateDeltaOffset();

  // encrypt the input data, generate encryption keys
  // the additional parameters set the lower and upper
  // limits on the number of encryption instructions
  // which will be generated (there is no limit to this
  // number, you can specify numbers in the thousands,
  // but be aware that this will make the output code
  // quite large)
  EncryptInputBuffer(lpInputBuffer, dwInputBuffer, 3, 5);

  // generate code to set up keys for decryption
  SetupDecryptionKeys();

  // generate decryption code
  GenerateDecryption();

  // set up the values of the output registers
  SPE_OUTPUT_REGS regOutput[] = { { eax, dwInputBuffer } };

  SetupOutputRegisters(regOutput, _countof(regOutput));

  // generate function epilogue
  GenerateEpilogue(1L);

  // align the size of the function to a multiple
  // of 4 or 16
  AlignDecryptorBody(rnd_bin() == 0 ? 4L : 16L);

  // fix up any instructions that use delta offset addressing
  UpdateDeltaOffsetAddressing();

  // place the encrypted data at the end of the function
  AppendEncryptedData();

  ///////////////////////////////////////////////////////////
  //
  // free resources
  //
  ///////////////////////////////////////////////////////////

  // free the encrypted data buffer
  di_vfree(&diEncryptedData);

  // free the array of encryption pseudoinstructions
  di_vfree(&diCryptOps);

  ///////////////////////////////////////////////////////////
  //
  // copy the polymorphic code to the output buffer
  //
  ///////////////////////////////////////////////////////////

  DWORD dwOutputSize = a.getCodeSize();

  // assemble the code of the polymorphic function
  // (this resolves jumps and labels)
  PVOID lpPolymorphicCode = a.make();

  // this struct describes the allocated memory block
  DATA_ITEM diOutput;

  // allocate memory (with execute permissions) for the
  // output buffer
  di_valloc(&diOutput, dwOutputSize);

  // check that allocation was successful
  if (diOutput.lpPtr != NULL)
  {
    // copy the generated code of the decryption function
    memcpy(diOutput.lpPtr, lpPolymorphicCode, dwOutputSize);

    // provide the output buffer and code size to
    // this function's caller
    *lpOutputBuffer = diOutput.lpPtr;
    *lpdwOutputSize = dwOutputSize;

    MemoryManager::getGlobal()->free(lpPolymorphicCode);
  }
  else
  {
    MemoryManager::getGlobal()->free(lpPolymorphicCode);

    return CMutagen::MUTAGEN_ERR_MEMORY;
  }

  ///////////////////////////////////////////////////////////
  //
  // function exit
  //
  ///////////////////////////////////////////////////////////

  return CMutagen::MUTAGEN_ERR_SUCCESS;
}

Test

In order to test out the code, we will encrypt a text string with our engine and then call the generated code of the decryption function. It's important to remember that the memory where the code is found should be allocated with the correct executable flags. Otherwise, the operating system's protection mechanisms, like e.g. Windows' DEP (Data Execution Prevention) feature will throw an exception if you attempt to call the function.

Listing 17.Testing the polymorphic engine.

#include <conio.h>
#include "mutagen\mutagen.h"
#include "mutagen\mutagen_spe.h"

// declare the prototype of the decryption procedure
typedef DWORD(__stdcall *DecryptionProc)(PVOID);

int __cdecl main()
{
  // input data (in this case a simple string,
  // although it could be any data buffer)
  char szHelloWorld[] = "Hello world!";

  // create an instance of the polymorphic
  // engine
  CMutagenSPE *speEngine = new CMutagenSPE();

  // a pointer to the generated decryption
  // function will be placed here
  PBYTE lpcDecryptionProc = NULL;

  // the size of the decryption code (and
  // its encrypted payload) will go here
  DWORD dwDecryptionProcSize = 0;

  // encrypt the input data and dynamically
  // generate a decryption function
  speEngine->PolySPE(reinterpret_cast<PBYTE>(szHelloWorld), \
                     sizeof(szHelloWorld), \
                     &lpcDecryptionProc, \
                     &dwDecryptionProcSize);

  // write the generated function to disk
  FILE *hFile = fopen("polymorphic_code.bin", "wb");

  if (hFile != NULL)
  {
    fwrite(lpDecryptionProc, dwDecryptionProcSize, 1, hFile);
    fclose(hFile);
  }

  // cast the function pointer to the right type
  DecryptionProc lpDecryptionProc = reinterpret_cast<DecryptionProc>(lpcDecryptionProc);

  // the output buffer for the decrypted data
  char szOutputBuffer[128] = { 0xCC };

  // call the decryption function via its
  // function pointer
  DWORD dwOutputSize = lpDecryptionProc(szOutputBuffer);

  // display the decrypted text - if everything
  // went correctly this will show "Hello world!"
  printf(szOutputBuffer);

  return 0;
}

What does the generated code look like?

The generated code will look different each time. This is the main goal of a polymorphic engine, after all! I have included two different decryption functions below, both generated by our engine from the same input data.

Listing 18.Decryption function.

; beginning of decryption function (prologue)
	enter   0, 0

; save sensitive registers (stdcall)
	push	esi
	push	ebx
	push	edi

; get the parameter passed to this function
	mov	ecx, [ebp+8]

	call	delta_offset

; unused instructions
	xor	eax, eax
	leave
	retn	4

delta_offset:

; EBX will contain the address of the label delta_offset
	mov	ebx, [esp]
	add	esp, 4

; correct the address so that it points to the encrypted data
	add	ebx, 51h

; set up the decryption key
	mov	edx, 918D3D1Bh
	add	edx, 78170F2Ah

; number of blocks to decrypt
	mov	eax, 4

decryption_loop:

; take one block of encrypted data
	mov	esi, [ebx]

; decryption instructions (random)
	xor	esi, 138E6781h
	sub	esi, edx
	not	esi
	xor	esi, edx

; write the decrypted block to the output buffer
	mov	[ecx], esi

	add	ebx, 4
	add	ecx, 4
	dec	eax
	jnz	short decryption_loop

; set up values returned by the function
; namely the size of the decrypted data
	mov	eax, 0Dh

; function epilogue and return to caller
	pop	edi
	pop	ebx
	pop	esi
	leave
	retn	4

; align function to 16 byte boundary (using NOP instructions)
	db 5 dup(90h)

; encrypted data
	db 0B6h, 044h, 052h, 0B0h, 09Bh, 087h, 05Eh, 0B1h
	db 08Ch, 04Bh, 06Ah, 0F3h, 07Eh, 0ACh, 0B5h, 057h

Listing 19.Decryption function in another variant.

; beginning of decryption function (prologue)
	push	ebp
	mov	ebp, esp

; save sensitive registers (stdcall)
	push	edi
	push	ebx
	push	esi

; get the parameter passed to this function
	mov	esi, [ebp+8]

	call	delta_offset

; unused instructions
	xor	eax, eax
	leave
	retn	4

delta_offset:

; ECX will contain the address of the label delta_offset
	mov	ecx, [esp]
	add	esp, 4

; correct the address so that it points to the encrypted data
	add	ecx, 4Ah

; set up the decryption key
	mov	ebx, 76F71EBFh
	sub	ebx, 50531439h

; number of blocks to decrypt
	mov	edi, 4

; decryption loop
decryption_loop:

; take one block of encrypted data
	mov	eax, [ecx]

; decryption instructions (random)
	neg	eax
	sub	eax, 4B1A7C17h
	not	eax
	xor	eax, ebx
	neg	eax

; write the decrypted block to the output buffer
	mov	[esi], eax

	add	ecx, 4
	add	esi, 4
	dec	edi
	jnz	short decryption_loop

; set up values returned by the function
; namely the size of the decrypted data
	mov	eax, 0Dh

; function epilogue and return to caller
	pop	esi
	pop	ebx
	pop	edi
	leave
	retn	4

; align function to 16 byte boundary (using int3 instructions)
	db 5 dup(0CCh)

; encrypted data
	db 028h, 014h, 01Dh, 06Ah, 001h, 059h, 012h, 06Bh
	db 0F2h, 01Ch, 025h, 0ADh, 070h, 0C2h, 07Ch, 0CAh

What next?

Our polymorphic engine is pretty simple at the moment. We could add additional features, such as:

  • generating spurious instructions, known as junks – these are simply blocks of code which make the code more difficult to analyze in a debugger or disassembler
  • generating white noise – that is, instructions which don't affect the operation of the function (e.g. adding a value to a random register and then taking it away again)
  • generating equivalent instructions (code mutations) in various forms, joined by random comparisons and jumps
  • generating additional helper functions, e.g. returning values which were originally put in place by code in the main decryption function
  • changes to the calling convention used, e.g. using cdecl instead of stdcall
  • adding jumps to the code so the instructions are not executed linearly
  • and, most advanced – multilayer encryption, that is, generated code which in turn contains another layer of decryptor

This can all be taken quite a long way, especially making use of a library like AsmJit, which greatly simplifies the dynamic generation of assembly code, including the use of all current extensions of x86 and x64 processors.

References

About the Author

Bartosz Wójcik — the creator of , is interested in advanced reverse engineering topics, which he frequently discusses on his computer security blog www.secnews.pl. He also performs software security audits in terms of vulnerability to reverse engineering analysis and protection against cracking.