When and how to use an assembler. Assembly programming basics


The basics of programming in assembly: the design of the processor, registers, memory, instructions, and the use of assembly language within C++ and Delphi.

1. Introduction to assembly

Assembly language, a low-level programming language which allows you to use all the features of a computer processor, is nowadays somewhat forgotten by “modern” developers.

The main reason for this is that writing in assembly is not the simplest of tasks, and is very time-consuming (testing code, finding bugs etc.).

However, in some situations assembly may be an ideal solution. An example is any kind of algorithm where speed is essential, such as in cryptographic (i.e. encryption) algorithms.

Despite incredible advancements in compilers in recent years, algorithms such as Blowfish, Rijndael and IDEA that are written in assembly and “manually” optimised show significant speed advantages over their counterparts written e.g. in C++ and compiled at the maximum optimisation level.

In addition to cryptography, assembly is also often used by game developers. The best example may be the game QUAKE 2. After the publication of its source code, it turned out that all the algorithms that require speed were written in assembly.

So let's get started. To be clear, I should add that in this article I will focus on assembly for x86 processors, and its use in a Windows environment.

2. Fundamentals of assembly

If you have never written in assembly, before you can even create the simplest program, you must first learn several fundamentals like the CPU registers, instructions, and the stack.

From the programmer's perspective, a standard processor (I will use the Intel Pentium MMX as an example, as it is all I've got :-) offers a wide range of instructions: 8-bit, 16-bit and 32-bit x86 instructions, as well as floating point and MMX instructions.

2.1. CPU registers

The processor has eight 32-bit general purpose registers and a flags register, as well as eight 80-bit coprocessor registers (st0 - st7) and an equal number of 64-bit MMX registers (mm0 - mm7). The processor also has several control registers, which we generally don't use.

What is a register? A register is like a memory cell inside the processor that temporarily stores data; we can move data between registers, and perform logical and arithmetic operations on them. The Pentium is a 32-bit processor, which means that each of the general purpose registers is 32 bits wide (corresponding to unsigned int in C). Every 32-bit register has a 16-bit lower half (a remnant of the 16-bit 8086 and 286 processors), and the 16-bit halves of registers EAX, EBX, ECX and EDX are each further divided into two 8-bit halves:

Register  16-bit half  8-bit halves  Description
EAX       AX           AH and AL     Accumulator
EBX       BX           BH and BL     Base
ECX       CX           CH and CL     Counter for string operations and loops
EDX       DX           DH and DL     Data
ESI       SI           n/a           Source register for string instructions
EDI       DI           n/a           Destination register for string instructions
EBP       BP           n/a           Pointer to data within the stack, used by functions to locate parameters saved on the stack
ESP       SP           n/a           Stack pointer
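
The way the smaller registers overlap the larger one can be modelled in portable C++ with masks and shifts. This is only an illustrative sketch: the struct Reg32 and its member functions are hypothetical names chosen for this example, not something the compiler or the processor provides.

```cpp
#include <cstdint>

// Hypothetical model of one 32-bit general purpose register such as EAX.
struct Reg32 {
    std::uint32_t value = 0;    // the full 32-bit register (EAX)

    // Read the sub-registers.
    std::uint16_t x() const { return static_cast<std::uint16_t>(value & 0xFFFFu); }      // AX
    std::uint8_t  h() const { return static_cast<std::uint8_t>((value >> 8) & 0xFFu); }  // AH
    std::uint8_t  l() const { return static_cast<std::uint8_t>(value & 0xFFu); }         // AL

    // Write the sub-registers; all bits outside the written part are kept.
    void set_x(std::uint16_t v) { value = (value & 0xFFFF0000u) | v; }
    void set_h(std::uint8_t v)  { value = (value & 0xFFFF00FFu) | (static_cast<std::uint32_t>(v) << 8); }
    void set_l(std::uint8_t v)  { value = (value & 0xFFFFFF00u) | v; }
};
```

Writing to AL, AH or AX changes only that part of EAX, which is why the model masks out the old bits before inserting the new ones.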

2.2. General purpose registers

When writing a program, or inline assembly code under Windows, you can use all the general purpose registers, but using the special registers ESP and EBP can interfere with the operation of the program. For example, if you reset the ESP register to zero within a function, the program will most likely crash later (e.g. if the program tries to return from the function).

2.3. The stack

The stack is an area of memory reserved for the needs of the program. These include passing parameters to functions (as 32-bit values), temporary data storage, and all local variables. When the program starts, the ESP register (stack pointer) points to the end of the stack. When data is stored on the stack, the ESP register is decremented, and the data is then stored in the memory location which ESP points to. To store data on the stack, the push instruction is used, for instance:

__asm {

    push    5                // store the number 5 (32 bit) on the stack
    push    eax              // save the contents of register EAX on the stack
    push    dword ptr[edx]   // save the contents of memory referenced by
                             // the EDX register

    sub    esp,4             // equivalent to 'push 5'
    mov    dword ptr[esp],5

    sub    esp,4             // equivalent to 'push eax'
    mov    dword ptr[esp],eax
}

To retrieve and remove a value from the stack, the pop instruction is used, which works in the opposite way to push. First the value is read from the address indicated by the ESP register, then the ESP register is incremented:

__asm {

    push    5                // store 4 32-bit values on the stack
    push    eax
    push    dword ptr[edx]
    push    13B0C032h

    pop    eax               // remove the most recent value from the stack,
                             // which in this case is the number 13B0C032h
    pop    dword ptr[edx]    // this operation does not change anything, since
                             // the value stored on the stack came from the
                             // location referenced by EDX and is simply being
                             // returned there

    pop    edx               // put the value originally held by EAX into EDX
    pop    ecx               // put the value 5 into register ECX

    push    5                // store the value 5 on the stack

                             // the following instructions simulate 'pop eax'
    mov    eax,dword ptr[esp]
    add    esp,4
}
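
The behaviour of push and pop described above can also be imitated in portable C++. The sketch below is a simplified model, not real stack access: StackModel, its 64-byte size and its member names are assumptions made for illustration, and it performs no overflow or underflow checks.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical model of a 32-bit stack that grows towards lower addresses.
struct StackModel {
    static constexpr std::uint32_t kSize = 64;  // stack size in bytes (arbitrary)
    unsigned char memory[kSize];
    std::uint32_t esp;                          // offset into 'memory'; plays the role of ESP

    // At start-up the stack pointer points to the end of the stack.
    StackModel() : esp(kSize) { std::memset(memory, 0, sizeof memory); }

    void push(std::uint32_t v) {
        esp -= 4;                               // sub esp,4
        std::memcpy(memory + esp, &v, 4);       // mov dword ptr[esp],v
    }

    std::uint32_t pop() {
        std::uint32_t v;
        std::memcpy(&v, memory + esp, 4);       // mov v,dword ptr[esp]
        esp += 4;                               // add esp,4
        return v;
    }
};
```

Values come back in the reverse of the order in which they were pushed, and after an equal number of pushes and pops the model's esp is back where it started.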

2.4. Limitations in Windows

If you have written assembly programs under MS-DOS, where there were no limitations, you will need to be aware that there are some differences under Windows. As I said earlier, in assembly we can use all the instructions that the CPU supports; however, some instructions are not permitted by the operating system, in our case Windows. For instance, if we use I/O port instructions, the compiler will not give an error, but the program will most likely crash if these instructions are executed under Windows.

Instructions which can cause the program to be terminated include the above-mentioned I/O port instructions, as well as instructions that refer to interrupts, segment registers and control registers.

Regarding the segment registers, Windows uses the flat memory model, which means that all code and data exists in the same memory space ranging from 0 up to 0xFFFFFFFF. So, when accessing memory there is no need to bother with segment registers. Unlike in MS-DOS, there is no need to use segment prefixes like DS:.

3. Using assembly language

To take advantage of the benefits of assembly, you must first check whether your development tools allow its use. Products such as Borland Delphi, Builder, Watcom C++ or Microsoft Visual C++ allow you to use (compile) assembly code; Visual Basic is the only popular RAD package which does not allow writing code in assembly. These products support the use of assembly code in two ways. The first is called inline assembly, where the assembly code is inserted into the regular code written in e.g. C++. The second method is linking modules (i.e. separate files) written in assembly with modules written e.g. in Delphi or C++.

3.1. Inline assembly

Before you start writing assembly code, you must check which syntax your tool expects, because there are two types of syntax for x86 assembly code. The first is called “Intel syntax”, and is used in products such as Delphi, Builder and MSVC, and in assemblers such as Borland TASM and Microsoft MASM. This syntax is the de facto standard on Windows and is used in the great majority of sources. The second is called “AT&T syntax”, and is used e.g. by C compilers such as GCC (Linux platform), DJGPP and LCC.

Inline assembly is the easiest way to write asm code. When writing assembly code in Delphi or Builder, it must be enclosed between the asm keyword marking the beginning of the assembly code, and the end; keyword after the code. For example:

// our first 'hello world' in assembly, Delphi version
asm                          // start of assembly code

    mov    eax,1             // move the value 0x00000001 into register EAX
                             // the C++ equivalent of this instruction is the
                             // assignment operator '=', e.g.
                             // x = 1;
                             // the Delphi equivalent is the assignment
                             // operator ':=', e.g.
                             // y := 1;

    mov    ecx,eax           // move the contents of register EAX into
                             // register ECX, that is, the value 0x00000001
                             // will end up in ECX

    shl    ecx,2             // this 'Shift Left' instruction will shift the
                             // contents of register ECX to the left by 2 bits
                             // As you may know, left shifting serves to
                             // multiply values by successive powers of 2
                             // Shifting 0x00000001 to the left by two bits
                             // will result in the value 0x00000001 * 4 = 0x00000004
                             // saved to ECX
                             // in C++, bit shifts are achieved with the '<<'
                             // operator, e.g.
                             // x = y << 2;
                             // in Delphi, bit shifts use the same keywords as
                             // assembly code, namely 'shl' or 'shr', e.g.
                             // x := y shl 2;

    shr    eax,1             // this 'Shift Right' instruction will shift the
                             // EAX register to the right by 1 bit

    and    eax,0             // 'And' is a logical multiplication of bits
                             // according to the following table:
                             // 0 * 0 = 0
                             // 1 * 0 = 0
                             // 0 * 1 = 0
                             // 1 * 1 = 1
                             // Any value multiplied by 0 will give 0; in this
                             // case, the EAX register will be zeroed out
                             // The C++ equivalent of this instruction is
                             // the '&' operator, e.g.
                             // x = y & 0;
                             // in Delphi:
                             // x := y and 0;


    or    eax,0FFFFFFFFh     // 'Or' is a logical sum of bits according
                             // to the following table:
                             // 0 + 0 = 0
                             // 1 + 0 = 1
                             // 0 + 1 = 1
                             // 1 + 1 = 1
                             // in this case EAX will be ORed with the value
                             // 0xFFFFFFFF, which will result in the value
                             // 0xFFFFFFFF no matter what EAX contains
                             // The C++ equivalent of this operation is the
                             // '|' operator, e.g.
                             // x = y | 0xFFFFFFFF;
                             // in Delphi:
                             // x := y or $FFFFFFFF;

    sub    edx,edx           // 'Subtract' subtracts the value of one register
                             // from another. In this case, EDX will become zero
                             // The C++ equivalent is '-', e.g.
                             // x = x - x;

    xor    eax,eax           // 'eXclusive Or' follows this table:
                             // 0 ^ 0 = 0
                             // 1 ^ 0 = 1
                             // 0 ^ 1 = 1
                             // 1 ^ 1 = 0
                             // This function yields 1 when its two inputs are
                             // different; if they are the same it will give 0
                             // Hence the instruction 'xor eax,eax' will zero
                             // out the EAX register
                             // The C++ equivalent is the '^' operator, e.g.
                             // x = x ^ y;
                             // in Delphi:
                             // x := x xor y;

end;                         // end of assembly code
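
The results claimed in the comments above can be checked with the C++ operators they mention. This is a small sketch for experimentation; the function names are made up for this example.

```cpp
#include <cstdint>

// Each function mirrors one instruction from the listing above.
std::uint32_t shl_by(std::uint32_t v, unsigned n) { return v << n; }  // shl: multiply by 2^n (n < 32)
std::uint32_t shr_by(std::uint32_t v, unsigned n) { return v >> n; }  // shr: unsigned divide by 2^n (n < 32)
std::uint32_t and_zero(std::uint32_t v) { return v & 0u; }            // and eax,0           -> always 0
std::uint32_t or_ones(std::uint32_t v)  { return v | 0xFFFFFFFFu; }   // or  eax,0FFFFFFFFh  -> always all ones
std::uint32_t sub_self(std::uint32_t v) { return v - v; }             // sub edx,edx         -> always 0
std::uint32_t xor_self(std::uint32_t v) { return v ^ v; }             // xor eax,eax         -> always 0
```

For instance, shl_by(1, 2) yields 4, matching the 'shl ecx,2' step, while xor_self yields zero for any input, which is why 'xor eax,eax' is a common way to clear a register.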

Writing inline assembly in MSVC only really differs in how the assembly code is introduced to the compiler:

// our second 'hello world' in assembly
__asm {                      // start of assembly code

    push    5                // save the value 0x00000005 on the stack
    pop    eax               // remove 0x00000005 from the stack and write
                             // it to register EAX

    push    eax              // save the contents of register EAX on the stack
                             // (in this case the value 5)
    pop    edx               // remove the value 5 from the stack and write it
                             // to register EDX

    mov    ax,0FFFFh         // write the value 0FFFFh to the 16-bit lower
                             // half of register EAX
    mov    dx,ax             // write the value from register AX to the 16-bit
                             // lower half of register EDX
    mov    al,11             // write the value 11 (decimal) to the 8-bit
                             // lower half of register AX
    mov    ah,11h            // write the value 11 (hex) to the 8-bit upper
                             // half of register AX, which is 17 in decimal
}                            // end of assembly code

3.2. Using variables in assembly

Writing in assembly, you have access to all global variables, and if the code is in a procedure, it also has access to the local variables and parameters of the procedure/function, so its capabilities are practically the same as normal code. An example of the use of global and local variables:

// global variables
var
    ByteVar: Byte;           // byte - 8 bits
    WordVar: Word;           // word - 16 bits
    IntVar: Integer;         // double-word - 32 bits
  ...

procedure noop;

// local variables of function 'noop'
var
    LocalByte: Byte;
    LocalWord: Word;
    LocalInt: Integer;

begin

    // initialise global variables
    ByteVar := $FF;          // 8-bit value
    WordVar := $FFFF;        // 16-bit value
    IntVar  := Integer($FFFFFFFF); // 32-bit value (cast: Integer is signed)

    asm
        mov    al,ByteVar    // write an 8-bit value to an 8-bit register
        mov    LocalByte,al  // write an 8-bit value to a local variable

        mov    ax,WordVar    // 16-bit value to 16-bit register
        mov    LocalWord,ax

        mov    eax,IntVar    // 32-bit value to 32-bit register
        mov    LocalInt,eax
    end;

end;

The example for MSVC is not much different from that of Delphi:

// global variables
char ByteVar;
short WordVar;
int IntVar;
...

void noop()
{
    // local variables
    char LocalByte;
    short LocalWord;
    int LocalInt;

    // initialise global variables
    ByteVar = 0xFF;          // 8-bit value
    WordVar = 0xFFFF;        // 16-bit value
    IntVar  = 0xFFFFFFFF;    // 32-bit value

    __asm {

        mov    al,ByteVar    // write an 8-bit value to an 8-bit register
        mov    LocalByte,al  // write an 8-bit value to a local variable

        mov    ax,WordVar    // 16-bit value to 16-bit register
        mov    LocalWord,ax

        mov    eax,IntVar    // 32-bit value to 32-bit register
        mov    LocalInt,eax
    }

}

You can write entire functions in assembly language. When doing this, there are a few things to keep in mind. If the function returns a value, we must ensure that the returned value is stored in the EAX register before leaving the function. A simple example:

// Delphi version
function add(x, y:integer):integer;
asm
    // with Delphi's default 'register' convention x arrives in EAX and y
    // in EDX, so y must be copied out of EDX before EDX is overwritten
    mov    ecx,y             // copy the function's second parameter to ECX
    mov    edx,x             // copy the function's first parameter to EDX
    add    edx,ecx           // add x and y together
    mov    eax,edx           // write the result to register EAX
                             // this becomes the function's return value
end;

// C++ version
int mult(int x,int y)
{
    __asm {

        mov    edx,x         // copy the function's first parameter to EDX
        mov    ecx,y         // copy the function's second parameter to ECX
        imul   edx,ecx       // multiply x by y
        mov    eax,edx       // write the result to register EAX
                             // this becomes the function's return value
    }

}

We already know that functions written in assembly must place the return value in the EAX register, but what about the other registers?

In short, registers EAX, EDX, and ECX may contain any value when the function exits, but registers EDI, ESI, EBX, and EBP generally must not change (their values must be the same as they were before the call). You may wonder why this is the case. Well, the code produced by HLL (high-level language) compilers uses this second group of registers throughout the program to hold e.g. addresses of functions, constants, etc., and if they are changed by a function, code that runs later may use invalid values, which can cause anything from data corruption to a crash. It is easy to prevent such errors:

// Delphi version
function count(w,x,y,z:integer):integer;
asm

    push    edi              // save the contents of registers EDI, ESI and EBX
    push    esi              // on the stack
    push    ebx

    mov    edi,w             // copy each function parameter to a register
    mov    esi,x
    mov    edx,y
    mov    ebx,z

    add    edi,esi           // w + x
    add    edx,ebx           // y + z

    imul    edi,edx          // (w+x) * (y+z)

    xchg    eax,edi          // 'eXCHanGe' swaps the contents of two registers
                             // in this case EAX and EDI, in other words,
                             // the old value of EAX is now in EDI, and the
                             // old value of EDI is now in EAX, which becomes
                             // the function's return value

    pop    ebx               // Remove the saved values of the registers from
    pop    esi               // the stack, and put them back in the registers
    pop    edi               // We must remove the values in reverse order -
                             // looking at the code we can see that it is
                             // 'symmetrical'. If the values were saved in the
                             // order EDI, ESI, EBX, then they must be removed
                             // in the order EBX, ESI, EDI
end;

In addition to the registers EDI, ESI, EBX, and EBP, the status flag DF (Direction Flag) is expected to be zero (cleared) before and after any call. Just use the CLD instruction if its status is changed within the function.

When writing code in assembly that uses the stack, special attention should be paid to ensuring that the stack pointer ESP is always restored. E.g. if the procedure or function stores something on the stack, then this item must be removed before exiting the function. This time we'll look at an example in MSVC:

// example of an encryption function
void crypt(unsigned char *string)
{
__asm {

    push    edx              // save the contents of register EDX on the stack
    mov    edx,string        // grab the parameter from the stack; in this case
                             // a pointer to the string we must encrypt

    cmp    edx,0            // check whether the parameter is valid
    je    _exit_encrypt     // if invalid, exit the function

_encrypt_loop:

    mov    al,byte ptr[edx] // load the next byte of the string
    cmp    al,0             // check for the end of the string
                            // strings are represented as ASCII; byte 00h
                            // means end-of-string

    je    _exit_encrypt     // once we reach the end of the string, exit

    xor    al,7             // encrypt the byte with a simple xor
    mov    byte ptr[edx],al // store the encrypted byte in the string
    inc    edx              // set the string pointer to point to the next byte

    jmp    _encrypt_loop    // go to the start of the loop so that the
                            // process repeats

_exit_encrypt:

    pop    edx              // IMPORTANT: correct the stack, and restore the
                            // register EDX to its original value
}
}
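
For comparison, here is what the same routine might look like in plain C++, which is handy for checking the assembly version against a known-good result. The name crypt_portable is an assumption made for this sketch.

```cpp
// Portable counterpart of the assembly routine above: XOR every byte of a
// zero-terminated string with the key 7, in place.
void crypt_portable(unsigned char *text)
{
    if (text == nullptr)            // cmp edx,0 / je _exit_encrypt
        return;

    for (; *text != 0; ++text)      // stop at the terminating 00h byte
        *text ^= 7;                 // xor al,7, then store the byte back
}
```

Because XOR is its own inverse, calling the function a second time on the same buffer decrypts it. One caveat applies to both versions: a plaintext byte equal to 7 becomes 00h after encryption, so the second pass would stop early at that position.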

3.3. Calling functions from assembly

Sometimes in assembly code you will need to call a function written in another language. How is this done? Very simply, a function is called with the instruction call func_name. It is worth noting that there are several ways to call and “clean up” after a function:

cdecl
    Name in C code:      __cdecl
    Parameters:          passed on the stack; the parameters are not removed by the function
    Return values:       eax; 8-byte values in edx:eax
    Modified registers:  eax, ecx, edx, st(0)-st(7), mm0-mm7, xmm0-xmm7
    Info:                This is the method of calling C library functions, introduced by Microsoft. All system functions on the Linux platform also use this convention

fastcall
    Name in C code:      __fastcall
    Parameters:          ecx, edx; any remaining parameters are passed on the stack
    Return values:       eax; 8-byte values in edx:eax
    Modified registers:  eax, ecx, edx, st(0)-st(7), mm0-mm7, xmm0-xmm7
    Info:                Microsoft introduced this standard, but later switched to the cdecl convention in its products

watcom
    Name in C code:      __declspec (wcall)
    Parameters:          eax, ebx, ecx, edx
    Return values:       eax; 8-byte values in edx:eax
    Modified registers:  eax
    Info:                This function calling convention was introduced by Watcom in their C++ compiler

stdcall
    Name in C code:      __stdcall
    Parameters:          passed on the stack; parameters are removed by the function
    Return values:       eax; 8-byte values in edx:eax
    Modified registers:  eax, ecx, edx, st(0)-st(7), mm0-mm7, xmm0-xmm7
    Info:                The default calling convention for Windows API functions in DLLs

register
    Name in C code:      n/a
    Parameters:          eax, edx, ecx; any remaining parameters are passed on the stack
    Return values:       eax
    Modified registers:  eax, ecx, edx, st(0)-st(7), mm0-mm7, xmm0-xmm7
    Info:                This is the calling convention used in Borland's Delphi

The correct calling convention for functions in our own programs (as opposed to WinApi) often depends on the options with which the program was compiled. In Delphi the default convention is “register”, while for most programs written in C, the default is “cdecl”.

WinApi functions (Windows system functions) use the stdcall convention, where function parameters are first stored on the stack, and then the function is called. After the function returns, there is no need to adjust the stack (remove the previously saved parameters), since the called function does it for us. Interestingly, a few WinApi functions do not use the stdcall convention, but instead use cdecl; that is, the parameters are stored on the stack, then the function is called, but afterwards the caller must clean up the stack manually. An example of such a function is wsprintfA from the Windows system library user32.dll (whose counterpart in the C standard library is sprintf). The cdecl convention was probably chosen because these functions do not have a fixed number of parameters:

// global string
unsigned char title[] = "The values of x and y";
...

// this function changes the values x and y into ASCII form, after which
// a message box is displayed showing x and y in their string form
unsigned int int2str(unsigned char *buffer, unsigned int x, unsigned int y)
{
    // local string, accessible only by the function int2str
    unsigned char format[] = "x = %lu\ny = 0x%X\n";

    __asm {

        // Note the way in which the parameters of the function are passed.
        // In C++, the function call would look like this:
        // wsprintf(buffer, "x = %lu\ny = 0x%X\n", x, y);
        // In assembly the parameters are pushed onto the stack in reverse
        // order, after which the function is called.

        push    y            // save y on the stack
        push    x            // save x on the stack
        lea     eax,format   // load the address of the local string into EAX
        push    eax          // save the address of this string on the stack
        push    buffer       // save the pointer to the output buffer, where
                             // the formatted text will end up
        call    wsprintfA    // call this WinApi function
        add    esp,4*4       // clean up the stack - 4*4 = 16 bytes. This is
                             // how much space was taken by the parameters
                             // saved on the stack before the function was called
                             // When writing code e.g. in C++, the compiler
                             // takes care of this for you, but in assembly you
                             // must do this yourself

        push    MB_ICONINFORMATION // specifies the icon that will appear
                             // next to the text in the message box
        push    offset title // the window title (a global variable); we use
                             // the keyword 'offset' because we want to write
                             // the address of the string to the stack
        push    buffer       // the text which will appear in the message box
        push    0            // handle of the parent window
        call    MessageBoxA  // show the message box

    }
}
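
For comparison, the same formatting can be done in portable C++, where the compiler generates the pushes and the stack clean-up itself. This is a minimal sketch: format_xy is a hypothetical helper, and std::snprintf from the C standard library stands in for wsprintfA so that the example does not depend on Windows.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: formats x in decimal and y in hexadecimal, using the
// same format string as int2str above.
std::string format_xy(unsigned int x, unsigned int y)
{
    char buffer[64];    // large enough for two 32-bit values plus the fixed text

    // snprintf is variadic, just like wsprintfA: the caller decides how many
    // arguments are passed, so only the caller knows how much to clean up.
    std::snprintf(buffer, sizeof buffer, "x = %lu\ny = 0x%X\n",
                  static_cast<unsigned long>(x), y);
    return std::string(buffer);
}
```

Calling format_xy(10, 255) produces the text "x = 10", a line break, "y = 0xFF" and a final line break.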

4. MMX instructions

MMX is the name of an extension to the Pentium series of processors, introduced by Intel. The name is said to be an abbreviation of “MultiMedia eXtensions”, but Intel denies this, and has never explained the issue. The MMX extension to the Pentium line of processors includes a set of new instructions (57, to be exact), and 8 additional 64-bit registers.

MMX registers are shared with the FPU registers. This means that you cannot mix FPU (Floating Point Unit) instructions with MMX instructions; otherwise the contents of the registers will be corrupted. MMX instructions operate on data in SIMD fashion (Single Instruction, Multiple Data). This means that one operation can be performed simultaneously on many data items, which is not possible using standard x86 instructions.

MMX instructions are ideal for processing multimedia data, e.g. video, graphics, sound. For example, programs such as DivX or Winamp make intensive use of MMX code. Currently, most processors produced by Intel, AMD and Cyrix possess MMX support.

Although MMX has for quite a few years been practically standard, HLL compilers generally do not generate MMX code (except specialised compilers like VectorC). It seems that the natural solution is to program MMX in assembly.

Writing procedures using MMX can sometimes get a 100% speed increase compared to the original code. This is possible because of the aforementioned SIMD mode. Imagine a situation where we have two tables of 8 bytes, and we want to add corresponding bytes from both tables to each other. In C++ we would do it this way:


unsigned char table1[] = { 0x0A,0x1A,0x2A,0x3A,0x4A,0x5A,0x6A,0x7A };
unsigned char table2[] = { 0xA7,0xA6,0xA5,0xA4,0xA3,0xA2,0xA1,0xA0 };
...

for (int i = 0; i < 8; i++)
{
    table1[i] += table2[i];
}

There's no problem with this, but the operation of adding bytes will be repeated 8 times. Let's look at how this can be done much more efficiently by using MMX:

__asm {

    movq    mm0,qword ptr[table1]    // load 8 bytes from the first table
                                     // into register MM0

    movq    mm1,qword ptr[table2]    // 8 bytes from the second table into MM1
    paddb   mm0,mm1                  // add the bytes from MM1 to MM0
    movq    qword ptr[table1],mm0    // write the result back to table1
}
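
To see what paddb computes without an MMX-capable toolchain, its result can be imitated in portable C++ on a 64-bit integer. This is only a model of the instruction's outcome (eight independent byte additions that wrap around on overflow), not a description of how the processor implements it; paddb_model and add_tables are hypothetical names.

```cpp
#include <cstdint>
#include <cstring>

// Model of 'paddb': add eight packed bytes, each wrapping around modulo 256.
std::uint64_t paddb_model(std::uint64_t a, std::uint64_t b)
{
    const std::uint64_t low7 = 0x7F7F7F7F7F7F7F7FULL;  // lower 7 bits of each byte
    const std::uint64_t top  = 0x8080808080808080ULL;  // top bit of each byte

    // Add the lower 7 bits of every byte (no carry can cross into the
    // neighbouring byte), then fix up the top bits with XOR.
    return ((a & low7) + (b & low7)) ^ ((a ^ b) & top);
}

// Adds table2 to table1, eight bytes at a time, like the MMX listing above.
void add_tables(unsigned char *table1, const unsigned char *table2)
{
    std::uint64_t a, b;
    std::memcpy(&a, table1, 8);     // movq mm0,qword ptr[table1]
    std::memcpy(&b, table2, 8);     // movq mm1,qword ptr[table2]
    a = paddb_model(a, b);          // paddb mm0,mm1
    std::memcpy(table1, &a, 8);     // movq qword ptr[table1],mm0
}
```

With the two tables defined earlier, the last two sums exceed 255 and wrap around: 0x6A + 0xA1 gives 0x0B and 0x7A + 0xA0 gives 0x1A, exactly as the byte-by-byte C++ loop would produce.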

In total, just one addition instruction (paddb) is executed instead of 8 separate additions. Neat, isn't it? And more importantly, efficient. Here are a few examples of graphical functions:

#define IMG_WIDTH 640
#define IMG_HEIGHT 320

...

//
// this function initialises the MMX unit
// it should be called:
// - before using the MMX unit for the first time
// - after using MMX when we intend to make use of the FPU
// - after using the FPU when we intend to make use of MMX
//
void InitMMX()
{
    __asm emms;                      // Empty MultiMedia State;
}                                    // initialises the MMX unit

//
// a fadeout effect of the screen (fullscreen)
//
void fadeout(DWORD *lpScreen,DWORD iRounds)
{
    __asm {

        mov          edx,iRounds     // load the total number of repetitions

        mov          eax,03030303h   // mask for each component of a pixel;
                                     // reducing the value of each RGB
                                     // component gives the impression of a
                                     // fading image

        movd         mm0,eax         // transfer the mask to the lower half
                                     // of register MM0

        punpckldq    mm0,mm0         // copy the mask to the upper half of MM0
                                     // such that its full value becomes
                                     // 0x0303030303030303
                                     // (recall that MM0 is a 64-bit register)

        pxor         mm1,mm1         // zero out register MM1

    _fadeout_max:

                                     // accumulate the mask in MM1 once per
                                     // round, so that MM1 ends up holding the
                                     // mask multiplied by the number of rounds
        paddb        mm1,mm0
        dec          edx             // count down the rounds
        jne          _fadeout_max    // repeat until EDX reaches zero

        mov          eax,lpScreen    // load the pointer to the image buffer
                                     // into register EAX

                                     // the number of pixels divided by 2
                                     // we divide by 2 because by using MMX we
                                     // can process 2 pixels simultaneously
                                     // (MM1 is an 8-byte register, but each
                                     // pixel is only 4 bytes)
        mov          ecx,(IMG_WIDTH*IMG_HEIGHT) / 2

    _clear_screen_2_mmx:
                                     // load 2 pixels from the image buffer
                                     // into MM0
        movq         mm0,qword ptr[eax]
        psubusb      mm0,mm1         // subtract our mask from all components
                                     // (bytes) of those 2 pixels
                                     // Both the mask and the pixels are
                                     // treated as tables of 8 separate bytes
                                     // SIMD-style

                                     // write the 2 modified pixels back to
                                     // the image buffer
        movq         qword ptr[eax],mm0

        add          eax,8           // update the pointer to the image buffer,
                                     // ready for the next 2 pixels

        dec          ecx             // reduce the loop counter (the loop will
                                     // repeat for the number of pixels / 2)
        jne          _clear_screen_2_mmx

    }
}
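
The key instruction in fadeout is psubusb, which subtracts unsigned bytes with saturation. A scalar C++ sketch of that behaviour, using hypothetical helper names and processing one pixel at a time rather than two, might look like this:

```cpp
#include <cstdint>

// Scalar model of 'psubusb' for a single byte: subtract, but clamp the
// result at 0 instead of wrapping around.
std::uint8_t sub_saturate_u8(std::uint8_t a, std::uint8_t b)
{
    return a > b ? static_cast<std::uint8_t>(a - b) : std::uint8_t(0);
}

// Fade one 32-bit pixel by 'step' per component (hypothetical helper).
std::uint32_t fade_pixel(std::uint32_t pixel, std::uint8_t step)
{
    std::uint32_t result = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        // Extract one 8-bit component, darken it, and put it back in place.
        std::uint8_t component = static_cast<std::uint8_t>(pixel >> shift);
        result |= static_cast<std::uint32_t>(sub_saturate_u8(component, step)) << shift;
    }
    return result;
}
```

The clamping is what lets the image settle at black: with an ordinary wrapping subtraction, a component of 2 reduced by 3 would jump to 255 instead of stopping at 0.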

//
// image negative effect
//
void negative(DWORD *lpScreen)
{
    __asm {

        mov          eax,lpScreen    // load the pointer to the image buffer
                                     // into EAX

                                     // write the pixel count / 4 into ECX,
                                     // since we will process 4 pixels at once
        mov          ecx,(IMG_WIDTH*IMG_HEIGHT) / 4

        pcmpeqb      mm7,mm7         // set register MM7 to 0xFFFFFFFFFFFFFFFF

    _neg_mmx:
                                     // load 2 pixels from the image to MM0
        movq         mm0,qword ptr[eax]
        pxor         mm0,mm7         // XOR-ing with all 1s works like the
                                     // logical 'NOT' function
        movq         qword ptr[eax],mm0

                                     // repeat with the next 2 pixels
        movq         mm0,qword ptr[eax+8]
        pxor         mm0,mm7
        movq         qword ptr[eax+8],mm0

        add          eax,16          // update the pointer to the image
        dec          ecx             // and the loop counter
        jne          _neg_mmx

    }
}

//
// image blur effect
//
void blur(DWORD *lpScreen)
{
    __asm {

        push         esi             // save registers ESI and EDI
        push         edi

        mov          esi,lpScreen    // load the pointer to the image buffer
                                     // into ESI

        mov          ecx,( (IMG_WIDTH*IMG_HEIGHT) - (IMG_WIDTH*8) + 4 )
        mov          eax,IMG_WIDTH*4 // the width of a line in the image
        mov          edx,IMG_WIDTH*8 // the width of two lines

        lea          esi,[esi+eax+4] // set the pointer to the first pixel
                                     // of the second line of the image

        pxor         mm7,mm7         // zero out MM7
        movd         mm0,[esi-4]     // read pixel to the left into MM0

    _blur_more:

        movd         mm1,[esi+4]     // read pixel to the right into MM1

        mov          edx,esi
        sub          edx,eax
        movd         mm2,[edx]       // read pixel above into MM2

        movd         mm3,[esi+eax]   // read pixel below into MM3
        punpcklbw    mm0,mm7         // unpack the components of the 4
        punpcklbw    mm1,mm7         // neighbouring pixels into WORDs
        punpcklbw    mm2,mm7
        punpcklbw    mm3,mm7
        paddusw      mm0,mm1         // add the components of the 4 pixels
        paddusw      mm0,mm2
        paddusw      mm0,mm3
        psrlw        mm0,2           // divide this sum by 4, in this way
                                     // we find the 'average' of the 4 pixels
        packuswb     mm0,mm7         // pack the components (each of which is
                                     // a WORD) back into a single DWORD
        movd         [esi],mm0       // write the pixel to the image buffer
        add          esi,4

        dec          ecx
        jne          _blur_more

        pop          edi
        pop          esi

    }
}
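
When checking routines like the ones above, it can help to have a plain C++ version to compare against. The following is only a minimal sketch and not part of the original listing: the names fade_scalar, negative_scalar, average4 and blur_scalar are hypothetical, std::uint32_t stands in for DWORD, and the loop bounds in blur_scalar are simplified rather than copied from the assembly.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// All names below are hypothetical helpers, not part of the original listing.
// A pixel is a 32-bit value holding four 8-bit components.

// Scalar counterpart of PSUBUSB: subtract each byte of `mask` from the
// matching byte of every pixel, clamping at 0 instead of wrapping around.
void fade_scalar(std::uint32_t* px, std::size_t count, std::uint32_t mask)
{
    for (std::size_t i = 0; i < count; ++i) {
        std::uint32_t out = 0;
        for (int shift = 0; shift < 32; shift += 8) {
            const std::uint32_t c = (px[i] >> shift) & 0xFFu;
            const std::uint32_t m = (mask >> shift) & 0xFFu;
            out |= (c > m ? c - m : 0u) << shift;
        }
        px[i] = out;
    }
}

// Scalar counterpart of PXOR with all ones: invert every bit of every pixel.
void negative_scalar(std::uint32_t* px, std::size_t count)
{
    for (std::size_t i = 0; i < count; ++i)
        px[i] = ~px[i];
}

// Per-component average of four pixels: widen each byte, add, shift right
// by 2 (the job done by PUNPCKLBW / PADDUSW / PSRLW / PACKUSWB above).
std::uint32_t average4(std::uint32_t a, std::uint32_t b,
                       std::uint32_t c, std::uint32_t d)
{
    std::uint32_t out = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        const std::uint32_t sum =
            ((a >> shift) & 0xFFu) + ((b >> shift) & 0xFFu) +
            ((c >> shift) & 0xFFu) + ((d >> shift) & 0xFFu);
        out |= (sum >> 2) << shift;  // sum <= 1020, so sum >> 2 fits in a byte
    }
    return out;
}

// In-place blur over a linear run of pixels, starting at the second pixel of
// the second line. Because the image is modified in place, the left and upper
// neighbours have already been blurred when they are read, as in the MMX loop.
// Assumption: the run ends where the pixel below would leave the buffer.
void blur_scalar(std::uint32_t* px, std::size_t width, std::size_t height)
{
    assert(width >= 3 && height >= 3);
    const std::size_t total = width * height;
    for (std::size_t i = width + 1; i + width < total; ++i)
        px[i] = average4(px[i - 1], px[i + 1], px[i - width], px[i + width]);
}
```

Running an MMX routine and its scalar counterpart on two copies of the same buffer and comparing the results is a simple way to catch mistakes in the assembly. For the blur, only the range of pixels that both versions actually process should be compared.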

5. When to use assembly

As I mentioned at the beginning of the article, assembly is used mainly where speed is important. When writing an algorithm, it is worth stopping to ask whether the program could be sped up by using, say, MMX at a few critical points (inner loops, for instance).

Imagine that you have just written an MP3 encoder and a competitor has done the same, but yours uses hand-written MMX code that makes it three times faster than the competition. Which product will users choose when one completes a task in 10 minutes and the other needs 30? The answer is obvious.

Besides being ideal for speed-critical algorithms, assembly is also used to write specialised programs such as EXE compressors. Most people will probably think of UPX or ASPack, which compress executables. Put simply, a program that occupies, say, 700 kB shrinks to approx. 300 kB when compressed with UPX, yet it is still an EXE file and works exactly as it did before compression. This is achieved by writing a loader in assembly: a fragment of code stored in the EXE file (almost like a virus) that, when the program starts, decompresses the remainder of the EXE file and allows it to run. Writing such a loader in a high-level language, whether C++, Delphi or even PowerBASIC, is virtually impossible.

It can be said that assembly programming is only useful for speed and unusual applications, but this is not entirely true. Writing in assembly language can be more than just inline routines and a few procedures here and there. Entire programs can be written in assembly language! Sometimes I hear people say that it is impossible; that you can't write large applications in assembly from scratch. Often these are people who have only dabbled in assembly for a few hours. If you are a competent programmer, there is nothing stopping you from building professional applications in assembly language. Writing programs in assembly gives us full control over them. Everything is up to us, the program is executed according to our will, and we are not at the mercy of the compiler.

These days, writing in assembly is reasonably simple and convenient. A lot of people around the world are beginning to see the magic of this language. Many projects are being created, and you can find plenty of tutorials and sample source code, thanks to which many challenges have ceased to be problems. Writing entire applications in assembly also has an advantage in size: a project with 5 MB of source code will build to an executable of approximately 90 kB. Compare an application written in Delphi 6 containing one window, which takes approx. 300 kB when compiled, with an assembly program that does exactly the same thing, works on every Windows release from 95 to XP, and has an executable of just 4 kB. Why the big difference? It's simple: the compiler adds a lot of unnecessary things, “just in case”. Why isn't this made more efficient? We should ask the companies who make compilers.

Despite the fact that assembly can be used for many useful things, it is also used to write malicious programs, such as viruses, ransomware, or exploits, but in the words of Winnie the Pooh, that is a story for another day...

6. Summary

These examples represent only a small part of what is possible with assembly. There is a lot to discover, just as much for me as for you, because contrary to what people say, assembly is not dead. It is constantly changing and evolving, giving us possibilities that do not exist in any high-level language. The terms we hear in the press (SSE, SSE2, 3DNow!) are not fiction. Everything is out there. We just have to reach for it.

For my part, writing assembly language gives me a feeling of freedom, which I never found when writing in any other language. I hope that your journey into assembly doesn't end with this article!

7. References

www.win32asm.cjb.net - a page for assembly programmers: sources, tutorials, forums
www.int80h.org - FreeBSD assembly programming
www.rbthomas.freeserve.co.uk - programming Windows graphics, algorithms, fractals
www.chrisdragan.org - Chris Dragan's page, many samples in assembler (MMX)
www.azillionmonkeys.com/qed/index.html - excellent articles about low-level code optimization (MMX, Pentium)
asmjournal.freeservers.com - Assembly Programming Journal, a programming magazine covering assembly language, C library code optimization, assembly programming for Unix shells, game programming in assembly with DirectX and many other interesting resources
www.nasm.us - the official page of the free NASM assembler (Windows, Unix)
www.borland.com/Products/Software-Testing/Automated-Testing/Devpartner-Studio - SoftICE, a debugger that lets you analyze any application at both high and low level