Partial register stall

From Ferrous Moon Research

This article was written by Tycho.
This document is classified ICARS::0, and is for public inspection.


Intel P6-family processors suffer from a performance penalty known as a partial register stall. It occurs when an instruction writes only part of a register and a later instruction reads the full register. For example, writing a value to the 16-bit AX register and then reading the full 32-bit EAX causes a partial stall, because AX is the low half of EAX. Partial stalls slow an application down because the execution engine has to reassemble the full register from its separately written parts, for instance by adding an extra micro-op, before the read can proceed.
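The merging described above can be pictured with a small software model. This is only a sketch of the architectural effect, not of how the hardware implements it; the function names are made up for illustration.

```c
#include <stdint.h>

/* Model of "mov al, value": only bits 0-7 of EAX change, so the old
 * upper 24 bits have to be merged with the new low byte. */
uint32_t write_al(uint32_t eax, uint8_t value)
{
    return (eax & 0xFFFFFF00u) | value;
}

/* Model of "mov ax, value": only bits 0-15 of EAX change. */
uint32_t write_ax(uint32_t eax, uint16_t value)
{
    return (eax & 0xFFFF0000u) | value;
}

/* Model of "movzx eax, value": the whole register is replaced, so a
 * later read of EAX has nothing to merge. */
uint32_t write_eax_zero_extended(uint8_t value)
{
    return (uint32_t)value;
}
```

In the first two functions the result depends on the old value of EAX, which is exactly the dependency that a full-register read has to wait for.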

Example

Take this simple C function (designed to test for a zero byte) for example:

int foo ( unsigned char _ch )
{
    return ( _ch == 0 ) ? 1 : -2;
}

Using the latest Microsoft Visual C++ 2005 compiler, the following code is generated:

 00000	8a 44 24 04	 mov	 al, BYTE PTR __ch$[esp-4]
 00004	f6 d8		 neg	 al
 00006	1b c0		 sbb	 eax, eax
 00008	83 e0 fd	 and	 eax, -3
 0000b	83 c0 01	 add	 eax, 1

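The only difference between the code generated by Visual C++ 2005 and Visual C++ 6.0 is that Visual C++ 6.0 uses 'inc eax' where Visual C++ 2005 uses 'add eax, 1'. This makes the Visual C++ 2005 code two bytes larger. A plausible reason is that INC does not update the carry flag, which Intel's optimization guidelines warn can create a dependency on earlier flag-setting instructions.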

And with the latest GNU C Compiler v4.1.1:

 cmp     BYTE PTR [esp+4], 1
 sbb     eax, eax
 and     eax, 3
 sub     eax, 2
 ret

Note that in the first two cases (both Visual C++ versions), the SBB instruction reads the full EAX register immediately after NEG has written only AL. This causes a partial register stall: the processor cannot supply the full value of EAX until the partial write has been merged with the rest of the register, so SBB is delayed for several cycles.

In the case of the GCC code, the SBB line may or may not stall depending on when the EAX register was last updated and whether all or part of it was updated.
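To see that both branchless sequences really compute the same value as the C function, the flag and register arithmetic can be traced in plain C. This is a sketch: the `trace_*` and `traces_match` names are made up for illustration, and the carry flag is modelled as an ordinary variable.

```c
#include <stdint.h>

/* Reference: the C function from the start of this section. */
int foo(unsigned char ch)
{
    return (ch == 0) ? 1 : -2;
}

/* Visual C++ sequence: neg al / sbb eax, eax / and eax, -3 / add eax, 1 */
int trace_msvc(unsigned char ch)
{
    uint32_t cf  = (ch != 0);      /* neg sets CF unless the operand was 0 */
    uint32_t eax = 0u - cf;        /* sbb eax, eax: 0 or 0xFFFFFFFF        */
    eax &= (uint32_t)-3;           /* and eax, -3:  0 or 0xFFFFFFFD (-3)   */
    eax += 1u;                     /* add eax, 1:   1 or 0xFFFFFFFE (-2)   */
    return (int32_t)eax;
}

/* GCC sequence: cmp byte, 1 / sbb eax, eax / and eax, 3 / sub eax, 2 */
int trace_gcc(unsigned char ch)
{
    uint32_t cf  = (ch < 1);       /* cmp borrows (CF = 1) only for ch == 0 */
    uint32_t eax = 0u - cf;        /* sbb eax, eax: 0 or 0xFFFFFFFF         */
    eax &= 3u;                     /* and eax, 3:   0 or 3                  */
    eax -= 2u;                     /* sub eax, 2:   0xFFFFFFFE (-2) or 1    */
    return (int32_t)eax;
}

/* Returns 1 when both traces agree with foo() for every possible byte. */
int traces_match(void)
{
    unsigned v;
    for (v = 0; v < 256; v++) {
        unsigned char ch = (unsigned char)v;
        if (trace_msvc(ch) != foo(ch) || trace_gcc(ch) != foo(ch))
            return 0;
    }
    return 1;
}
```

In both traces the result of SBB depends only on the carry flag, never on the previous contents of EAX. The processor still treats EAX as an input, though, which is why the state of EAX before the SBB matters for the stall.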

Avoiding

The safe way to avoid a partial stall is to write the full 32-bit register (for example with MOVZX) instead of one of its parts, so that a later read of the full register has nothing to merge. The following code, generated by the Intel C++ Compiler v9.1, is faster than the other compiled code on this page and avoids the partial register stall, at the cost of being roughly 1.5x as large:

 00000 0f b6 54 24 04   movzx edx, BYTE PTR [esp+4]
 00005 b9 01 00 00 00   mov ecx, 1
 0000a b8 fe ff ff ff   mov eax, -2
 0000f 83 fa 00         cmp edx, 0
 00012 0f 44 c1         cmove eax, ecx
 00015 c3               ret
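The Intel sequence can be traced in C the same way. Again this is only a sketch with a made-up function name. The zero flag and the conditional move are modelled with an ordinary variable and an if statement.

```c
#include <stdint.h>

/* Intel C++ sequence:
 *   movzx edx, byte / mov ecx, 1 / mov eax, -2 / cmp edx, 0 / cmove eax, ecx */
int trace_icc(unsigned char ch)
{
    uint32_t edx = ch;             /* movzx: all 32 bits of EDX are written */
    uint32_t ecx = 1u;             /* mov ecx, 1                            */
    uint32_t eax = (uint32_t)-2;   /* mov eax, -2                           */
    int zf = (edx == 0u);          /* cmp edx, 0 sets ZF when EDX is zero   */
    if (zf)
        eax = ecx;                 /* cmove eax, ecx: taken only when ZF=1  */
    return (int32_t)eax;
}
```

Every register that is read here was last written in full, so no instruction has to wait for a partial write to be merged.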

Comparison

It's very difficult to give a cycle-by-cycle comparison without a simulator or a kernel-level driver that can monitor the processor at the hardware level. It is possible, however, to show how fast these sequences are relative to one another. The basic code structure for such a benchmark is this (written using the CrissCross framework):

unsigned int i;
double start, finish, diff;
start = GetHighResTime();
start = GetHighResTime(); // Called twice to eliminate any potential slowdown caused by first-time calling.

for ( i = 0; i < 1000000; i++ )
    function_to_benchmark();

finish = GetHighResTime();
diff = finish - start;
console->WriteLine ( "%lf :: function_to_benchmark ()", diff );
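For readers without CrissCross, roughly the same structure can be sketched in standard C. This version swaps in `clock()` from `<time.h>` for GetHighResTime() and `printf` for console->WriteLine, so it measures CPU time at whatever resolution the C library offers. The `time_calls` and `report` helpers are hypothetical names, not part of any framework.

```c
#include <stdio.h>
#include <time.h>

/* The C function from the Example section, used here as the test subject. */
int foo(unsigned char ch)
{
    return (ch == 0) ? 1 : -2;
}

/* Hypothetical helper: run fn(arg) `iterations` times and return the
 * elapsed CPU time in seconds. */
double time_calls(int (*fn)(unsigned char), unsigned char arg,
                  unsigned long iterations)
{
    volatile int sink = 0;   /* keeps the compiler from discarding the calls */
    unsigned long i;
    clock_t start, finish;

    start = clock();
    for (i = 0; i < iterations; i++)
        sink = fn(arg);
    finish = clock();

    (void)sink;
    return (double)(finish - start) / CLOCKS_PER_SEC;
}

/* Hypothetical helper: print one line in the "seconds :: name" format. */
void report(const char *name, double seconds)
{
    printf("%f :: %s\n", seconds, name);
}
```

A call such as `report("foo ( 0 )", time_calls(foo, 0, 1000000));` would then print one result line. Calling through a function pointer adds its own overhead, so the numbers are only meaningful relative to one another.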

Setup Procedures

When doing this, we need to replace function_to_benchmark() with the appropriate test function.

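We write these functions using inline assembly so that the exact instruction sequences shown above are what gets measured, rather than whatever the compiler building the benchmark would generate on its own.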

For Visual C++ 2005 code:

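static inline int
partial_register_stall ( unsigned char _ch )
{
    __asm
    {
        mov al, _ch;    // partial write: only AL is updated
        neg al;         // CF = ( _ch != 0 )
        sbb eax, eax;   // reads full EAX after the AL writes: stall
        and eax, -3;    // 0 or -3
        add eax, 1;     // 1 or -2
    }
    // The result is left in EAX, which Visual C++ uses as the return value.
}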

For the GCC v4.1.1 code (note that this form does not compile with GCC itself, because GCC's inline assembler uses a different block format and defaults to AT&T syntax):

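static inline int
partial_register_stall ( unsigned char _ch )
{
    __asm
    {
        cmp _ch, 1;     // CF = ( _ch == 0 )
        sbb eax, eax;   // EAX = -CF; may stall if EAX was last written partially
        and eax, 3;     // 3 or 0
        sub eax, 2;     // 1 or -2
    }
    // The result is left in EAX, which Visual C++ uses as the return value.
}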
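Since the block above does not build with GCC, a port to GCC's extended inline assembly (AT&T syntax) might look like the following sketch. The function name, the operand constraints, and the plain C fallback for non-x86 targets are choices made for this illustration, not something taken from the original benchmark.

```c
/* Hypothetical GCC/Clang port of the "GCC v4.1.1" variant above. */
int partial_register_stall_gcc(unsigned char ch)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    int result;
    __asm__ (
        "cmpb $1, %1\n\t"   /* CF = 1 only when ch == 0 (unsigned borrow)  */
        "sbbl %0, %0\n\t"   /* result = -CF, whatever result held before   */
        "andl $3, %0\n\t"   /* 3 or 0                                      */
        "subl $2, %0"       /* 1 or -2                                     */
        : "=&r" (result)    /* output: any general-purpose register        */
        : "qm" (ch)         /* input: a byte register or a memory operand  */
        : "cc");            /* the flags are modified                      */
    return result;
#else
    /* Fallback so the sketch still builds on other compilers and targets. */
    return (ch == 0) ? 1 : -2;
#endif
}
```

Unlike the Visual C++ form, the compiler chooses the output register here, so the result is not necessarily produced in EAX.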

And finally, the Intel C++ Compiler v9.1 code:

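static inline int
partial_register_stall ( unsigned char _ch )
{
    __asm
    {
        movzx edx, _ch; // full 32-bit write: nothing to merge later
        mov ecx, 1;
        mov eax, -2;
        cmp edx, 0;     // ZF = ( _ch == 0 )
        cmove eax, ecx; // EAX = 1 if _ch == 0, otherwise it stays -2
    }
    // The result is left in EAX, which Visual C++ uses as the return value.
}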

Or, if you want to have an easy way to switch between them, you could write it like this:

static inline int
partial_register_stall ( unsigned char _ch )
{
    __asm
    {
#if 0
        // Visual C++ 2005
        mov al, _ch;
        neg al;
        sbb eax, eax;
        and eax, -3;
        add eax, 1;
#elif 1
        // Intel C++ Compiler v9.1
        movzx edx, _ch;
        mov ecx, 1;
        mov eax, -2;
        cmp edx, 0;
        cmove eax, ecx;
#elif 0
        // GCC v4.1.1
        cmp _ch, 1;
        sbb eax, eax;
        and eax, 3;
        sub eax, 2;
#endif
    }
}

Results

These results were measured on an Intel Core Duo T2300 (1.66 GHz) machine and show the best of three runs (units are in seconds):

Visual C++ 2005

0.003657 :: partial_register_stall ( 0 ) ( ret == 1 )
0.003608 :: partial_register_stall ( 1 ) ( ret == -2 )

GCC v4.1.1

0.002812 :: partial_register_stall ( 0 ) ( ret == 1 )
0.002795 :: partial_register_stall ( 1 ) ( ret == -2 )

Intel C++ Compiler v9.1

0.002469 :: partial_register_stall ( 0 ) ( ret == 1 )
0.002493 :: partial_register_stall ( 1 ) ( ret == -2 )

