Partial register stall
From Ferrous Moon Research
| This article was written by Tycho. |
| This document is classified ICARS::0, and is for public inspection. |
Intel P6-family processors suffer from a performance penalty referred to as a partial register stall. A partial register stall occurs when an instruction writes part of a register and a later instruction reads the full register. For example, writing a value to the 16-bit AX register and later reading the full 32-bit EAX causes a partial stall, because AX is the low half of EAX. Partial stalls slow an application down because the execution engine is forced to add a micro-op that assembles the different parts of the register before the full read can proceed.
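The assembling step can be pictured in plain C. The sketch below models only the architectural result of a partial write, not its timing; the helper names write_ax and write_al are hypothetical and exist purely for illustration.

```c
#include <stdint.h>

/* Models a write to AX: only the low 16 bits of EAX change,
   so the old upper half has to be carried over. */
static uint32_t write_ax(uint32_t eax, uint16_t ax)
{
    return (eax & 0xFFFF0000u) | ax;
}

/* Models a write to AL: only the low 8 bits of EAX change. */
static uint32_t write_al(uint32_t eax, uint8_t al)
{
    return (eax & 0xFFFFFF00u) | al;
}
```

A later read of the full EAX needs the value these helpers return, which is why the processor cannot treat the partial write as an independent new register.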
Example
Take this simple C function (designed to test for a zero byte) for example:
int foo ( unsigned char _ch )
{
    return ( _ch == 0 ) ? 1 : -2;
}
Using the latest Microsoft Visual C++ 2005 compiler, the following code is generated:
00000 8a 44 24 04    mov al, BYTE PTR __ch$[esp-4]
00004 f6 d8          neg al
00006 1b c0          sbb eax, eax
00008 83 e0 fd       and eax, -3
0000b 83 c0 01       add eax, 1
Visual C++ 6.0 generates the same code, except that it uses 'inc eax' where Visual C++ 2005 uses 'add eax, 1' (why Microsoft chose to create larger code with Visual C++ 2005, I have no idea).
And with the latest GNU C Compiler v4.1.1:
cmp BYTE PTR [%esp+4], 1
sbb %eax, %eax
and %eax, 3
sub %eax, 2
ret
Note that in the first two cases (Visual C++ 2005 and Visual C++ 6.0), the SBB instruction reads the full EAX register right after a write to a partial register (in this case, AL). This causes a partial register stall: the processor cannot execute the SBB until the new AL has been combined with the rest of EAX.
In the case of the GCC code, the SBB line may or may not stall depending on when the EAX register was last updated and whether all or part of it was updated.
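The arithmetic behind the two branchless sequences above can be traced in plain C. This is a sketch that mirrors each instruction with one statement; the function names msvc_sequence and gcc_sequence are hypothetical, the carry flag is modelled as an ordinary variable, and the final conversion to a signed value assumes the usual two's-complement behaviour.

```c
#include <stdint.h>

/* mov al / neg al / sbb eax, eax / and eax, -3 / add eax, 1 */
static int32_t msvc_sequence(uint8_t ch)
{
    uint32_t cf  = (ch != 0);    /* neg al sets CF when AL was nonzero */
    uint32_t eax = 0u - cf;      /* sbb eax, eax: EAX - EAX - CF, so 0 or -1 */
    eax &= (uint32_t)-3;         /* 0 stays 0, -1 becomes -3 */
    eax += 1u;                   /* 1 or -2 */
    return (int32_t)eax;
}

/* cmp byte, 1 / sbb eax, eax / and eax, 3 / sub eax, 2 */
static int32_t gcc_sequence(uint8_t ch)
{
    uint32_t cf  = (ch < 1);     /* cmp sets CF when the byte is below 1 */
    uint32_t eax = 0u - cf;      /* -1 or 0 */
    eax &= 3u;                   /* 3 or 0 */
    eax -= 2u;                   /* 1 or -2 */
    return (int32_t)eax;
}
```

Both functions return 1 for a zero byte and -2 for every other value, matching foo. The result of SBB does not depend on the old contents of EAX, yet the instruction still has to read the register, which is where the stall comes from.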
Avoiding
The safe way to avoid a partial stall is to always write to the full 32-bit register, or to operate on the full register before the partial update. The following code, generated by the Intel C++ Compiler v9.1, is faster than the other compiled code on this page and avoids the partial register stall, but it is also about 1.5x as large (21 bytes before the ret, against 14 for the Visual C++ 2005 listing):
00000 0f b6 54 24 04    movzx edx, BYTE PTR [esp+4]
00005 b9 01 00 00 00    mov ecx, 1
0000a b8 fe ff ff ff    mov eax, -2
0000f 83 fa 00          cmp edx, 0
00012 0f 44 c1          cmove eax, ecx
00015 c3                ret
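As a rough illustration, the same sequence can be written statement by statement in C. The function name icc_sequence is hypothetical; the point of the sketch is that every assignment replaces a whole 32-bit value, so nothing ever has to be merged with older register contents.

```c
#include <stdint.h>

static int32_t icc_sequence(uint8_t ch)
{
    uint32_t edx = ch;    /* movzx edx, byte: zero-extends, writes all of EDX */
    int32_t  ecx = 1;     /* mov ecx, 1 */
    int32_t  eax = -2;    /* mov eax, -2 */

    if (edx == 0)         /* cmp edx, 0 */
        eax = ecx;        /* cmove eax, ecx: taken only when EDX is zero */

    return eax;
}
```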
Comparison
It's very difficult to give a cycle-by-cycle comparison without a simulation and without the ability to write a kernel-level driver to monitor at the hardware level. It is possible, however, to show how much faster or slower these sequences are relative to one another. The basic code structure for such a benchmark is this (written using the CrissCross framework):
unsigned int i;
double start, finish, diff;
start = GetHighResTime();
start = GetHighResTime(); // Called twice to eliminate any potential slowdown caused by first-time calling.
for ( i = 0; i < 1000000; i++ )
    function_to_benchmark();
finish = GetHighResTime();
diff = finish - start;
console->WriteLine ( "%lf :: function_to_benchmark ()", diff );
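For readers without CrissCross, the same structure can be sketched with the standard C library alone. This version is an assumption-laden stand-in, not the code used for the measurements here: it swaps GetHighResTime() for clock(), which is usually much coarser, and the names time_calls and function_to_benchmark (a dummy placeholder) are hypothetical.

```c
#include <time.h>

/* Hypothetical stand-in for the function under test. */
static int function_to_benchmark(void)
{
    return 1;
}

/* Returns the CPU time, in seconds, spent on `iterations` calls of fn. */
static double time_calls(unsigned long iterations, int (*fn)(void))
{
    volatile int sink = 0;   /* stores to a volatile keep the calls from being discarded */
    unsigned long i;
    clock_t start, finish;

    start = clock();
    for (i = 0; i < iterations; i++)
        sink = fn();
    finish = clock();
    (void)sink;

    return (double)(finish - start) / CLOCKS_PER_SEC;
}
```

Calling time_calls ( 1000000UL, function_to_benchmark ) gives a value comparable to diff above. Because clock() may tick only every millisecond or so on some systems, a larger iteration count may be needed before the differences become visible.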
Setup Procedures
When doing this, we need to replace function_to_benchmark() with the appropriate test function.
We write these functions using inline assembly so that we get the absolute highest performance we can.
For Visual C++ 2005 code:
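static inline int
partial_register_stall ( unsigned char _ch )
{
    __asm
    {
        mov al, _ch;    // partial write: only AL changes
        neg al;         // CF = ( _ch != 0 )
        sbb eax, eax;   // full read of EAX after the AL write
        and eax, -3;    // 0 or -3
        add eax, 1;     // 1 or -2, left in EAX as the return value
    }
}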
For the GCC v4.1.1 code (written in Visual C++ inline assembly syntax, so it will not compile with GCC itself, which expects AT&T syntax):
static inline int
partial_register_stall ( unsigned char _ch )
{
    __asm
    {
        cmp _ch, 1;     // CF = ( _ch == 0 )
        sbb eax, eax;   // -1 or 0
        and eax, 3;     // 3 or 0
        sub eax, 2;     // 1 or -2, left in EAX as the return value
    }
}
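To build the same sequence with GCC itself, it would have to be expressed as an extended asm statement in AT&T syntax. The following is a sketch of how that might look, assuming GCC or Clang on an x86 target; it was not part of the original benchmark, and the plain C fallback for other targets is added here only so the function compiles everywhere.

```c
static inline int
partial_register_stall ( unsigned char _ch )
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    int result;
    __asm__ ( "cmpb $1, %1\n\t"     /* CF = ( _ch == 0 ) */
              "sbbl %0, %0\n\t"     /* result = -CF */
              "andl $3, %0\n\t"     /* 3 or 0 */
              "subl $2, %0"         /* 1 or -2 */
              : "=r" ( result )     /* output: any 32-bit register */
              : "qm" ( _ch )        /* input: byte register or memory */
              : "cc" );             /* the flags are modified */
    return result;
#else
    return ( _ch == 0 ) ? 1 : -2;   /* fallback for non-x86 targets (assumption) */
#endif
}
```

Unlike the Visual C++ form, the operands are listed explicitly, so the compiler picks the registers and the result is returned through an ordinary variable instead of being left in EAX.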
And finally, the Intel C++ Compiler v9.1 code:
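static inline int
partial_register_stall ( unsigned char _ch )
{
    __asm
    {
        movzx edx, _ch; // full write: all of EDX is set
        mov ecx, 1;     // value for _ch == 0
        mov eax, -2;    // value for _ch != 0
        cmp edx, 0;
        cmove eax, ecx; // pick 1 when _ch == 0
    }
}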
Or, if you want to have an easy way to switch between them, you could write it like this:
static inline int
partial_register_stall ( unsigned char _ch )
{
    __asm
    {
#if 0
        // Visual C++ 2005
        mov al, _ch;
        neg al;
        sbb eax, eax;
        and eax, -3;
        add eax, 1;
#elif 1
        // Intel C++ Compiler v9.1
        movzx edx, _ch;
        mov ecx, 1;
        mov eax, -2;
        cmp edx, 0;
        cmove eax, ecx;
#elif 0
        // GCC v4.1.1
        cmp _ch, 1;
        sbb eax, eax;
        and eax, 3;
        sub eax, 2;
#endif
    }
}
Results
These results were measured on an Intel Core Duo T2300 (1.66 GHz) machine and show the best of three runs (units are seconds):
Visual C++ 2005
0.003657 :: partial_register_stall ( 0 ) ( ret == 1 )
0.003608 :: partial_register_stall ( 1 ) ( ret == -2 )
GCC v4.1.1
0.002812 :: partial_register_stall ( 0 ) ( ret == 1 )
0.002795 :: partial_register_stall ( 1 ) ( ret == -2 )
Intel C++ Compiler v9.1
0.002469 :: partial_register_stall ( 0 ) ( ret == 1 )
0.002493 :: partial_register_stall ( 1 ) ( ret == -2 )
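To put the timings in relative terms, the ratios can be worked out with a few lines of C. The figures are copied from the results above; the array names and the slowdown helper are hypothetical and only serve the illustration.

```c
/* Best-of-three timings in seconds; index 0 is the call with 0, index 1 the call with 1. */
static const double msvc_time[2]  = { 0.003657, 0.003608 };
static const double gcc_time[2]   = { 0.002812, 0.002795 };
static const double intel_time[2] = { 0.002469, 0.002493 };

/* How many times longer `slower` took than `faster`. */
static double slowdown(double slower, double faster)
{
    return slower / faster;
}
```

For the zero-byte case, slowdown ( msvc_time[0], intel_time[0] ) comes to about 1.48 and slowdown ( gcc_time[0], intel_time[0] ) to about 1.14, so on this machine the Visual C++ 2005 sequence took roughly half again as long as the Intel one.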
