Thursday, 5 November 2015

Short-circuit evaluation is not always your best friend.

Whilst doing some code reviews the other day I came across a piece of code that (structurally) looked like this:

if (value1 & value2 & value3)
{
    // Do something
}
else
{
    // Do something else.
}

I initially thought this was a harmless typo. If you were trying determine if 3 boolean values happened to all be true this would evaluate the same as if you had used the logical && operator ((0x01 & 0x01 & 0x01) == (true && true && true)).

Fair enough, I thought, but out of interest I wondered how different this would appear at the assembly level, so I opened up Visual Studio's disassembler. The results were very interesting. I created  a simple program that does comparisons on 3 different boolean values from 3 different arrays in a loop and increments one of two counters depending on whether or not all three elements are true:

bool g_Array1[NUM_ELEMENTS];
bool g_Array2[NUM_ELEMENTS];
bool g_Array3[NUM_ELEMENTS];

// Initialize to random values.

int numPassed = 0;
int numFailed = 0;

for (int i = 0; i < NUM_ELEMENTS; i++)
{
    if (g_Array1[i] AND g_Array2[i] AND g_Array3[i])
        numPassed++;
    else
        numFailed++;
}

Let's focus on the inner loop's generated code:

Bitwise AND

 for (int i = 0; i < NUM_ELEMENTS; i++)
008D10AE  xor         ecx,ecx  
 {
  if (g_Array1[i] & g_Array2[i] & g_Array3[i])
008D10B0  mov         al,byte ptr [ecx+9C75B0h]  
008D10B6  and         al,byte ptr [ecx+0ABB7F0h]  
008D10BC  test        byte ptr [ecx+8D3370h],al  
008D10C2  je          main+57h (08D10C7h)  
   numPassed++;
008D10C4  inc         esi  
  else
008D10C5  jmp         main+58h (08D10C8h)  
   numFailed++;
008D10C7  inc         edi  
 for (int i = 0; i < NUM_ELEMENTS; i++)
008D10C8  inc         ecx  
008D10C9  cmp         ecx,0F4240h  
008D10CF  jl          main+40h (08D10B0h)  
 }

This is as one would expect. Load the first boolean value into a register, perform a bitwise AND with the second, and finally test (set the contents of the status register to the result of the AND of) the third boolean value. This is followed by a jump to increment the correct counter.

Logical AND

 for (int i = 0; i < NUM_ELEMENTS; i++)
00F210AE  xor         eax,eax  
 {
  if (g_Array1[i] && g_Array2[i] && g_Array3[i])
00F210B0  cmp         byte ptr [eax+0F23370h],0  
00F210B7  je          main+5Eh (0F210CEh)  
00F210B9  cmp         byte ptr [eax+110B7F0h],0  
00F210C0  je          main+5Eh (0F210CEh)  
00F210C2  cmp         byte ptr [eax+10175B0h],0  
00F210C9  je          main+5Eh (0F210CEh)  
   numPassed++;
00F210CB  inc         esi  
  else
00F210CC  jmp         main+5Fh (0F210CFh)  
   numFailed++;
00F210CE  inc         edi  
 for (int i = 0; i < NUM_ELEMENTS; i++)
00F210CF  inc         eax  
00F210D0  cmp         eax,0F4240h  
00F210D5  jl          main+40h (0F210B0h)  
 }

In this context, we see something strange. The logical AND forces a comparison/potential branch for every argument in the comparison. Now conventional wisdom says to make sure your code is as branch free as possible (Modern high frequency, deeply pipelined CPUs like being able to prefetch and execute instructions well in advance and branching introduces an element of uncertainty into the process), but there is a reason for the code to be this way.

C like languages use an idiom called "short-circuiting" in the evaluation of their logical operator statements. What it entails is that evaluation of the arguments of a logical operation will only happen to the point where the result of the comparison can be determined. For example if you have an AND comparison to test if 100 arguments are all true, and the first argument evaluates to false, there will be no point in wasting time evaluating the other arguments. Likewise for the logical OR operator, which will jump to its "true" branch as soon as one of its arguments evaluates to true. This is a good thing if determining your arguments is a potentially expensive operation. After all, if you use your imagination a little, your code could potentially need to do this:

if (FetchFromDatabase() && LinkedListTraverse() && CreateNewFile())

However, for operations involving simple arguments, short-circuiting isn't optimal.

The only question left is just how much performance could you possibly gain by this kind of micro optimization? I set the size of the arrays in my test program to a million elements, populated with random boolean values. The loop we discussed is performed on them, and timings taken.
The program is so simple, I won't bother including code, just the final results.

For the logical AND evaluation, the time taken in the loop is ~7.10 milliseconds.

For the bitwise AND evaluation, the time taken in the loop drops to ~2.4 milliseconds.

That's almost 3x as fast in this case (Proportional to the number of branches in the loop?).
This code was executed on a desktop grade CPU, so it's not unreasonable to imagine that results on simpler architectures would be higher.