Question: Is if else faster than if + default?

Added at 2016-12-28 10:12

Question

I conducted a simple experiment to compare if-else against a lone if (with default values preset). Example:

void test0(char c, int *x) {
    *x = 0;
    if (c == 99) {
        *x = 15;
    }
}

void test1(char c, int *x) {
    if (c == 99) {
        *x = 15;
    } else {
        *x = 0;
    }
}

For the functions above, I got the exact same assembly code (using cmovne).
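
Conceptually, both functions collapse to one store whose value is selected by the condition. A minimal C sketch of that merged form (the name test01 is mine, just for illustration):

void test01(char c, int *x) {
    /* a single select; per the observation above, the compiler
       lowers this with cmovne rather than a branch */
    *x = (c == 99) ? 15 : 0;
}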

However when adding an extra variable:

void test2(char c, int *x, int *y) {
    *x = 0;
    *y = 0;
    if (c == 99) {
        *x = 15;
        *y = 21;
    }
}

void test3(char c, int *x, int *y) {
    if (c == 99) {
        *x = 15;
        *y = 21;
    } else {
        *x = 0;
        *y = 0;
    }
}

The assembly suddenly becomes different:

test2(char, int*, int*):
        cmp     dil, 99
        mov     DWORD PTR [rsi], 0
        mov     DWORD PTR [rdx], 0
        je      .L10
        rep ret
.L10:
        mov     DWORD PTR [rsi], 15
        mov     DWORD PTR [rdx], 21
        ret
test3(char, int*, int*):
        cmp     dil, 99
        je      .L14
        mov     DWORD PTR [rsi], 0
        mov     DWORD PTR [rdx], 0
        ret
.L14:
        mov     DWORD PTR [rsi], 15
        mov     DWORD PTR [rdx], 21
        ret

It seems that the only difference is whether the top movs are done before or after the je.

Now (sorry, my assembly is a bit crude), isn't it always better to have the movs after the jump, in order to avoid pipeline flushes? And if so, why wouldn't the optimizer (gcc 6.2 -O3) use the better method?

Answers
#1, added 2016-12-28 11:12

For the functions above, I got the exact same assembly code (using cmovne).

Sure, some compilers may make that optimization, but it is not guaranteed. It is very possible that you will get different object code for those two ways of writing the function.

In fact, no optimization is guaranteed (although modern optimizing compilers do an impressive job most of the time), so you should either write the code to capture the semantic meaning you intend it to have, or verify the generated object code and adjust the source until you get the expected output.

Here is what older versions of MSVC will generate when targeting x86-32 (primarily because they don't know to use the CMOV instruction):

test0 PROC
    cmp      BYTE PTR [c], 99
    mov      eax, DWORD PTR [x]
    mov      DWORD PTR [eax], 0
    jne      SHORT LN2
    mov      DWORD PTR [eax], 15
LN2:
    ret      0
test0 ENDP
test1 PROC
    mov      eax, DWORD PTR [x]
    xor      ecx, ecx
    cmp      BYTE PTR [c], 99
    setne    cl
    dec      ecx
    and      ecx, 15
    mov      DWORD PTR [eax], ecx
    ret      0
test1 ENDP

Note that test1 gives you branchless code that utilizes the SETNE instruction (a conditional set, which will set its operand to 0 or 1 based on the condition code—in this case, NE) in conjunction with some bit-manipulation to produce the correct value. test0 uses a conditional branch to skip over the assignment of 15 to *x.
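
To see what that branchless sequence computes, here is a C rendering of it (a sketch; the function name is mine). The SETNE/DEC/AND triple builds an all-ones or all-zeros mask and then masks the constant 15:

void test1_branchless(char c, int *x) {
    unsigned t    = (c != 99);   /* setne: 0 or 1                     */
    unsigned mask = t - 1u;      /* dec:   0 -> all-ones, 1 -> 0      */
    *x = (int)(mask & 15u);      /* and:   15 when c == 99, else 0    */
}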

The reason this is interesting is because it is almost exactly the opposite of what you might expect. Naïvely, one might expect that test0 would be the way you'd hold the optimizer's hand and get it to generate branchless code. At least, that's the first thought that went through my head. But in fact, that is not the case! The optimizer is able to recognize the if/else idiom and optimize accordingly! It is not able to make that same optimization in the case of test0, where you tried to outsmart it.

However when adding an extra variable ... The assembly suddenly becomes different

Well, no surprise there. A small change in the code can often have a significant effect on the emitted code. Optimizers are not magic; they are just really complex pattern matchers. You changed the pattern!

Granted, an optimizing compiler could have used two conditional moves here to generate branchless code. In fact, that is precisely what Clang 3.9 does for test3 (but not for test2, consistent with our above analysis showing that optimizers may be better able to recognize standard patterns than unusual ones). But GCC doesn't do this. Again, there is no guarantee of a particular optimization being performed.
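
For reference, a fully branchless test3 computes two independent selects, each of which a compiler can lower to a CMOV. In C, a sketch (the function name is mine):

void test3_branchless(char c, int *x, int *y) {
    int cond = (c == 99);
    *x = cond ? 15 : 0;   /* candidate for one conditional move */
    *y = cond ? 21 : 0;   /* and another                        */
}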

It seems that the only difference is whether the top "mov"s are done before or after the "je".

Now (sorry, my assembly is a bit crude), isn't it always better to have the movs after the jump, in order to avoid pipeline flushes?

No, not really. That would not improve the code in this case. If the branch is mispredicted, you're going to have a pipeline flush no matter what. It doesn't much matter whether the speculatively executed code is a ret instruction or a mov instruction.

The only reason it would matter that a ret instruction immediately follows a conditional branch is if you were writing the assembly code by hand and didn't know to use a rep ret instruction. This trick is needed on certain AMD processors to avoid a branch-prediction penalty. Unless you were an assembly guru, you probably wouldn't have known it. But the compiler does know it, and it also knows the trick is unnecessary when you are specifically targeting an Intel processor, or a generation of AMD processors that doesn't have this quirk.

However, you might be right about it being better to have the movs after the branch, but not for the reason you suggested. Modern processors (I believe this is Nehalem and later, but I'd look it up in Agner Fog's excellent optimization guides if I needed to verify) are capable of macro-op fusion under certain circumstances. Basically, macro-op fusion means that the CPU's decoder will combine two eligible instructions into one micro-op, saving bandwidth at all stages of the pipeline. A cmp or test instruction followed by a conditional branch instruction, as you see in test3, is eligible for macro-op fusion (actually, there are other conditions that must be met, but this code does meet those requirements). Scheduling other instructions in between the cmp and je, as you see in test2, makes macro-op fusion impossible, potentially making the code execute more slowly.

Arguably, though, this is an optimization defect in the compiler. It could have reordered the mov instructions to place the je immediately after the cmp, preserving the ability for macro-op fusion:

test2a(char, int*, int*):
    mov     DWORD PTR [rsi], 0    ; do the default initialization *first*
    mov     DWORD PTR [rdx], 0
    cmp     dil, 99               ; this is now followed immediately by the conditional
    je      .L10                  ;  branch, making macro-op fusion possible
    rep ret
.L10:
    mov     DWORD PTR [rsi], 15
    mov     DWORD PTR [rdx], 21
    ret

Another difference between the object code for test2 and test3 is code size. Thanks to padding emitted by the optimizer to align the branch target, the code for test3 is 4 bytes larger than the code for test2. That difference is very unlikely to matter, though, especially if this code is not executed within a tight loop where it is guaranteed to be hot in the cache.

So, does that mean you should always write the code as you did in test2?
Well, no, for several reasons:

  1. As we have seen, it might be a pessimization, since the optimizer may not recognize the pattern.
  2. You should write code for readability and semantic correctness first, only going back to optimize it when your profiler indicates that it is actually a bottleneck. And then, you should only optimize after inspecting and verifying the object code emitted by your compiler, otherwise you could end up with a pessimization. (The standard "trust your compiler until proven otherwise" advice.)
  3. Even though it may be optimal in certain very simple cases, the "preset" idiom is not generalizable. If your initialization is time-consuming, it may be faster to skip over it when possible. (There is one example, discussed in the context of VB 6, where string manipulation is so slow that eliding it when possible actually results in faster execution than fancy branchless code. More generally, the same rationale applies whenever you can branch around an expensive function call.)

    Even here, where it appears to result in very simple and possibly more optimal code, it may actually be slower because you are writing to memory twice in the case where c is equal to 99, and saving nothing in the case where c is not equal to 99.

    You might save this cost by rewriting the code such that it accumulates the final value in a temporary register, only storing it to memory at the end, e.g.:

    test2b(char, int*, int*):
        xor     eax, eax               ; pre-zero the EAX register
        xor     ecx, ecx               ; pre-zero the ECX register
        cmp     dil, 99
        je      Done
        mov     eax, 15                ; change the value in EAX if necessary
        mov     ecx, 21                ; change the value in ECX if necessary
    Done:
        mov     DWORD PTR [rsi], eax   ; store our final temp values to memory
        mov     DWORD PTR [rdx], ecx
        ret
    

    but this clobbers two additional registers (eax and ecx) and may not actually be faster. You'd have to benchmark it. Or trust the compiler to emit this code when it is actually optimal, such as when it has inlined a function like test2 within a tight loop.

  4. Even if you could guarantee that writing the code in a certain way would cause the compiler to emit branchless code, this would not necessarily be faster! While branches are slow when they are mispredicted, mispredictions are actually quite rare. Modern processors have extremely good branch prediction engines, achieving prediction accuracies of greater than 99% in most cases.

    Conditional moves are great for avoiding branch mispredictions, but they have the important disadvantage of lengthening a dependency chain. By contrast, a correctly predicted branch breaks the dependency chain. (This is probably why GCC doesn't emit two CMOV instructions when you add the extra variable.) A conditional move is only a performance win if you expect branch prediction to fail. If you can count on a prediction success rate of ~75% or better, a conditional branch is probably faster, because it breaks the dependency chain and has a lower latency. I would suspect that is the case here, unless c alternates rapidly between 99 and not-99 on successive calls. (See Agner Fog's "Optimizing subroutines in assembly language", pp. 70–71.) A sketch of this tradeoff follows the list.
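
Here is a sketch of the classic situation where that tradeoff is visible (the names are mine, not from the answer): a reduction whose condition the compiler may lower either as a branch or as a conditional move. With random input the branch mispredicts roughly half the time and a branchless lowering wins; with sorted or mostly-uniform input the branch predicts almost perfectly and wins, because it breaks the dependency chain:

#include <stddef.h>

long sum_big(const int *a, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (a[i] >= 128)      /* may compile to a branch or to a cmov/add */
            sum += a[i];
    }
    return sum;
}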
