Answer #1, added 2016-12-28 11:12
"For the functions above, I got the exact same assembly code (using cmovne)."
Sure, some compilers may make that optimization, but it is not guaranteed. It is very possible that you will get different object code for those two ways of writing the function.
In fact, no optimization is guaranteed (although modern optimizing compilers do an impressive job most of the time), so you should either write the code to capture the semantic meaning you intend for it to have, or you should verify the generated object code and write the code to ensure that you are getting the expected output.
Here is what older versions of MSVC will generate when targeting x86-32 (primarily because they don't know to use the CMOV instruction). For test0:

        cmp     BYTE PTR [c], 99
        mov     eax, DWORD PTR [x]
        mov     DWORD PTR [eax], 0
        jne     SHORT LN2
        mov     DWORD PTR [eax], 15
LN2:

And for test1:

        mov     eax, DWORD PTR [x]
        xor     ecx, ecx
        cmp     BYTE PTR [c], 99
        setne   cl                    ; cl = (c != 99) ? 1 : 0
        dec     ecx                   ; ecx = (c == 99) ? all-ones : 0
        and     ecx, 15               ; ecx = (c == 99) ? 15 : 0
        mov     DWORD PTR [eax], ecx
test1 gives you branchless code that utilizes the SETNE instruction (a conditional set, which sets its operand to 0 or 1 based on the condition code, in this case NE) in conjunction with some bit-manipulation to produce the correct value. test0 uses a conditional branch to skip over the assignment of 15 to *x.
The reason this is interesting is that it is almost exactly the opposite of what you might expect. Naïvely, one might expect test0 to be the way you'd hold the optimizer's hand and get it to generate branchless code. At least, that was the first thought that went through my head. But in fact, that is not the case! The optimizer is able to recognize the if/else idiom and optimize accordingly. It is not able to make that same optimization in the case of test0, where you tried to outsmart it.
"However, when adding an extra variable ... the assembly suddenly becomes different."
Well, no surprise there. A small change in the code can often have a significant effect on the emitted code. Optimizers are not magic; they are just really complex pattern matchers. You changed the pattern!
Granted, an optimizing compiler could have used two conditional moves here to generate branchless code. In fact, that is precisely what Clang 3.9 does for test3 (but not for test2, consistent with our above analysis showing that optimizers may be better able to recognize standard patterns than unusual ones). But GCC doesn't do this. Again, there is no guarantee of a particular optimization being performed.
"It seems that the only difference is if the top 'mov's are done before or after the 'je'. Now (sorry my assembly is a bit crude), isn't it always better to have the movs after the jump, in order to save pipeline flushes?"
No, not really. That would not improve the code in this case. If the branch is mispredicted, you're going to have a pipeline flush no matter what. It doesn't much matter whether the speculatively executed code is a ret instruction or a mov instruction.

The only reason it would matter that a ret instruction immediately followed a conditional branch is if you were writing the assembly code by hand and didn't know to use a rep ret instruction. This is a trick necessary on certain AMD processors to avoid a branch-prediction penalty. Unless you were an assembly guru, you probably wouldn't have known it. But the compiler does, and it also knows the trick is unnecessary when you're specifically targeting an Intel processor, or a generation of AMD processors that doesn't have this quirk.
However, you might be right that it is better to have the movs after the branch, though not for the reason you suggested. Modern processors (I believe this is Nehalem and later, but I'd look it up in Agner Fog's excellent optimization guides if I needed to verify) are capable of macro-op fusion under certain circumstances. Basically, macro-op fusion means that the CPU's decoder will combine two eligible instructions into a single micro-op, saving bandwidth at all stages of the pipeline. A test or cmp instruction followed by a conditional branch instruction, as you see in test3, is eligible for macro-op fusion (there are other conditions that must be met, but this code does meet those requirements). Scheduling other instructions in between the cmp and the je, as you see in test2, makes macro-op fusion impossible, potentially making the code execute more slowly.
Arguably, though, this is an optimization defect in the compiler. It could have reordered the mov instructions to place the je immediately after the cmp, preserving the ability for macro-op fusion:
test2a(char, int*, int*):
        mov     DWORD PTR [rsi], 0    ; do the default initialization *first*
        mov     DWORD PTR [rdx], 0
        cmp     dil, 99               ; this is now followed immediately by the conditional
        je      .L10                  ; branch, making macro-op fusion possible
        mov     DWORD PTR [rsi], 15
        mov     DWORD PTR [rdx], 21
.L10:
        rep ret
Another difference between the object code for test2 and test3 is code size. Thanks to padding emitted by the optimizer to align the branch target, the code for test3 is 4 bytes larger than test2. That is very unlikely to be enough of a difference to matter, though, especially if this code is not being executed within a tight loop where it is guaranteed to be hot in the cache.
So, does that mean you should always write the code as you did in test2? Well, no, for several reasons:
- As we have seen, it might be a pessimization, since the optimizer may not recognize the pattern.
- You should write code for readability and semantic correctness first, only going back to optimize it when your profiler indicates that it is actually a bottleneck. And then, you should only optimize after inspecting and verifying the object code emitted by your compiler, otherwise you could end up with a pessimization. (The standard "trust your compiler until proven otherwise" advice.)
- Even though it may be optimal in certain very simple cases, the "preset" idiom is not generalizable. If your initialization is time-consuming, it may be faster to skip over it when possible. (There is one example discussed here, in the context of VB 6, where string-manipulation is so slow that eliding it when possible actually results in faster execution time than fancy branchless code. More generally, the same rationale applies if you are able to branch around a function call.)
- Even here, where it appears to result in very simple and possibly more optimal code, it may actually be slower, because you are writing to memory twice in the case where c is equal to 99, and saving nothing in the case where c is not equal to 99.
You might save this cost by rewriting the code so that it accumulates the final values in temporary registers, only storing them to memory at the end, e.g.:
test2b(char, int*, int*):
        xor     eax, eax              ; pre-zero the EAX register
        xor     ecx, ecx              ; pre-zero the ECX register
        cmp     dil, 99
        jne     .L20                  ; if c != 99, keep the zeros
        mov     eax, 15               ; otherwise, change the value in EAX
        mov     ecx, 21               ; and the value in ECX
.L20:
        mov     DWORD PTR [rsi], eax  ; store our final temp values to memory
        mov     DWORD PTR [rdx], ecx
        ret
but this clobbers two additional registers (eax and ecx) and may not actually be faster. You'd have to benchmark it. Or trust the compiler to emit this code when it is actually optimal, such as when it has inlined a function like test2 within a tight loop.
Even if you could guarantee that writing the code in a certain way would cause the compiler to emit branchless code, this would not necessarily be faster! While branches are slow when they are mispredicted, mispredictions are actually quite rare. Modern processors have extremely good branch prediction engines, achieving prediction accuracies of greater than 99% in most cases.
Conditional moves are great for avoiding branch mispredictions, but they have the important disadvantage of lengthening a dependency chain. By contrast, a correctly predicted branch breaks the dependency chain. (This is probably why GCC doesn't emit two CMOV instructions when you add the extra variable.) A conditional move is only a performance win if you expect branch prediction to fail. If you can count on a prediction success rate of ~75% or better, a conditional branch is probably faster, because it breaks the dependency chain and has a lower latency. And I would suspect that is the case here, unless c alternates rapidly back and forth between 99 and not-99 each time the function is called. (See Agner Fog's "Optimizing subroutines in assembly language", pp. 70–71.)