RFC20


Branching vs cmov vs gcc vs clang

I recently solved a performance riddle with some binary tree code I’m working on. For fun I’m writing all sorts of different binary tree variations. I’m primarily using gcc, but I occasionally spot check things using clang. The relative performance between each implementation was unexpectedly large. I was also observing inexplicably good performance improvements by merely switching to clang. The improvement switching to clang was well above anything I was expecting: 20%, 30%, even nearly 50%. Why would clang be generating code that performed so well compared to gcc?

TLDR: branches vs branchless code. In this project clang often generates code that uses cmov or is otherwise branchless, whereas gcc generates code that uses branches for the same source. Across my different tree implementations, ostensibly similar code can induce gcc to choose branchless or branched code, resulting in large performance differences. In the particular cases I’m looking at, branchless is better.

Example 1: An array by any other name

Consider a node definition:

struct Node {
    int key;
    Node* left;
    Node* right;
};

and a function using it:

Node* GetInsertEqualPos(Node* x, Node* y, int key) {
    while (x != nullptr) {
        y = x;
        x = (key < x->key) ? x->left : x->right;
    }
    return y;
}

This function finds the correct parent under which to insert a new node holding key. In this case both gcc and clang generate a tight “branchless” loop, which is to say that the conditional assignment of x inside the loop is performed without an explicit “jump” in the code.
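
To make that concrete, here is a minimal sketch of how the returned parent might be used. The Insert helper below is illustrative only, not code from my trees:

Node* Insert(Node*& root, Node* n) {
    // Find the node to hang the new node from; a null parent means
    // the tree was empty.
    Node* parent = GetInsertEqualPos(root, nullptr, n->key);
    n->left = n->right = nullptr;
    if (parent == nullptr) {
        root = n;
    } else if (n->key < parent->key) {
        parent->left = n;
    } else {
        parent->right = n;  // equal keys go right, matching the search loop
    }
    return n;
}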

The code gcc generates is below. The tight loop is labeled .L6. The function’s x variable lives in the rcx register, y in rax. The code loads x->right into rcx unconditionally, then uses cmov to overwrite it with x->left based on the key comparison. The only branch the processor attempts to predict is jne .L6, which it will tend to predict successfully since this code usually loops multiple times.

GetInsertEqualPos(Node*, Node*, int):
  mov rax, rdi
  test rdi, rdi
  jne .L4
  jmp .L8
.L6:
  mov rax, rcx
.L4:
  mov rcx, QWORD PTR [rax+16]
  cmp edx, DWORD PTR [rax]
  cmovl rcx, QWORD PTR [rax+8]
  test rcx, rcx
  jne .L6
  ret
.L8:
  mov rax, rsi
  ret

Clang generates code that is different and perhaps even clever. The register assignments change a bit: x is now kept in rdi, while y is still in rax. The rcx register, referred to here by its aliases ecx and cl, is used as an index into the Node struct, picking either the left or right member based on the key comparison.

GetInsertEqualPos(Node*, Node*, int):
  mov rax, rsi
  test rdi, rdi
  je .LBB0_3
.LBB0_1:
  mov rax, rdi
  xor ecx, ecx
  cmp dword ptr [rdi], edx
  setle cl
  mov rdi, qword ptr [rdi + 8*rcx + 8]
  test rdi, rdi
  jne .LBB0_1
.LBB0_3:
  ret

This code can be seen on godbolt here.

When originally writing some of the binary tree code I had an idea similar to what clang is doing above, so my Node struct began life with a child[2] array like this:

struct Node {
    int key;
    Node* child[2];
};

And the GetInsertEqualPos function was written like this:

Node* GetInsertEqualPos(Node* x, Node* y, int key) {
    while (x != nullptr) {
        y = x;
        x = x->child[!(key < x->key)];
    }
    return y;
}

With this change both gcc and clang generate nearly identical code, similar to the clang example above, as can be seen on godbolt here.

At some point I wanted to make the code more readable, so I introduced helpers and then modified the code to use them:

struct Node {
    int key;
    Node* child[2];
};

Node* left(Node* x) { return x->child[0]; }
Node* right(Node* x) { return x->child[1]; }

Node* GetInsertEqualPos(Node* x, Node* y, int key) {
    while (x != nullptr) {
        y = x;
        x = (key < x->key) ? left(x) : right(x);
    }
    return y;
}

With this, clang generated essentially the same code as it has in every example so far, but gcc generated yet another variation that used explicit branches for everything. This can be seen on godbolt here, or below:

GetInsertEqualPos(Node*, Node*, int):
  mov rax, rdi
  test rdi, rdi
  jne .L5
  jmp .L9
.L11:
  mov rcx, QWORD PTR [rax+8]
  test rcx, rcx
  je .L10
.L7:
  mov rax, rcx
.L5:
  cmp DWORD PTR [rax], edx
  jg .L11
  mov rcx, QWORD PTR [rax+16]
  test rcx, rcx
  jne .L7
.L10:
  ret
.L9:
  mov rax, rsi
  ret

And it was this variation that was abysmally slow. A binary tree search takes the left or right path at each node nearly at random. The processor was unable to predict these branches successfully, so the code paid the cost of a branch misprediction quite often. In my benchmarks this penalty amounted to about a 20-30% performance loss.

I switched back to explicit Node* left; and Node* right; fields and gcc once again generated branchless code. Et voilà, my Red-Black tree was now on par with C++’s std::multimap when compiled with gcc.

Example 2: The lower bounds of sadness

This is an example of a function that clang seems happy to compile to branch free code, but gcc does not. I can’t figure out how to get gcc to use branch free code, hence the sadness. This function returns the “lower bound” for a key, that is, the first node with a key greater than or equal to the given key. The code is quite similar to GetInsertEqualPos except that there is another conditionally assigned variable in the loop. Using the same Node definition, the code looks like this:

Node* LowerBound(Node* x, int key) {
  Node* lower = nullptr;
  while (x != nullptr) {
    if (!(x->key < key)) {
      lower = x;
      x = x->left;
    } else {
      x = x->right;
    }
  }
  return lower;
}

Now, if we compile this code with clang and gcc on godbolt we get this for gcc:

LowerBound(Node*, int):
  xor ecx, ecx
.L8:
  test rdi, rdi
  je .L1
.L10:
  mov rax, QWORD PTR [rdi]    # rax = node->left
  mov rdx, QWORD PTR [rdi+8]  # rdx = node->right
  cmp esi, DWORD PTR [rdi+16] # cmp key and node->key
  jle .L6
  mov rdi, rdx                # x = x->right
  test rdi, rdi
  jne .L10                    # goto .L10 if x != 0
.L1:
  mov rax, rcx
  ret
.L6:
  mov rcx, rdi                # lower = x
  mov rdi, rax                # x = x->left
  jmp .L8

Here, gcc is using comparison tests and branches for everything, placing the code for the lower = x branch under the .L6 label. The branch that the CPU simply can’t predict reliably is the jle .L6 on line 10, since the binary tree search tends to zig zag arbitrarily down the tree.

Clang chooses cmov for both assignments:

LowerBound(Node*, int):
  xor eax, eax
  test rdi, rdi
  je .LBB0_3
.LBB0_1:
  lea rcx, [rdi + 8]
  cmp dword ptr [rdi + 16], esi
  cmovge rcx, rdi
  cmovge rax, rdi
  mov rdi, qword ptr [rcx]
  test rdi, rdi
  jne .LBB0_1
.LBB0_3:
  ret

Like the GetInsertEqualPos function, LowerBound performs better with branch free code. So, let’s look at one way to force the issue:

#include <cstdint>      // std::uintptr_t
#include <type_traits>  // std::make_unsigned

template <typename Integer>
Integer Select(bool condition, Integer a, Integer b) {
  using Unsigned = typename std::make_unsigned<Integer>::type;
  // Turn the bool into a mask that is either all zeros or all ones,
  // then use the mask to pick a or b without a branch.
  Unsigned mask = condition ? static_cast<Unsigned>(-1) : 0;
  return (mask & a) | (~mask & b);
}

template <typename T>
T* Select(bool condition, T* a, T* b) {
  auto au = reinterpret_cast<std::uintptr_t>(a);
  auto bu = reinterpret_cast<std::uintptr_t>(b);
  auto ru = Select(condition, au, bu);
  return reinterpret_cast<T*>(ru);
}

// LowerBound returns the first node in
// the tree rooted at "x" whose key is
// not less than "key", or null if there
// is no such key.
Node* LowerBound(Node* x, int key) {
  Node* lower = nullptr;
  while (x != nullptr) {
    bool cond = !(x->key < key);
    lower = Select(cond, x, lower);
    x = Select(cond, x->left, x->right);
  }
  return lower;
}

What have we done here? Branch free code using bit hackery. The mask computation in the first Select template contains all the trickery, but the basic idea is to turn a boolean into a bitmask that is either all zeros or all ones, then use that bitmask to choose one of the two integers.
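
As a quick sanity check, here is a small test of the mask trick; this snippet is illustrative only and assumes the Select overloads and the Node struct above are in scope:

#include <cassert>

int main() {
  // condition == true  -> mask is all ones  -> (mask & a) | (~mask & b) == a
  // condition == false -> mask is all zeros -> (mask & a) | (~mask & b) == b
  assert(Select(true, 7, 9) == 7);
  assert(Select(false, 7, 9) == 9);

  Node a{}, b{};
  assert(Select(true, &a, &b) == &a);
  assert(Select(false, &a, &b) == &b);
  return 0;
}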

Gcc generates branch free code (godbolt link):

LowerBound(Node*, int):
  xor ecx, ecx
  test rdi, rdi
  je .L1
.L3:
  xor eax, eax
  cmp DWORD PTR [rdi+16], esi
  mov rdx, rdi
  mov r8, QWORD PTR [rdi]
  setge al
  xor rdx, rcx
  neg rax
  and rdx, rax
  xor rcx, rdx
  mov rdx, QWORD PTR [rdi+8]
  xor r8, rdx
  mov rdi, rdx
  and rax, r8
  xor rdi, rax
  cmp rdx, rax
  jne .L3
.L1:
  mov rax, rcx
  ret

So, I guess that is good? We’ll see soon. First, the cool part: clang has generated effectively the same cmov-based code for this source as it did for the straightforward source written with if statements. Clang essentially de-obfuscated the bit twiddling done in the new Select function, figured out its final effect, and emitted semantically equivalent code using cmov.

LowerBound(Node*, int): # @LowerBound(Node*, int)
  xor eax, eax
  test rdi, rdi
  je .LBB0_3
.LBB0_1: # =>This Inner Loop Header: Depth=1
  lea rcx, [rdi + 8]
  cmp dword ptr [rdi + 16], esi
  cmovge rax, rdi
  cmovge rcx, rdi
  mov rdi, qword ptr [rcx]
  test rdi, rdi
  jne .LBB0_1
.LBB0_3:
  ret

Now, as we saw, gcc didn’t do quite as well, but it did generate branch free code. How “bad” was the branchy code before, and how “good” is the new code? I wrote a benchmark and, well, it depends. Generally, if the binary tree fits entirely within the L1 cache the branchy code does better, presumably because, with the data in L1, the penalty for a branch misprediction is fairly low. Once the tree is larger than the L1 cache, the branch free code begins to win big.

Height  Nodes    Branchy GCC  Branchless GCC  Change  Notes
8       255      7ns          8.3ns           +18%
10      1023     8.5ns        11ns            +33%    fits in L1 cache
11      2047     9ns          14ns            +49%    2x bigger than L1 cache
12      4095     40ns         17ns            -56%    4x bigger than L1 cache
14      16383    60ns         28ns            -54%    fits in L2 cache
16      65535    107ns        53ns            -49%
18      262143   114ns        60ns            -47%
20      1048576  148ns        85ns            -42%

How does this new “Branchless GCC” approach compare against clang, which uses cmov? Clang wins in all cases, with gcc being between 35% and 55% worse.

Height  Nodes    Clang   Branchless GCC  Change  Notes
8       255      5ns     8.3ns           +55%
10      1023     7ns     11ns            +55%    fits in L1 cache
11      2047     9ns     14ns            +49%    2x bigger than L1 cache
12      4095     11.5ns  17ns            +52%    4x bigger than L1 cache
14      16383    20ns    28ns            +44%    fits in L2 cache
16      65535    37ns    53ns            +45%
18      262143   42ns    60ns            +41%
20      1048576  63ns    85ns            +35%
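
For reference, here is roughly the shape such a benchmark can take. This is a simplified sketch, not the actual harness behind the tables above; the BuildBalanced helper, the fixed height, and the RNG seed are illustrative assumptions, and the snippet assumes the Node struct and LowerBound above are in scope:

#include <algorithm>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

// Build a perfectly balanced BST over keys lo..hi, rooted at nodes[mid].
Node* BuildBalanced(std::vector<Node>& nodes, int lo, int hi) {
  if (lo > hi) return nullptr;
  int mid = lo + (hi - lo) / 2;
  Node* n = &nodes[mid];
  n->key = mid;
  n->left = BuildBalanced(nodes, lo, mid - 1);
  n->right = BuildBalanced(nodes, mid + 1, hi);
  return n;
}

int main() {
  const int kHeight = 12;                 // vary this to sweep tree sizes
  const int kCount = (1 << kHeight) - 1;
  std::vector<Node> nodes(kCount);
  Node* root = BuildBalanced(nodes, 0, kCount - 1);

  // Shuffle the keys so each search zig-zags unpredictably down the tree.
  std::vector<int> keys(kCount);
  std::iota(keys.begin(), keys.end(), 0);
  std::shuffle(keys.begin(), keys.end(), std::mt19937{42});

  std::uintptr_t sink = 0;  // keep results live so the loop is not optimized away
  auto start = std::chrono::steady_clock::now();
  for (int key : keys) {
    sink ^= reinterpret_cast<std::uintptr_t>(LowerBound(root, key));
  }
  auto stop = std::chrono::steady_clock::now();

  double ns = std::chrono::duration<double, std::nano>(stop - start).count() / kCount;
  std::printf("height %d: %.1f ns per lookup (sink=%llu)\n", kHeight, ns,
              static_cast<unsigned long long>(sink));
  return 0;
}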

Lessons

First, don’t listen to the internet without thinking. There are tons of resources on the internet telling you things like:

  • The compiler is so good at optimization that you should just ignore what it does, write simple code, and hope for the best.
  • “Branchless” code is usually worse because it introduces a “data dependency” and therefore defeats the processor’s ability to execute things out of order. Processors are now quite good at things like out of order execution, speculative branch prediction, etc.
  • You’ll find people saying that clang is more likely to generate code with branches, whereas gcc is more likely to use cmov.

Lies, all lies! It is only through careful benchmarking and performance analysis that you can really know what is going on with your program. Learn to effectively use tools like linux perf, objdump, and godbolt.org. Learn to write and run benchmarks that are unlikely to lie to you. Doing this is a black art; see LLVM’s benchmarking tips for just a taste.

Lies, all lies? Not really. What I’ve stumbled upon here is an edge case, where code is predictably not predictable, which makes the processor’s branch prediction predictably bad. Most branches in most code are fairly predictable, so the conventional wisdom is actually correct. The trick is learning to recognize when you’ve actually got an edge case on your hands, and knowing what to do about it.

Further musing

Krister Walfridsson writes about this same issue in his blog post “Branches/cmov and compiler optimizations”. He goes into more detail about the interaction between compiler optimizations and cmov. I thought this point of his was particularly interesting:

Many (most?) cases where GCC incorrectly decides to use a branch instead of emitting branchless code comes from optimization passes transforming the code to a form where the backend cannot change it to use conditional moves.

Check out Linus Torvalds’s explanation of why cmov is “generally a bad idea on an aggressively out-of-order CPU” (backup link, and another backup link). His explanation of why predictable branches can be effectively free is succinct and clear.

Some search terms that turn up useful related stuff on the web:

  • branchless
  • predicated instruction
  • gcc’s command line flags: -fno-if-conversion -fno-if-conversion2 -fno-tree-loop-if-convert (see docs).