optimizing acmp in L-World

Thu Jul 26 01:14:51 UTC 2018

I've measured new Valhalla acmp code on legacy (non-value types code).
Performance regression of acmp operation is 10%-20% comparing to before 
valhalla world. It was measured on microbenchmarks and looks like it 
shouldn't be significantly visible on real apps except some corner cases.
I to suggest to a small improvement which make sense to try.

I really that trick:
>    bool new_acmp(ptr x, ptr y) {
>      if (x == null)  return y == null;
>      int bit = (x->class_in_header & Q_TYPE_BIT);
>      ptr x1 = x | swar_broadcast_bit(bit != 0);
>      return old_acmp(x1, y);
>    }
>    int swar_broadcast_bit(bool z) {
>      return z ? -1 : 0;  // use ALU ops
>    }
And right now the code checking and mixing if class is value looks like:

mov    0x8(%rcx),%r10d  // load class ptr
shr     $0x3,%r10
and    $0x1,%r10            // here r10 is 1 if value, 0 if not value
add    %r10,%rcx            // mix with value bit

> Suppose there is a branch free technique for implementing is_value_ptr
With the current is_value_ptr scheme I want to suggest branch free 
technique and return to initial idea of codegeneration
> So the logic looks like this, and the new semantics are in
> the second "if" statement:
>
>    bool new_acmp(ptr x, ptr y) {
>      if (!old_acmp(x, y))  return false;  // x != y => false
>      if (is_value_ptr(x))  return false;  // x.is_value => false
>      return true;
>    }
>    bool old_acmp(ptr x, ptr y) {
>      return x == y;  // cmpq etc.
>    }
>
I'd rather rewrite to:

   bool new_acmp(ptr x, ptr y) {
     if (!old_acmp(x, y))  return false;  // x != y => false
     return x!=null && !is_value_ptr(x);
   }

and the following will get us required 0 and 1 from this

mov    0x8(%rcx),%r10d  // load class ptr
shr     $0x3,%r10
and    $0x1,%r10            // here r10 is 1 if value, 0 if not value
xor     $0x1,%r10            // here r10 false if value, true if not value.

Checking currently generated code I'd split it into traces and count asm 
instructions, and count asm instructions which could new in suggested 
codegeneration:
(doesn't count unconditional branches)

case x==null & y==null:
        current code:  3 ALU ops + 2 branches
   suggested code:  3 ALU ops + 2 branches

case x==null & y!=null:
        current code:  3 ALU ops + 2 branches
   suggested code:  2 ALU ops + 1 branches

case x==y:
        current code:  6 ALU ops + 2 branches + 1 load from memory
   suggested code:  5 ALU ops + 2 branches + 1 load from memory

case x!=y:
        current code:  6 ALU ops + 2 branches + 1 load from memory
   suggested code:  2 ALU ops + 1 branch

In some cases acmp cost is the same as in before valhalla world  (2 ALU 
ops + 1 branch)

The most important fact here is not a quite small decrease of ALU 
instructions, which honestly won't be visible even on microbenchmarks, 
but the fact in case of x!=y we eliminate load from memory which even in 
hit to L1 cache will cost as all other instructions and much more in 
case of cache miss.

The final suggested code (simplified):

     cmp rx,ry
     je equal
     xor eax,eax
     jmp end

equal:
     test rx,rx
     je is_null
     mov klass ptr from memory to eax
     shr 3,eax
     and 1,eax
     xor  1,eax
     jmp end

is_null:
    mov 1,eax
    jmp end

end:
   // eax contains result