optimizing acmp in L-World

John Rose john.r.rose at oracle.com
Thu Jul 26 03:45:09 UTC 2018


Good analysis.  I like the point about avoiding the klass load.

If we were to hoist the value-bit into the oop itself, I think the analysis
would go the other way, because the header load wouldn't happen and
the null check would not be needed either.  That could possibly be where
we end up eventually, if we were to *also* hoist klass index info into 64-bit
oops.  We'd have to convince the GC to let us control the hoisted value-bit
(and klass index bits), which will take time.  So I think I buy your suggestion.
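(As an illustrative aside: a hoisted value-bit check could collapse to a single mask test. The C sketch below assumes the low bit of the oop is the value tag, which is purely hypothetical and not the current VM layout; the point is that null, being all-zero, tests as "not a value" with no separate null check and no header load.)

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical encoding: the low bit of the oop itself marks a value object. */
#define OOP_VALUE_TAG 0x1ULL

/* Branch-free, no header load; null (0) has the tag clear, so it needs
 * no separate null check. */
static bool is_value_oop(uintptr_t oop) {
    return (oop & OOP_VALUE_TAG) != 0;
}
```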

A more important limitation on the applicability of your analysis:  The JIT
routinely rearranges the acmp code based on its own static analysis of the
operands, whether they are, are not, may frequently be, or may occasionally
be null and/or a value.

Also, there's a corner that should be rounded off:  The acmp instruction
code is usually targeted to a test, not to the generation of a boolean,
so it would be natural to consider your specific code idioms as generating
a branch to one of two locations; it's only rarely a bit in a register.

If you avoid materializing the bit, you can get rid of instructions that steer
bits down to the LSB, and the xor that flips it.  For an example of taking
an answer as a branch, see MacroAssembler::check_klass_subtype and
subroutines.

When you generate a boolean as a jump, there are always two or three
versions of the code, depending on whether fallthrough is used to signal
true or false.

Specific examples are below, after your code sequence.

— John

On Jul 25, 2018, at 6:14 PM, Sergey Kuksenko <sergey.kuksenko at oracle.com> wrote:
> 
> I've measured new Valhalla acmp code on legacy (non-value types code).
> The performance regression of the acmp operation is 10%-20% compared to the pre-Valhalla world. It was measured on microbenchmarks, and it looks like it shouldn't be significantly visible in real apps except in some corner cases.
> I want to suggest a small improvement which makes sense to try.
> 
> I really like that trick:
>>   bool new_acmp(ptr x, ptr y) {
>>     if (x == null)  return y == null;
>>     int bit = (x->class_in_header & Q_TYPE_BIT);
>>     ptr x1 = x | swar_broadcast_bit(bit != 0);
>>     return old_acmp(x1, y);
>>   }
>>   int swar_broadcast_bit(bool z) {
>>     return z ? -1 : 0;  // use ALU ops
>>   }
> And right now the code that checks whether the class is a value class and mixes in the bit looks like:
> 
> mov    0x8(%rcx),%r10d  // load class ptr
> shr     $0x3,%r10
> and    $0x1,%r10            // here r10 is 1 if value, 0 if not value
> add    %r10,%rcx            // mix with value bit
> 
>> Suppose there is a branch free technique for implementing is_value_ptr
> With the current is_value_ptr scheme I want to suggest a branch-free technique and return to the initial idea of code generation:
>> So the logic looks like this, and the new semantics are in
>> the second "if" statement:
>> 
>>   bool new_acmp(ptr x, ptr y) {
>>     if (!old_acmp(x, y))  return false;  // x != y => false
>>     if (is_value_ptr(x))  return false;  // x.is_value => false
>>     return true;
>>   }
>>   bool old_acmp(ptr x, ptr y) {
>>     return x == y;  // cmpq etc.
>>   }
>> 
> I'd rather rewrite to:
> 
>  bool new_acmp(ptr x, ptr y) {
>    if (!old_acmp(x, y))  return false;  // x != y => false
>    return x == null || !is_value_ptr(x);  // null == null => true
>  }
> 
> and the following will get us the required 0 and 1 (for the non-null case):
> 
> mov    0x8(%rcx),%r10d  // load class ptr
> shr     $0x3,%r10
> and    $0x1,%r10            // here r10 is 1 if value, 0 if not value
> xor     $0x1,%r10            // here r10 false if value, true if not value.
> 
> Checking the currently generated code, I split it into traces and counted the asm instructions, and did the same count for the suggested code generation:
> (unconditional branches are not counted)
> 
> case x==null & y==null:
>        current code:  3 ALU ops + 2 branches
>   suggested code:  3 ALU ops + 2 branches
> 
> case x==null & y!=null:
>        current code:  3 ALU ops + 2 branches
>   suggested code:  2 ALU ops + 1 branch
> 
> case x==y:
>        current code:  6 ALU ops + 2 branches + 1 load from memory
>   suggested code:  5 ALU ops + 2 branches + 1 load from memory
> 
> case x!=y:
>        current code:  6 ALU ops + 2 branches + 1 load from memory
>   suggested code:  2 ALU ops + 1 branch
> 
> In some cases the acmp cost is the same as in the pre-Valhalla world (2 ALU ops + 1 branch).
> 
> The most important gain here is not the rather small decrease in ALU instructions, which honestly won't be visible even in microbenchmarks, but the fact that in the x!=y case we eliminate a load from memory, which even on an L1 cache hit costs as much as all the other instructions, and much more on a cache miss.
> 
> The final suggested code (simplified):
> 
>     cmp rx,ry
>     je equal
>     xor eax,eax
>     jmp end
> 
> equal:
>     test rx,rx
>     je is_null
>     mov klass ptr from memory to eax
>     shr 3,eax
>     and 1,eax
>     xor  1,eax
>     jmp end
> 
> is_null:
>    mov 1,eax
>    jmp end
> 
> end:
>   // eax contains result
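Modeled in plain C, the suggested semantics look like this (a sketch: the `klass_word` field and the use of bit 3 as the Q-type bit mirror the `shr $0x3` / `and $0x1` idiom above, but the header layout is an assumption for illustration, not the real VM's):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define Q_TYPE_BIT 0x8u  /* assumed: bit 3 of the class word marks a value class */

struct obj { uint32_t klass_word; };

static bool is_value_ptr(const struct obj *x) {
    return (x->klass_word & Q_TYPE_BIT) != 0;
}

/* Same branch structure as the asm above:
 *   x != y          -> false
 *   x == y == null  -> true
 *   x == y != null  -> true iff x is not a value object */
static bool new_acmp(const struct obj *x, const struct obj *y) {
    if (x != y) return false;
    if (x == NULL) return true;
    return !is_value_ptr(x);
}
```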


acmp code which falls through on "not equal" and takes a jump on "equal":

    cmp rx,ry
    jne 1f  # not equal

    test rx,rx
    jz TAKEN  # null == null is "equal"

    testb (rx + byte_in_header), 0x8  # test against bit 3
    jz TAKEN  # if bit is clear, go to TAKEN ("equal")

1f:
   // FALLTHROUGH ("not equal")

acmp code which falls through on "equal" and takes a jump on "not equal":

    cmp rx,ry
    jne TAKEN  # not equal

    test rx,rx
    jz 2f  # null == null is "equal"

    testb (rx + byte_in_header), 0x8  # test against bit 3
    jnz TAKEN  # if bit is set, go to TAKEN ("not equal")

2f:
   // FALLTHROUGH ("equal")
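Rendered as C with `goto` standing in for the jumps, the fall-through-on-"equal" shape looks like this (same assumed header layout as before; note that null compared with itself must yield "equal" under acmp semantics, so the null test skips the header probe):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define Q_TYPE_BIT 0x8u  /* assumed value-class bit, matching the testb against 0x8 */

struct obj { uint32_t klass_word; };

/* Falls through to "equal"; every jump to `taken` means "not equal". */
static bool acmp_fallthrough_equal(const struct obj *x, const struct obj *y) {
    if (x != y) goto taken;                      /* jne TAKEN: not equal */
    if (x == NULL) goto fallthrough;             /* null == null: equal */
    if (x->klass_word & Q_TYPE_BIT) goto taken;  /* value object: not equal */
fallthrough:
    return true;   /* FALLTHROUGH ("equal") */
taken:
    return false;  /* TAKEN ("not equal") */
}
```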



