optimizing acmp in L-World
Sergey Kuksenko
sergey.kuksenko at oracle.com
Thu Jul 26 17:03:21 UTC 2018
Sure, playing with xor is not relevant here. A simple test could be
better. I just meant to say that the first idea (old_acmp first, then
check is_value on equality) still looks better anyway. When the first
comparison comes out false we get the same acmp cost as before, and I
think that will happen in more than 50% of cases.
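
For reference, a minimal pseudocode sketch of the two orderings, reusing
is_value_ptr and swar_broadcast_bit from the mails below (not actual
HotSpot code):

// Ordering 1: classify x first, then compare. This always pays for the
// header load, even when the pointers differ.
bool acmp_classify_first(ptr x, ptr y) {
    if (x == null) return y == null;
    ptr x1 = x | swar_broadcast_bit(is_value_ptr(x)); // poison value oops
    return old_acmp(x1, y);
}

// Ordering 2: compare first, classify only on equality. The common
// not-equal case costs exactly what acmp cost before Valhalla.
bool acmp_compare_first(ptr x, ptr y) {
    if (!old_acmp(x, y)) return false;
    return x != null && !is_value_ptr(x);
}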
On 07/25/2018 08:45 PM, John Rose wrote:
> Good analysis. I like the point about avoiding the klass load.
>
> If we were to hoist the value-bit into the oop itself, I think the analysis
> would go the other way, because the header load wouldn't happen and
> the null check would not be needed either. That could possibly be where
> we end up eventually, if we were to *also* hoist klass index info into 64-bit
> oops. We'd have to convince the GC to let us control the hoisted value-bit
> (and klass index bits), which will take time. So I think I buy your suggestion.
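>
> A speculative sketch of that endpoint (OOP_VALUE_TAG is hypothetical,
> in both name and position, and the GC would have to preserve it):
>
> bool tagged_acmp(ptr x, ptr y) {
>     // null carries no tag bit, so both the null check and the
>     // header load disappear
>     return x == y && (x & OOP_VALUE_TAG) == 0;
> }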
>
> A more important limitation on the applicability of your analysis: The JIT
> routinely rearranges the acmp code based on its own static analysis of the
> operands, whether they are, are not, may frequently be, or may occasionally
> be null and/or a value.
>
> Also, there's a corner that should be rounded off: The acmp instruction
> code is usually targeted to a test, not to the generation of a boolean,
> so it would be natural to consider your specific code idioms as generating
> a branch to one of two locations; it's only rarely a bit in a register.
>
> If you avoid materializing the bit, you can get rid of instructions that steer
> bits down to the LSB, and the xor that flips it. For an example of taking
> an answer as a branch, see MacroAssembler::check_klass_subtype and
> subroutines.
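>
> (In that style the generated code takes labels for the outcomes instead
> of producing a value; a hypothetical shape, not an actual signature:
>
> void gen_new_acmp(MacroAssembler* masm, Register x, Register y,
>                   Label* L_equal, Label* L_not_equal);
>
> with a NULL label meaning "fall through on that outcome", as in
> check_klass_subtype_fast_path.)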
>
> When you generate a boolean as a jump, there are always two or three
> versions of the code, depending on whether fallthrough is used to signal
> true or false.
>
> Specific examples are below, after your code sequence.
>
> — John
>
> On Jul 25, 2018, at 6:14 PM, Sergey Kuksenko <sergey.kuksenko at oracle.com> wrote:
>> I've measured the new Valhalla acmp code on legacy (non-value-type) code.
>> The performance regression of the acmp operation is 10%-20% compared to the pre-Valhalla world. It was measured on microbenchmarks, and it looks like it shouldn't be significantly visible on real apps except in some corner cases.
>> I'd like to suggest a small improvement which makes sense to try.
>>
>> I really like that trick:
>>> bool new_acmp(ptr x, ptr y) {
>>>     if (x == null) return y == null;
>>>     int bit = (x->class_in_header & Q_TYPE_BIT);
>>>     ptr x1 = x | swar_broadcast_bit(bit != 0);
>>>     return old_acmp(x1, y);
>>> }
>>> int swar_broadcast_bit(bool z) {
>>>     return z ? -1 : 0; // use ALU ops
>>> }
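>>
>> (swar_broadcast_bit itself needs no branch; a sketch of one expansion:
>>
>> int swar_broadcast_bit(bool z) {
>>     return -(int)z; // 0 -> 0x0, 1 -> 0xffffffff, pure ALU
>> }
>>
>> so the whole sequence stays branchless after the null check.)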
>> And right now the code that checks whether the class is a value class and mixes the bit into the pointer looks like this:
>>
>> mov 0x8(%rcx),%r10d // load class ptr
>> shr $0x3,%r10
>> and $0x1,%r10 // here r10 is 1 if value, 0 if not value
>> add %r10,%rcx // mix with value bit
>>
>>> Suppose there is a branch free technique for implementing is_value_ptr
>> With the current is_value_ptr scheme I want to suggest a branch-free technique and return to the initial idea of code generation:
>>> So the logic looks like this, and the new semantics are in
>>> the second "if" statement:
>>>
>>> bool new_acmp(ptr x, ptr y) {
>>>     if (!old_acmp(x, y)) return false; // x != y => false
>>>     if (is_value_ptr(x)) return false; // x.is_value => false
>>>     return true;
>>> }
>>> bool old_acmp(ptr x, ptr y) {
>>>     return x == y; // cmpq etc.
>>> }
>>>
>> I'd rather rewrite it to:
>>
>> bool new_acmp(ptr x, ptr y) {
>>     if (!old_acmp(x, y)) return false; // x != y => false
>>     return x != null && !is_value_ptr(x);
>> }
>>
>> and the following will get us the required 0 and 1:
>>
>> mov 0x8(%rcx),%r10d // load class ptr
>> shr $0x3,%r10
>> and $0x1,%r10 // here r10 is 1 if value, 0 if not value
>> xor $0x1,%r10 // here r10 false if value, true if not value.
>>
>> Checking the currently generated code, I split it into traces and counted the asm instructions, then did the same count for the suggested code generation
>> (unconditional branches are not counted):
>>
>> case x==null & y==null:
>> current code: 3 ALU ops + 2 branches
>> suggested code: 3 ALU ops + 2 branches
>>
>> case x==null & y!=null:
>> current code: 3 ALU ops + 2 branches
>> suggested code: 2 ALU ops + 1 branch
>>
>> case x==y:
>> current code: 6 ALU ops + 2 branches + 1 load from memory
>> suggested code: 5 ALU ops + 2 branches + 1 load from memory
>>
>> case x!=y:
>> current code: 6 ALU ops + 2 branches + 1 load from memory
>> suggested code: 2 ALU ops + 1 branch
>>
>> In some cases the acmp cost is the same as in the pre-Valhalla world (2 ALU ops + 1 branch).
>>
>> The most important gain here is not the rather small reduction in ALU instructions, which honestly won't be visible even in microbenchmarks, but the fact that in the x != y case we eliminate the load from memory, which even on an L1 cache hit costs as much as all the other instructions together, and much more on a cache miss.
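>>
>> (For scale: on recent x86 cores an L1 hit typically costs around 4-5
>> cycles, roughly the combined cost of the few single-cycle ALU ops in
>> the sequence, while a cache miss costs an order of magnitude more.)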
>>
>> The final suggested code (simplified):
>>
>> cmp rx,ry
>> je equal
>> xor eax,eax // pointers differ => result 0
>> jmp end
>>
>> equal:
>> test rx,rx
>> je is_null
>> mov klass_ptr(rx),eax // load klass ptr from the object header
>> shr 3,eax
>> and 1,eax // eax = 1 if value, 0 if not
>> xor 1,eax // eax = 0 if value, 1 if not
>> jmp end
>>
>> is_null:
>> mov 1,eax // both null => result 1
>> jmp end
>>
>> end:
>> // eax contains result
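>>
>> A quick C model of this sequence, handy for sanity-checking the four
>> cases (Q_TYPE_BIT and the klass-word layout are placeholders, not the
>> real header format):
>>
>> #include <stdbool.h>
>> #include <stddef.h>
>> #include <stdint.h>
>>
>> typedef struct { uint32_t klass_word; } obj; /* toy header */
>>
>> static bool new_acmp_model(const obj *x, const obj *y) {
>>     if (x != y) return false;          /* cmp rx,ry ; je equal */
>>     if (x == NULL) return true;        /* test rx,rx ; je is_null */
>>     uint32_t r = x->klass_word;        /* load klass ptr */
>>     r = (r >> 3) & 1;                  /* shr 3 ; and 1 => 1 if value */
>>     return (r ^ 1) != 0;               /* xor 1 => 0 if value, 1 if not */
>> }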
>
> acmp code which falls through on "not equal" and takes a jump on "equal":
>
> cmp rx,ry
> jne 1f # not equal
>
> test rx,rx
> jz 1f # not equal
>
> testb (rx + byte_in_header), 0x8 # test against bit 3
> jz TAKEN # if bit is clear, go to TAKEN ("equal")
>
> 1:
> // FALLTHROUGH ("not equal")
>
> acmp code which falls through on "equal" and takes a jump on "not equal":
>
> cmp rx,ry
> jne TAKEN # not equal
>
> test rx,rx
> jz TAKEN # not equal
>
> testb (rx + byte_in_header), 0x8 # test against bit 3
> jnz TAKEN # if bit is set, go to TAKEN ("not equal")
>
> // FALLTHROUGH ("equal")
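>
> A third version, for when the boolean really must be materialized in a
> register, can reuse the same skeleton and still skip the shr/and/xor
> steering (a sketch, not actual generated code):
>
> xor eax,eax # result = 0 ("not equal")
> cmp rx,ry
> jne 2f # not equal
> test rx,rx
> jz 1f # both null => "equal"
> testb (rx + byte_in_header), 0x8 # bit set => value => "not equal"
> jnz 2f
> 1:
> mov 1,eax # "equal"
> 2: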
>