optimizing acmp in L-World
Sergey Kuksenko
sergey.kuksenko at oracle.com
Thu Jul 26 17:03:21 UTC 2018
Sure, playing with xor is not relevant here. A simple test could be
better. I just meant to say that the first idea (old_acmp first, then
check is_value on equality) still looks better anyway. When the first
comparison comes out false we get the same acmp cost as before, and I
think that will happen in more than 50% of cases.
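
For reference, a minimal pseudocode sketch of the two orderings, reusing
is_value_ptr and swar_broadcast_bit from the mails below (not actual
HotSpot code):

// Ordering 1: classify x first, then compare. This always pays for the
// header load, even when the pointers differ.
bool acmp_classify_first(ptr x, ptr y) {
    if (x == null) return y == null;
    ptr x1 = x | swar_broadcast_bit(is_value_ptr(x)); // poison value oops
    return old_acmp(x1, y);
}

// Ordering 2: compare first, classify only on equality. The common
// not-equal case costs exactly what acmp cost before Valhalla.
bool acmp_compare_first(ptr x, ptr y) {
    if (!old_acmp(x, y)) return false;
    return x != null && !is_value_ptr(x);
}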
On 07/25/2018 08:45 PM, John Rose wrote:
> Good analysis. I like the point about avoiding the klass load.
>
> If we were to hoist the value-bit into the oop itself, I think the analysis
> would go the other way, because the header load wouldn't happen and
> the null check would not be needed either. That could possibly be where
> we end up eventually, if we were to *also* hoist klass index info into 64-bit
> oops. We'd have to convince the GC to let us control the hoisted value-bit
> (and klass index bits), which will take time. So I think I buy your suggestion.
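>
> A speculative sketch of that endpoint (OOP_VALUE_TAG is hypothetical,
> in both name and position, and the GC would have to preserve it):
>
> bool tagged_acmp(ptr x, ptr y) {
>     // null carries no tag bit, so both the null check and the
>     // header load disappear
>     return x == y && (x & OOP_VALUE_TAG) == 0;
> }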
>
> A more important limitation on the applicability of your analysis: The JIT
> routinely rearranges the acmp code based on its own static analysis of the
> operands, whether they are, are not, may frequently be, or may occasionally
> be null and/or a value.
>
> Also, there's a corner that should be rounded off: The acmp instruction
> code is usually targeted to a test, not to the generation of a boolean,
> so it would be natural to consider your specific code idioms as generating
> a branch to one of two locations; it's only rarely a bit in a register.
>
> If you avoid materializing the bit, you can get rid of instructions that steer
> bits down to the LSB, and the xor that flips it. For an example of taking
> an answer as a branch, see MacroAssembler::check_klass_subtype and
> subroutines.
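>
> (In that style the generated code takes labels for the outcomes instead
> of producing a value; a hypothetical shape, not an actual signature:
>
> void gen_new_acmp(MacroAssembler* masm, Register x, Register y,
>                   Label* L_equal, Label* L_not_equal);
>
> with a NULL label meaning "fall through on that outcome", as in
> check_klass_subtype_fast_path.)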
>
> When you generate a boolean as a jump, there are always two or three
> versions of the code, depending on whether fallthrough is used to signal
> true or false.
>
> Specific examples are below, after your code sequence.
>
> — John
>
> On Jul 25, 2018, at 6:14 PM, Sergey Kuksenko <sergey.kuksenko at oracle.com> wrote:
>> I've measured the new Valhalla acmp code on legacy (non-value-type) code.
>> The performance regression of the acmp operation is 10%-20% compared to the pre-Valhalla world. It was measured on microbenchmarks, and it looks like it shouldn't be significantly visible on real apps except in some corner cases.
>> I'd like to suggest a small improvement which makes sense to try.
>>
>> I really like that trick:
>>> bool new_acmp(ptr x, ptr y) {
>>>     if (x == null) return y == null;
>>>     int bit = (x->class_in_header & Q_TYPE_BIT);
>>>     ptr x1 = x | swar_broadcast_bit(bit != 0);
>>>     return old_acmp(x1, y);
>>> }
>>> int swar_broadcast_bit(bool z) {
>>>     return z ? -1 : 0; // use ALU ops
>>> }
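>>
>> (swar_broadcast_bit itself needs no branch; a sketch of one expansion:
>>
>> int swar_broadcast_bit(bool z) {
>>     return -(int)z; // 0 -> 0x0, 1 -> 0xffffffff, pure ALU
>> }
>>
>> so the whole sequence stays branchless after the null check.)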
>> And right now the code that checks whether the class is a value class and mixes the bit into the pointer looks like this:
>>
>> mov 0x8(%rcx),%r10d // load class ptr
>> shr $0x3,%r10
>> and $0x1,%r10 // here r10 is 1 if value, 0 if not value
>> add %r10,%rcx // mix with value bit
>>
>>> Suppose there is a branch free technique for implementing is_value_ptr
>> With the current is_value_ptr scheme I want to suggest a branch-free technique and return to the initial idea of code generation:
>>> So the logic looks like this, and the new semantics are in
>>> the second "if" statement:
>>>
>>> bool new_acmp(ptr x, ptr y) {
>>>     if (!old_acmp(x, y)) return false; // x != y => false
>>>     if (is_value_ptr(x)) return false; // x.is_value => false
>>>     return true;
>>> }
>>> bool old_acmp(ptr x, ptr y) {
>>>     return x == y; // cmpq etc.
>>> }
>>>
>> I'd rather rewrite it to:
>>
>> bool new_acmp(ptr x, ptr y) {
>>     if (!old_acmp(x, y)) return false; // x != y => false
>>     return x != null && !is_value_ptr(x);
>> }
>>
>> and the following will get us the required 0 and 1:
>>
>> mov 0x8(%rcx),%r10d // load class ptr
>> shr $0x3,%r10
>> and $0x1,%r10 // here r10 is 1 if value, 0 if not value
>> xor $0x1,%r10 // here r10 false if value, true if not value.
>>
>> Checking the currently generated code, I split it into traces and counted the asm instructions, then did the same count for the suggested code generation
>> (unconditional branches are not counted):
>>
>> case x==null & y==null:
>> current code: 3 ALU ops + 2 branches
>> suggested code: 3 ALU ops + 2 branches
>>
>> case x==null & y!=null:
>> current code: 3 ALU ops + 2 branches
>> suggested code: 2 ALU ops + 1 branch
>>
>> case x==y:
>> current code: 6 ALU ops + 2 branches + 1 load from memory
>> suggested code: 5 ALU ops + 2 branches + 1 load from memory
>>
>> case x!=y:
>> current code: 6 ALU ops + 2 branches + 1 load from memory
>> suggested code: 2 ALU ops + 1 branch
>>
>> In some cases the acmp cost is the same as in the pre-Valhalla world (2 ALU ops + 1 branch).
>>
>> The most important gain here is not the rather small reduction in ALU instructions, which honestly won't be visible even in microbenchmarks, but the fact that in the x != y case we eliminate the load from memory, which even on an L1 cache hit costs as much as all the other instructions together, and much more on a cache miss.
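>>
>> (For scale: on recent x86 cores an L1 hit typically costs around 4-5
>> cycles, roughly the combined cost of the few single-cycle ALU ops in
>> the sequence, while a cache miss costs an order of magnitude more.)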
>>
>> The final suggested code (simplified):
>>
>> cmp rx,ry
>> je equal
>> xor eax,eax // pointers differ => result 0
>> jmp end
>>
>> equal:
>> test rx,rx
>> je is_null
>> mov klass_ptr(rx),eax // load klass ptr from the object header
>> shr 3,eax
>> and 1,eax // eax = 1 if value, 0 if not
>> xor 1,eax // eax = 0 if value, 1 if not
>> jmp end
>>
>> is_null:
>> mov 1,eax // both null => result 1
>> jmp end
>>
>> end:
>> // eax contains result
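>>
>> A quick C model of this sequence, handy for sanity-checking the four
>> cases (Q_TYPE_BIT and the klass-word layout are placeholders, not the
>> real header format):
>>
>> #include <stdbool.h>
>> #include <stddef.h>
>> #include <stdint.h>
>>
>> typedef struct { uint32_t klass_word; } obj; /* toy header */
>>
>> static bool new_acmp_model(const obj *x, const obj *y) {
>>     if (x != y) return false;          /* cmp rx,ry ; je equal */
>>     if (x == NULL) return true;        /* test rx,rx ; je is_null */
>>     uint32_t r = x->klass_word;        /* load klass ptr */
>>     r = (r >> 3) & 1;                  /* shr 3 ; and 1 => 1 if value */
>>     return (r ^ 1) != 0;               /* xor 1 => 0 if value, 1 if not */
>> }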
>
> acmp code which falls through on "not equal" and takes a jump on "equal":
>
> cmp rx,ry
> jne 1f # not equal
>
> test rx,rx
> jz 1f # not equal
>
> testb (rx + byte_in_header), 0x8 # test against bit 3
> jz TAKEN # if bit is clear, go to TAKEN ("equal")
>
> 1:
> // FALLTHROUGH ("not equal")
>
> acmp code which falls through on "equal" and takes a jump on "not equal":
>
> cmp rx,ry
> jne TAKEN # not equal
>
> test rx,rx
> jz TAKEN # not equal
>
> testb (rx + byte_in_header), 0x8 # test against bit 3
> jnz TAKEN # if bit is set, go to TAKEN ("not equal")
>
> // FALLTHROUGH ("equal")
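>
> A third version, for when the boolean really must be materialized in a
> register, can reuse the same skeleton and still skip the shr/and/xor
> steering (a sketch, not actual generated code):
>
> xor eax,eax # result = 0 ("not equal")
> cmp rx,ry
> jne 2f # not equal
> test rx,rx
> jz 1f # both null => "equal"
> testb (rx + byte_in_header), 0x8 # bit set => value => "not equal"
> jnz 2f
> 1:
> mov 1,eax # "equal"
> 2: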
>