optimizing acmp in L-World
Sergey Kuksenko
sergey.kuksenko at oracle.com
Thu Jul 26 01:14:51 UTC 2018
I've measured new Valhalla acmp code on legacy (non-value types code).
Performance regression of acmp operation is 10%-20% comparing to before
valhalla world. It was measured on microbenchmarks and looks like it
shouldn't be significantly visible on real apps except some corner cases.
I to suggest to a small improvement which make sense to try.
I really that trick:
> bool new_acmp(ptr x, ptr y) {
> if (x == null) return y == null;
> int bit = (x->class_in_header & Q_TYPE_BIT);
> ptr x1 = x | swar_broadcast_bit(bit != 0);
> return old_acmp(x1, y);
> }
> int swar_broadcast_bit(bool z) {
> return z ? -1 : 0; // use ALU ops
> }
And right now the code checking and mixing if class is value looks like:
mov 0x8(%rcx),%r10d // load class ptr
shr $0x3,%r10
and $0x1,%r10 // here r10 is 1 if value, 0 if not value
add %r10,%rcx // mix with value bit
> Suppose there is a branch free technique for implementing is_value_ptr
With the current is_value_ptr scheme I want to suggest branch free
technique and return to initial idea of codegeneration
> So the logic looks like this, and the new semantics are in
> the second "if" statement:
>
> bool new_acmp(ptr x, ptr y) {
> if (!old_acmp(x, y)) return false; // x != y => false
> if (is_value_ptr(x)) return false; // x.is_value => false
> return true;
> }
> bool old_acmp(ptr x, ptr y) {
> return x == y; // cmpq etc.
> }
>
I'd rather rewrite to:
bool new_acmp(ptr x, ptr y) {
if (!old_acmp(x, y)) return false; // x != y => false
return x!=null && !is_value_ptr(x);
}
and the following will get us required 0 and 1 from this
mov 0x8(%rcx),%r10d // load class ptr
shr $0x3,%r10
and $0x1,%r10 // here r10 is 1 if value, 0 if not value
xor $0x1,%r10 // here r10 false if value, true if not value.
Checking currently generated code I'd split it into traces and count asm
instructions, and count asm instructions which could new in suggested
codegeneration:
(doesn't count unconditional branches)
case x==null & y==null:
current code: 3 ALU ops + 2 branches
suggested code: 3 ALU ops + 2 branches
case x==null & y!=null:
current code: 3 ALU ops + 2 branches
suggested code: 2 ALU ops + 1 branches
case x==y:
current code: 6 ALU ops + 2 branches + 1 load from memory
suggested code: 5 ALU ops + 2 branches + 1 load from memory
case x!=y:
current code: 6 ALU ops + 2 branches + 1 load from memory
suggested code: 2 ALU ops + 1 branch
In some cases acmp cost is the same as in before valhalla world (2 ALU
ops + 1 branch)
The most important fact here is not a quite small decrease of ALU
instructions, which honestly won't be visible even on microbenchmarks,
but the fact in case of x!=y we eliminate load from memory which even in
hit to L1 cache will cost as all other instructions and much more in
case of cache miss.
The final suggested code (simplified):
cmp rx,ry
je equal
xor eax,eax
jmp end
equal:
test rx,rx
je is_null
mov klass ptr from memory to eax
shr 3,eax
and 1,eax
xor 1,eax
jmp end
is_null:
mov 1,eax
jmp end
end:
// eax contains result
More information about the valhalla-dev
mailing list