optimizing acmp in L-World
Tobias Hartmann
tobias.hartmann at oracle.com
Fri Aug 17 13:19:46 UTC 2018
Hi Sergey,
thanks for this analysis and sorry for the delay. Please see comments below.
On 26.07.2018 03:14, Sergey Kuksenko wrote:
> I've measured new Valhalla acmp code on legacy (non-value types code).
> Performance regression of acmp operation is 10%-20% comparing to before valhalla world. It was
> measured on microbenchmarks and looks like it shouldn't be significantly visible on real apps except
> some corner cases.
Yes, I don't think that this is measurable for real applications, especially because real
applications usually don't just check for equality but also do something with these objects. In most
cases, C2 will either be able to figure out at compile time that one of the operands is *not* a
value type (and therefore emit the old acmp) or that one operand *is* a value type (and therefore emit a
double null check, because a value type can only ever be equal to null).
In addition, if C2 emits a null check or klass load, it will re-use the knowledge/result throughout
the scope of the compilation. For example, although the additional null check will slow things down,
C2 can omit another null check that would usually follow in one of the branches.
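In Java terms, the new acmp semantics that C2 specializes on can be sketched as follows. This is only an illustration, not the real VM check: ValueMarker and isValueType are hypothetical stand-ins for the "is value type" bit the VM reads from the object's klass pointer.

```java
// Sketch of L-World acmp semantics (hypothetical helpers, not VM code).
interface ValueMarker {}

class AcmpSketch {
    // Stand-in for the runtime klass-bit test.
    static boolean isValueType(Object o) {
        return o instanceof ValueMarker;
    }

    // New acmp: true only for equal references that are not value types.
    // null == null stays true, since null is not a value type.
    static boolean newAcmp(Object a, Object b) {
        return a == b && (a == null || !isValueType(a));
    }
}
```

If C2 can prove at compile time which side of the `||` applies, the check collapses to either the old acmp or the double null check described above.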
> And right now the code that checks whether the class is a value class and mixes in the result looks like:
>
> mov 0x8(%rcx),%r10d // load class ptr
> shr $0x3,%r10
> and $0x1,%r10 // here r10 is 1 if value, 0 if not value
> add %r10,%rcx // mix with value bit
Yes, the C2-generated code for the old acmp 'return a == b;' looks like this:
xor %r10d,%r10d
mov $0x1,%eax
cmp %rdx,%rcx
cmovne %r10d,%eax
Whereas new acmp looks like this:
test %rcx,%rcx
je is_null
mov 0x8(%rcx),%r10d
shr $0x3,%r10
and $0x1,%r10
add %r10,%rcx
is_null:
xor %r11d,%r11d
mov $0x1,%eax
cmp %rdx,%rcx
cmovne %r11d,%eax
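The extra instructions implement a pointer "perturbation": the value bit extracted from the klass word is added to the operand, so a value object never compares equal to anything, not even to itself. A sketch of the idea, operating on plain long values rather than the real compressed klass word (xIsValue is a hypothetical parameter modeling the extracted bit):

```java
class PerturbSketch {
    // Sketch of the pointer perturbation emitted by new acmp.
    // xIsValue models the bit extracted from x's klass word,
    // i.e. (klass >> 3) & 1 in the generated code above.
    static boolean perturbedAcmp(long x, long y, boolean xIsValue) {
        if (x != 0) {                // the "je is_null" branch: skip null
            x += xIsValue ? 1 : 0;   // mix the value bit into the pointer
        }
        return x == y;               // plain pointer comparison
    }
}
```

Adding the bit makes the final compare fail for value objects while leaving regular objects and null untouched, which is why no extra branch on the bit itself is needed.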
> I'd rather rewrite to:
>
> bool new_acmp(ptr x, ptr y) {
>     if (!old_acmp(x, y)) return false; // x != y => false
>     return x == null || !is_value_ptr(x); // null == null => true
> }
I gave that a try (see below).
> and the following will get us the required 0 and 1 from this:
>
> mov 0x8(%rcx),%r10d // load class ptr
> shr $0x3,%r10
> and $0x1,%r10 // here r10 is 1 if value, 0 if not value
> xor $0x1,%r10 // here r10 false if value, true if not value.
As John already pointed out, this only helps if we want the result in a register. Due to aggressive
inlining in C2, this is rarely the case; usually we need an explicit cmp/jmp to branch to the
corresponding target blocks.
It's also not possible in the interpreter, where we always branch explicitly.
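To illustrate why the result rarely lives in a register: after inlining, the comparison typically feeds a branch directly, so no 0/1 value is ever materialized (this is a generic example, not from the benchmark):

```java
class BranchUse {
    // Typical inlined use of acmp: the comparison result is consumed
    // by a branch (cmp + jne), so no boolean is materialized in a
    // register and the xor/cmov sequence never appears.
    static int countMatches(Object[] arr, Object key) {
        int n = 0;
        for (Object o : arr) {
            if (o == key) {   // acmp feeding the branch directly
                n++;
            }
        }
        return n;
    }
}
```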
> Checking the currently generated code, I split it into traces and counted the asm instructions, as well as the asm
> instructions that would be new in the suggested code generation:
> (unconditional branches are not counted)
>
> case x==null & y==null:
> current code: 3 ALU ops + 2 branches
> suggested code: 3 ALU ops + 2 branches
>
> case x==null & y!=null:
> current code: 3 ALU ops + 2 branches
> suggested code: 2 ALU ops + 1 branches
>
> case x==y:
> current code: 6 ALU ops + 2 branches + 1 load from memory
> suggested code: 5 ALU ops + 2 branches + 1 load from memory
>
> case x!=y:
> current code: 6 ALU ops + 2 branches + 1 load from memory
> suggested code: 2 ALU ops + 1 branch
>
> In some cases acmp cost is the same as in before valhalla world (2 ALU ops + 1 branch)
>
> The most important fact here is not the rather small decrease in ALU instructions, which honestly won't
> be visible even in microbenchmarks, but the fact that in the x != y case we eliminate the load from memory, which
> even on an L1 cache hit costs as much as all the other instructions, and much more on a cache miss.
Yes, I think avoiding the load from memory is the main benefit here.
> The final suggested code (simplified):
>
> cmp rx,ry
> je equal
> xor eax,eax
> jmp end
>
> equal:
> test rx,rx
> je is_null
> mov klass ptr from memory to eax
> shr 3,eax
> and 1,eax
> xor 1,eax
> jmp end
>
> is_null:
> mov 1,eax
> jmp end
>
> end:
> // eax contains result
I've implemented this (-XX:+UseOldAcmp):
http://cr.openjdk.java.net/~thartmann/valhalla/lworld/acmp_optimization/webrev.00/
The generated code then looks like this:
cmp %rdx,%rcx
jne is_ne
test %rcx,%rcx
je is_null
mov 0x8(%rcx),%r11d
mov %r11,%r10
shr $0x3,%r10
test $0x1,%r10
jne is_ne
is_null:
mov $0x1,%eax
jmp end
is_ne:
xor %eax,%eax
end:
I'm using this benchmark for evaluation:
http://cr.openjdk.java.net/~thartmann/valhalla/lworld/acmp_optimization/webrev.00/NewAcmpBenchmark.java
Unfortunately, the results are highly dependent on profiling information and the resulting code
layout. I've therefore executed the benchmark with -XX:-ProfileInterpreter. Here are the results:
With -XX:+UseOldAcmp
Benchmark Mode Cnt Score Error Units
NewAcmpBenchmark.newCmpEqAll thrpt 5 44.109 ± 0.045 ops/us
NewAcmpBenchmark.newCmpEq_null_null thrpt 5 112.509 ± 0.073 ops/us
NewAcmpBenchmark.newCmpEq_null_o1 thrpt 5 149.950 ± 0.124 ops/us
NewAcmpBenchmark.newCmpEq_o1_null thrpt 5 149.960 ± 0.130 ops/us
NewAcmpBenchmark.newCmpEq_o1_o1 thrpt 5 132.312 ± 0.171 ops/us
NewAcmpBenchmark.newCmpEq_o1_o2 thrpt 5 140.188 ± 0.056 ops/us
NewAcmpBenchmark.newCmpEq_o2_o1 thrpt 5 140.119 ± 0.141 ops/us
With -XX:-UseOldAcmp
Benchmark Mode Cnt Score Error Units
NewAcmpBenchmark.newCmpEqAll thrpt 5 41.727 ± 0.022 ops/us
NewAcmpBenchmark.newCmpEq_null_null thrpt 5 124.938 ± 0.088 ops/us
NewAcmpBenchmark.newCmpEq_null_o1 thrpt 5 139.842 ± 0.108 ops/us
NewAcmpBenchmark.newCmpEq_o1_null thrpt 5 149.930 ± 0.105 ops/us
NewAcmpBenchmark.newCmpEq_o1_o1 thrpt 5 138.952 ± 0.065 ops/us
NewAcmpBenchmark.newCmpEq_o1_o2 thrpt 5 131.435 ± 0.119 ops/us
NewAcmpBenchmark.newCmpEq_o2_o1 thrpt 5 131.307 ± 0.140 ops/us
As expected, your version is better if a != b. In the a == b case, the current implementation is
better. If we are assuming that != is more common than == (like in the newCmpEqAll case), your
version is better.
Which acmp microbenchmarks are you using? I couldn't find any in the set of benchmarks that you've
pushed.
Thanks,
Tobias