optimizing acmp in L-World
Sergey Kuksenko
sergey.kuksenko at oracle.com
Tue Aug 21 23:21:45 UTC 2018
Hi Tobias,
On 08/17/2018 06:19 AM, Tobias Hartmann wrote:
>
> As John already pointed out, this only helps if we want the result in a register. Due to aggressive
> inlining in C2, this is rarely the case but usually we need an explicit cmp/jmp to branch to the
> corresponding target blocks.
>
> It's also not possible in the interpreter where we always explicitly branch.
I want to say that in the branching case the code can be even simpler (see below).
> I've implemented this (-XX:+UseOldAcmp):
> http://cr.openjdk.java.net/~thartmann/valhalla/lworld/acmp_optimization/webrev.00/
>
> The generated code then looks like this:
>
> cmp %rdx,%rcx
> jne is_ne
> test %rcx,%rcx
> je is_null
> mov 0x8(%rcx),%r11d
> mov %r11,%r10
> shr $0x3,%r10
> test $0x1,%r10
> jne is_ne
> is_null:
> mov $0x1,%eax
> jmp end
> is_ne:
> xor %eax,%eax
> end:
In the branching case, I'd rather suggest replacing
mov 0x8(%rcx),%r11d
mov %r11,%r10
shr $0x3,%r10
test $0x1,%r10
jne...
with
test $0x8,0x8(%rcx)
jne...
This saves registers, which may be noticeable in highly inlined code.
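
The two sequences test the same bit: shifting right by 3 and testing bit 0
is the same as testing against the mask 0x8 directly. A minimal sketch of
that equivalence in Java (the class and method names are only for
illustration, and the word here is an arbitrary 32-bit value, not a real
header field):

```java
import java.util.Random;

class FlagBitCheck {
    // Mirrors "shr $0x3" followed by "test $0x1": move bit 3 down to bit 0, then test it.
    static boolean viaShift(int word) {
        return ((word >>> 3) & 0x1) != 0;
    }

    // Mirrors a single test against the mask 0x8; no scratch value is needed.
    static boolean viaMask(int word) {
        return (word & 0x8) != 0;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        // Compare both forms on a large number of arbitrary 32-bit words.
        for (int i = 0; i < 1_000_000; i++) {
            int word = rnd.nextInt();
            if (viaShift(word) != viaMask(word)) {
                throw new AssertionError("mismatch for 0x" + Integer.toHexString(word));
            }
        }
        System.out.println("shift-and-test and mask test agree");
    }
}
```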
>
> I'm using this benchmark for evaluation:
> http://cr.openjdk.java.net/~thartmann/valhalla/lworld/acmp_optimization/webrev.00/NewAcmpBenchmark.java
>
> Unfortunately, the results are highly dependent on profiling information and the resulting layout of
> code. I've therefore executed the benchmark with -XX:-ProfileInterpreter. Here are the results:
>
> With -XX:+UseOldAcmp
> Benchmark Mode Cnt Score Error Units
> NewAcmpBenchmark.newCmpEqAll thrpt 5 44.109 ± 0.045 ops/us
> NewAcmpBenchmark.newCmpEq_null_null thrpt 5 112.509 ± 0.073 ops/us
> NewAcmpBenchmark.newCmpEq_null_o1 thrpt 5 149.950 ± 0.124 ops/us
> NewAcmpBenchmark.newCmpEq_o1_null thrpt 5 149.960 ± 0.130 ops/us
> NewAcmpBenchmark.newCmpEq_o1_o1 thrpt 5 132.312 ± 0.171 ops/us
> NewAcmpBenchmark.newCmpEq_o1_o2 thrpt 5 140.188 ± 0.056 ops/us
> NewAcmpBenchmark.newCmpEq_o2_o1 thrpt 5 140.119 ± 0.141 ops/us
>
> With -XX:-UseOldAcmp
> Benchmark Mode Cnt Score Error Units
> NewAcmpBenchmark.newCmpEqAll thrpt 5 41.727 ± 0.022 ops/us
> NewAcmpBenchmark.newCmpEq_null_null thrpt 5 124.938 ± 0.088 ops/us
> NewAcmpBenchmark.newCmpEq_null_o1 thrpt 5 139.842 ± 0.108 ops/us
> NewAcmpBenchmark.newCmpEq_o1_null thrpt 5 149.930 ± 0.105 ops/us
> NewAcmpBenchmark.newCmpEq_o1_o1 thrpt 5 138.952 ± 0.065 ops/us
> NewAcmpBenchmark.newCmpEq_o1_o2 thrpt 5 131.435 ± 0.119 ops/us
> NewAcmpBenchmark.newCmpEq_o2_o1 thrpt 5 131.307 ± 0.140 ops/us
>
> As expected, your version is better if a != b. In the a == b case, the current implementation is
> better. If we are assuming that != is more common than == (like in the newCmpEqAll case), your
> version is better.
I need more time to check this.
I just want to give you a couple of suggestions about the microbenchmarks.
- The double DONT_INLINE is not required; having it only on "cmpEq" is
enough, and you'll get less noisy invocation overhead.
- When dealing with small operations, it's better to switch the benchmark
mode to average time in nanoseconds; that lets you quickly notice if
something is wrong.
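
To illustrate the second point: a throughput score in ops/us corresponds to
an average time of 1000 / score nanoseconds per operation, which is easier
to sanity-check against the expected cost of a few instructions plus a call.
A minimal sketch of that conversion, using three scores from the
-XX:+UseOldAcmp table quoted above (the class and method names are made up
for illustration):

```java
class ThroughputToLatency {
    // Converts a throughput score in ops/us into an average time in ns/op.
    // 1 us = 1000 ns, so ns/op = 1000 / (ops/us).
    static double nanosPerOp(double opsPerMicro) {
        return 1_000.0 / opsPerMicro;
    }

    public static void main(String[] args) {
        // newCmpEqAll, newCmpEq_null_null, newCmpEq_null_o1 with +UseOldAcmp.
        double[] scores = {44.109, 112.509, 149.950};
        for (double score : scores) {
            System.out.printf("%.3f ops/us = %.2f ns/op%n", score, nanosPerOp(score));
        }
    }
}
```

In JMH the same view comes directly from @BenchmarkMode(Mode.AverageTime)
together with @OutputTimeUnit(TimeUnit.NANOSECONDS).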
>
> Which acmp microbenchmarks are you using? I couldn't find any in the set of benchmarks that you've
> pushed.
I am still thinking about how to build a representative set of acmp
micros that properly covers the inlined case, which is the most important.
I'll push them when they are ready.