optimizing acmp in L-World

Sergey Kuksenko sergey.kuksenko at oracle.com
Tue Aug 21 23:21:45 UTC 2018


Hi Tobias,


On 08/17/2018 06:19 AM, Tobias Hartmann wrote:
>
> As John already pointed out, this only helps if we want the result in a register. Due to aggressive
> inlining in C2, this is rarely the case but usually we need an explicit cmp/jmp to branch to the
> corresponding target blocks.
>
> It's also not possible in the interpreter where we always explicitly branch.
I want to say that in the branching case the code can be even simpler (see below).
> I've implemented this (-XX:+UseOldAcmp):
> http://cr.openjdk.java.net/~thartmann/valhalla/lworld/acmp_optimization/webrev.00/
>
> The generated code then looks like this:
>
>   cmp    %rdx,%rcx
>   jne    is_ne
>   test   %rcx,%rcx
>   je     is_null
>   mov    0x8(%rcx),%r11d
>   mov    %r11,%r10
>   shr    $0x3,%r10
>   test   $0x1,%r10
>   jne    is_ne
> is_null:
>   mov    $0x1,%eax
>   jmp    end
> is_ne:
>   xor    %eax,%eax
> end:

In the branching case I'd rather suggest replacing

  mov    0x8(%rcx),%r11d
  mov    %r11,%r10
  shr    $0x3,%r10
  test   $0x1,%r10
  jne...

with

  testb   $0x8,0x8(%rcx)
  jne...

This frees two registers (%r10 and %r11), which may be noticeable in highly inlined code.
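
To illustrate why the two sequences branch the same way, here is a minimal plain-Java sketch. It assumes only what the assembly shows: a 32-bit word is loaded from 0x8(%rcx) and bit 3 of it decides the branch. The class and method names are hypothetical and are not taken from the webrev.

```java
// Hypothetical model: 'word' stands for the 32-bit value loaded from 0x8(%rcx).
final class AcmpBitTest {

    // Original sequence: load, copy, shift right by 3, test the lowest bit.
    static boolean shiftThenTest(int word) {
        long tmp = word & 0xFFFFFFFFL; // a 32-bit load zero-extends into the 64-bit register
        tmp >>>= 3;                    // shr $0x3
        return (tmp & 0x1) != 0;       // test $0x1 ; jne taken when the bit is set
    }

    // Suggested sequence: test bit 3 in place with the mask 0x8, no temporaries.
    static boolean maskTest(int word) {
        return (word & 0x8) != 0;
    }

    public static void main(String[] args) {
        int[] samples = {0, 0x7, 0x8, 0xF, 0x10, -1, Integer.MIN_VALUE, 0x12345678};
        for (int w : samples) {
            if (shiftThenTest(w) != maskTest(w)) {
                throw new AssertionError("mismatch for word 0x" + Integer.toHexString(w));
            }
        }
        System.out.println("both forms agree on all samples");
    }
}
```

Since only the flag result is consumed by the following jne, the single test against memory is enough and nothing has to be kept in a register.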

>
> I'm using this benchmark for evaluation:
> http://cr.openjdk.java.net/~thartmann/valhalla/lworld/acmp_optimization/webrev.00/NewAcmpBenchmark.java
>
> Unfortunately, the results are highly dependent on profiling information and the resulting layout of
> code. I've therefore executed the benchmark with -XX:-ProfileInterpreter. Here are the results:
>
> With -XX:+UseOldAcmp
> Benchmark                             Mode  Cnt    Score   Error   Units
> NewAcmpBenchmark.newCmpEqAll         thrpt    5   44.109 ± 0.045  ops/us
> NewAcmpBenchmark.newCmpEq_null_null  thrpt    5  112.509 ± 0.073  ops/us
> NewAcmpBenchmark.newCmpEq_null_o1    thrpt    5  149.950 ± 0.124  ops/us
> NewAcmpBenchmark.newCmpEq_o1_null    thrpt    5  149.960 ± 0.130  ops/us
> NewAcmpBenchmark.newCmpEq_o1_o1      thrpt    5  132.312 ± 0.171  ops/us
> NewAcmpBenchmark.newCmpEq_o1_o2      thrpt    5  140.188 ± 0.056  ops/us
> NewAcmpBenchmark.newCmpEq_o2_o1      thrpt    5  140.119 ± 0.141  ops/us
>
> With -XX:-UseOldAcmp
> Benchmark                             Mode  Cnt    Score   Error   Units
> NewAcmpBenchmark.newCmpEqAll         thrpt    5   41.727 ± 0.022  ops/us
> NewAcmpBenchmark.newCmpEq_null_null  thrpt    5  124.938 ± 0.088  ops/us
> NewAcmpBenchmark.newCmpEq_null_o1    thrpt    5  139.842 ± 0.108  ops/us
> NewAcmpBenchmark.newCmpEq_o1_null    thrpt    5  149.930 ± 0.105  ops/us
> NewAcmpBenchmark.newCmpEq_o1_o1      thrpt    5  138.952 ± 0.065  ops/us
> NewAcmpBenchmark.newCmpEq_o1_o2      thrpt    5  131.435 ± 0.119  ops/us
> NewAcmpBenchmark.newCmpEq_o2_o1      thrpt    5  131.307 ± 0.140  ops/us
>
> As expected, your version is better if a != b. In the a == b case, the current implementation is
> better. If we are assuming that != is more common than == (like in the newCmpEqAll case), your
> version is better.
I need more time to check this.
I just want to give you a couple of tips about microbenchmarks.
- The double DONT_INLINE is not required; having it only on "cmpEq" is enough, and 
you'll get less noisy invocation overhead.
- When dealing with small operations it's better to switch the benchmark mode 
to average time in nanoseconds; that lets you quickly notice if 
something is wrong.
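
As a rough illustration of the second point, the throughput scores quoted above can be converted to time per operation by hand. The sketch below is plain Java with a hypothetical helper name; in JMH itself the equivalent switch is Mode.AverageTime with TimeUnit.NANOSECONDS as the output unit.

```java
// Hypothetical helper: converts a JMH throughput score (ops/us) to ns/op.
final class ThroughputToAvgTime {

    // One microsecond is 1000 ns, so ns/op = 1000 / (ops/us).
    static double nsPerOp(double opsPerUs) {
        return 1000.0 / opsPerUs;
    }

    public static void main(String[] args) {
        // The two newCmpEqAll scores from the tables above.
        double[] scores = {44.109, 41.727};
        for (double s : scores) {
            System.out.printf("%.3f ops/us = %.2f ns/op%n", s, nsPerOp(s));
        }
    }
}
```

The two newCmpEqAll scores come out at roughly 22.7 and 24.0 ns per invocation, a number that is easy to compare against the handful of instructions actually under test.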
>
> Which acmp microbenchmarks are you using? I couldn't find any in the set of benchmarks that you've
> pushed.
I am still thinking about how to build a representative set of acmp micros that 
properly covers the inlined case, which is the most important one.
I'll push them when they are ready.




