optimizing acmp in L-World

Tue Aug 21 23:21:45 UTC 2018

Hi Tobias,

On 08/17/2018 06:19 AM, Tobias Hartmann wrote:
>
> As John already pointed out, this only helps if we want the result in a register. Due to aggressive
> inlining in C2, this is rarely the case but usually we need an explicit cmp/jmp to branch to the
> corresponding target blocks.
>
> It's also not possible in the interpreter where we always explicitly branch.
I want to say that in the branching case the code can be even simpler (see below).
> I've implemented this (-XX:+UseOldAcmp):
> http://cr.openjdk.java.net/~thartmann/valhalla/lworld/acmp_optimization/webrev.00/
>
> The generated code then looks like this:
>
>   cmp    %rdx,%rcx
>   jne    is_ne
>   test   %rcx,%rcx
>   je     is_null
>   mov    0x8(%rcx),%r11d
>   mov    %r11,%r10
>   shr    $0x3,%r10
>   test   $0x1,%r10
>   jne    is_ne
> is_null:
>   mov    $0x1,%eax
>   jmp    end
> is_ne:
>   xor    %eax,%eax
> end:

In the branching case, I'd rather suggest replacing

  mov    0x8(%rcx),%r11d
  mov    %r11,%r10
  shr    $0x3,%r10
  test   $0x1,%r10
  jne...

with

  testl  $0x8,0x8(%rcx)
  jne...

This frees the two scratch registers (%r10 and %r11), which can make a
visible difference in heavily inlined code.
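
As a sanity check on the mask, here is a minimal Java sketch showing that
testing bit 3 directly gives the same answer as the mov/shr/test sequence.
The class, method, and constant names are made up for illustration, and the
int argument simply stands in for the 32-bit word loaded from 0x8(%rcx);
nothing here is HotSpot code.

```java
public class ValueBitTest {
    // Assumption for illustration: the bit of interest is bit 3 of the loaded word.
    static final int BIT3_MASK = 0x8;

    // Mirrors the longer sequence: mov 0x8(%rcx),%r11d ; shr $0x3 ; test $0x1
    static boolean shiftThenTest(int word) {
        long widened = word & 0xFFFFFFFFL; // a 32-bit mov zero-extends into the 64-bit register
        long shifted = widened >>> 3;      // shr $0x3
        return (shifted & 0x1) != 0;       // test $0x1 ; jne
    }

    // Mirrors the single instruction: test the mask 0x8 against the word in memory
    static boolean maskTest(int word) {
        return (word & BIT3_MASK) != 0;
    }

    public static void main(String[] args) {
        int[] samples = {0x0, 0x8, 0x7, 0xF, 0x10, 0xFFFFFFF7, 0xFFFFFFFF, 0x12345678};
        for (int w : samples) {
            if (shiftThenTest(w) != maskTest(w)) {
                throw new AssertionError("mismatch for 0x" + Integer.toHexString(w));
            }
        }
        System.out.println("both forms agree on all samples");
    }
}
```

Both forms look only at bit 3, so the shorter one differs just in that it
needs no temporary registers.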

>
> I'm using this benchmark for evaluation:
> http://cr.openjdk.java.net/~thartmann/valhalla/lworld/acmp_optimization/webrev.00/NewAcmpBenchmark.java
>
> Unfortunately, the results are highly dependent on profiling information and the resulting layout of
> code. I've therefore executed the benchmark with -XX:-ProfileInterpreter. Here are the results:
>
> With -XX:+UseOldAcmp
> Benchmark                             Mode  Cnt    Score   Error   Units
> NewAcmpBenchmark.newCmpEqAll         thrpt    5   44.109 ± 0.045  ops/us
> NewAcmpBenchmark.newCmpEq_null_null  thrpt    5  112.509 ± 0.073  ops/us
> NewAcmpBenchmark.newCmpEq_null_o1    thrpt    5  149.950 ± 0.124  ops/us
> NewAcmpBenchmark.newCmpEq_o1_null    thrpt    5  149.960 ± 0.130  ops/us
> NewAcmpBenchmark.newCmpEq_o1_o1      thrpt    5  132.312 ± 0.171  ops/us
> NewAcmpBenchmark.newCmpEq_o1_o2      thrpt    5  140.188 ± 0.056  ops/us
> NewAcmpBenchmark.newCmpEq_o2_o1      thrpt    5  140.119 ± 0.141  ops/us
>
> With -XX:-UseOldAcmp
> Benchmark                             Mode  Cnt    Score   Error   Units
> NewAcmpBenchmark.newCmpEqAll         thrpt    5   41.727 ± 0.022  ops/us
> NewAcmpBenchmark.newCmpEq_null_null  thrpt    5  124.938 ± 0.088  ops/us
> NewAcmpBenchmark.newCmpEq_null_o1    thrpt    5  139.842 ± 0.108  ops/us
> NewAcmpBenchmark.newCmpEq_o1_null    thrpt    5  149.930 ± 0.105  ops/us
> NewAcmpBenchmark.newCmpEq_o1_o1      thrpt    5  138.952 ± 0.065  ops/us
> NewAcmpBenchmark.newCmpEq_o1_o2      thrpt    5  131.435 ± 0.119  ops/us
> NewAcmpBenchmark.newCmpEq_o2_o1      thrpt    5  131.307 ± 0.140  ops/us
>
> As expected, your version is better if a != b. In the a == b case, the current implementation is
> better. If we are assuming that != is more common than == (like in the newCmpEqAll case), your
> version is better.
I need more time to check this.
I just want to give you a couple of suggestions about the microbenchmarks.
- The double DONT_INLINE is not required; having it only on "cmpEq" is
enough, and you'll get less noisy invocation overhead.
- When dealing with such small operations, it's better to switch the
benchmark mode to average time in nanoseconds. That makes it easy to
notice quickly if something is wrong.
>
> Which acmp microbenchmarks are you using? I couldn't find any in the set of benchmarks that you've
> pushed.
I am still thinking about how to put together a representative set of acmp
micros that properly covers the inlined case, which is the most important one.
I'll push them when they are ready.