optimizing acmp in L-World

Tobias Hartmann tobias.hartmann at oracle.com
Fri Aug 17 13:19:46 UTC 2018


Hi Sergey,

thanks for this analysis and sorry for the delay. Please see comments below.

On 26.07.2018 03:14, Sergey Kuksenko wrote:
> I've measured new Valhalla acmp code on legacy (non-value types code).
> Performance regression of the acmp operation is 10%-20% compared to the pre-Valhalla world. It was
> measured on microbenchmarks, and it looks like it shouldn't be significantly visible in real apps
> except in some corner cases.

Yes, I don't think that this is measurable for real applications, especially because real
applications usually do not just check for equality but also do something with these objects. In
most cases, C2 will either be able to figure out at compile time that one of the operands is *not*
a value type (and therefore emit the old acmp) or that one operand *is* a value type (and therefore
emit a double null check, because a value type can only ever be "equal" to null).

In addition, if C2 emits a null check or klass load, it will re-use that knowledge/result throughout
the scope of the compilation. For example, although the additional null check will slow things down,
C2 can omit another null check that would usually follow in one of the branches.
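In Java terms, the two statically-resolved cases boil down to the following sketch (helper names
are mine and purely illustrative; this is the semantics C2 exploits, not its IR):

```java
// Hypothetical Java-level sketch of the two acmp cases C2 can resolve statically.
public class AcmpCases {

    // One operand statically known NOT to be a value type:
    // new acmp degenerates to the old pointer comparison.
    static boolean acmpNotValue(Object a, Object b) {
        return a == b;
    }

    // One operand statically known to BE a value type: it can only ever
    // be "equal" to null, so acmp reduces to a double null check.
    static boolean acmpKnownValue(Object a, Object b) {
        return a == null && b == null;
    }

    public static void main(String[] args) {
        Object o = new Object();
        System.out.println(acmpNotValue(o, o));         // true
        System.out.println(acmpKnownValue(null, null)); // true
        System.out.println(acmpKnownValue(o, null));    // false
    }
}
```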

> And right now the code that checks whether the class is a value class and mixes in the bit looks like:
> 
> mov    0x8(%rcx),%r10d  // load class ptr
> shr     $0x3,%r10
> and    $0x1,%r10            // here r10 is 1 if value, 0 if not value
> add    %r10,%rcx            // mix with value bit

Yes, the C2-generated code for old acmp ('return a == b;') looks like this:

 xor    %r10d,%r10d    // r10d = 0 (result if not equal)
 mov    $0x1,%eax      // eax = 1 (result if equal)
 cmp    %rdx,%rcx      // compare the two operands
 cmovne %r10d,%eax     // eax = 0 if the operands differ

Whereas new acmp looks like this:

 test   %rcx,%rcx        // is the first operand null?
 je     is_null          // if so, skip the klass load
 mov    0x8(%rcx),%r10d  // load the klass pointer
 shr    $0x3,%r10        // shift the "is value" bit into position
 and    $0x1,%r10        // r10 = 1 if value type, 0 otherwise
 add    %r10,%rcx        // perturb the pointer so value types never compare equal
is_null:
 xor    %r11d,%r11d      // r11d = 0 (result if not equal)
 mov    $0x1,%eax        // eax = 1 (result if equal)
 cmp    %rdx,%rcx        // compare the (possibly perturbed) operands
 cmovne %r11d,%eax       // eax = 0 if the operands differ

> I'd rather rewrite to:
> 
>   bool new_acmp(ptr x, ptr y) {
>     if (!old_acmp(x, y))  return false;  // x != y => false
>     return x!=null && !is_value_ptr(x);
>   }

I gave that a try (see below).
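In Java terms, the semantics of that rewrite can be sketched as follows. A marker interface stands
in for the "is value" bit in the klass pointer, which is not reachable from plain Java; all names
here are hypothetical, not VM code:

```java
// Illustrative sketch of the suggested new_acmp rewrite.
public class NewAcmpSketch {

    interface ValueMarker {} // hypothetical stand-in for the value-type bit

    static boolean isValuePtr(Object x) {
        return x instanceof ValueMarker;
    }

    static boolean oldAcmp(Object x, Object y) {
        return x == y; // plain pointer comparison
    }

    // The point of the rewrite: the klass load only happens on the x == y
    // path, so the common x != y case never touches memory.
    static boolean newAcmp(Object x, Object y) {
        if (!oldAcmp(x, y)) return false; // x != y => false, no klass load
        return x != null && !isValuePtr(x);
    }

    public static void main(String[] args) {
        Object o = new Object();
        Object v = new ValueMarker() {}; // pretend value type
        System.out.println(newAcmp(o, o));            // true
        System.out.println(newAcmp(o, new Object())); // false
        System.out.println(newAcmp(null, null));      // true
        System.out.println(newAcmp(v, v));            // false: value types never acmp-equal
    }
}
```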

> and the following will get us the required 0 and 1 from this:
> 
> mov    0x8(%rcx),%r10d  // load class ptr
> shr     $0x3,%r10
> and    $0x1,%r10            // here r10 is 1 if value, 0 if not value
> xor     $0x1,%r10            // here r10 false if value, true if not value.

As John already pointed out, this only helps if we want the result in a register. Due to aggressive
inlining in C2, this is rarely the case; usually we need an explicit cmp/jmp to branch to the
corresponding target blocks.

It's also not possible in the interpreter, where we always branch explicitly.

> Checking the currently generated code, I split it into traces and counted the asm instructions,
> and also counted the asm instructions that would be emitted by the suggested code generation
> (unconditional branches are not counted):
> 
> case x==null & y==null:
>   current code:    3 ALU ops + 2 branches
>   suggested code:  3 ALU ops + 2 branches
> 
> case x==null & y!=null:
>   current code:    3 ALU ops + 2 branches
>   suggested code:  2 ALU ops + 1 branch
> 
> case x==y:
>   current code:    6 ALU ops + 2 branches + 1 load from memory
>   suggested code:  5 ALU ops + 2 branches + 1 load from memory
> 
> case x!=y:
>   current code:    6 ALU ops + 2 branches + 1 load from memory
>   suggested code:  2 ALU ops + 1 branch
> 
> In some cases, the acmp cost is the same as in the pre-Valhalla world (2 ALU ops + 1 branch).
> 
> The most important point here is not the rather small decrease in ALU instructions, which honestly
> won't be visible even on microbenchmarks, but the fact that in the x!=y case we eliminate the load
> from memory, which even on an L1 cache hit costs as much as all the other instructions, and much
> more on a cache miss.

Yes, I think avoiding the load from memory is the main benefit here.

> The final suggested code (simplified):
> 
>     cmp rx,ry
>     je equal
>     xor eax,eax
>     jmp end
> 
> equal:
>     test rx,rx
>     je is_null
>     mov klass ptr from memory to eax
>     shr 3,eax
>     and 1,eax
>     xor  1,eax
>     jmp end
> 
> is_null:
>    mov 1,eax
>    jmp end
> 
> end:
>   // eax contains result

I've implemented this (-XX:+UseOldAcmp):
http://cr.openjdk.java.net/~thartmann/valhalla/lworld/acmp_optimization/webrev.00/

The generated code then looks like this:

 cmp    %rdx,%rcx        // compare the two operands
 jne    is_ne            // different pointers => false, no memory load
 test   %rcx,%rcx        // equal pointers: are they null?
 je     is_null          // null == null => true
 mov    0x8(%rcx),%r11d  // load the klass pointer
 mov    %r11,%r10
 shr    $0x3,%r10        // shift the "is value" bit into position
 test   $0x1,%r10        // is it a value type?
 jne    is_ne            // value types are never equal
is_null:
 mov    $0x1,%eax        // result = true
 jmp    end
is_ne:
 xor    %eax,%eax        // result = false
end:

I'm using this benchmark for evaluation:
http://cr.openjdk.java.net/~thartmann/valhalla/lworld/acmp_optimization/webrev.00/NewAcmpBenchmark.java
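
The real NewAcmpBenchmark is a JMH benchmark; the operand combinations it exercises can be sketched
in plain Java like this (case names below merely mirror the method names in the results tables, the
structure is my reconstruction):

```java
// Hypothetical reconstruction of the benchmark's operand combinations.
public class AcmpBenchCases {

    static final Object o1 = new Object();
    static final Object o2 = new Object();

    static boolean cmpEq(Object a, Object b) {
        return a == b; // the acmp under test
    }

    public static void main(String[] args) {
        System.out.println(cmpEq(null, null)); // newCmpEq_null_null -> true
        System.out.println(cmpEq(null, o1));   // newCmpEq_null_o1   -> false
        System.out.println(cmpEq(o1, null));   // newCmpEq_o1_null   -> false
        System.out.println(cmpEq(o1, o1));     // newCmpEq_o1_o1     -> true
        System.out.println(cmpEq(o1, o2));     // newCmpEq_o1_o2     -> false
        System.out.println(cmpEq(o2, o1));     // newCmpEq_o2_o1     -> false
    }
}
```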

Unfortunately, the results depend heavily on profiling information and the resulting code layout.
I've therefore executed the benchmark with -XX:-ProfileInterpreter. Here are the results:

With -XX:+UseOldAcmp
Benchmark                             Mode  Cnt    Score   Error   Units
NewAcmpBenchmark.newCmpEqAll         thrpt    5   44.109 ± 0.045  ops/us
NewAcmpBenchmark.newCmpEq_null_null  thrpt    5  112.509 ± 0.073  ops/us
NewAcmpBenchmark.newCmpEq_null_o1    thrpt    5  149.950 ± 0.124  ops/us
NewAcmpBenchmark.newCmpEq_o1_null    thrpt    5  149.960 ± 0.130  ops/us
NewAcmpBenchmark.newCmpEq_o1_o1      thrpt    5  132.312 ± 0.171  ops/us
NewAcmpBenchmark.newCmpEq_o1_o2      thrpt    5  140.188 ± 0.056  ops/us
NewAcmpBenchmark.newCmpEq_o2_o1      thrpt    5  140.119 ± 0.141  ops/us

With -XX:-UseOldAcmp
Benchmark                             Mode  Cnt    Score   Error   Units
NewAcmpBenchmark.newCmpEqAll         thrpt    5   41.727 ± 0.022  ops/us
NewAcmpBenchmark.newCmpEq_null_null  thrpt    5  124.938 ± 0.088  ops/us
NewAcmpBenchmark.newCmpEq_null_o1    thrpt    5  139.842 ± 0.108  ops/us
NewAcmpBenchmark.newCmpEq_o1_null    thrpt    5  149.930 ± 0.105  ops/us
NewAcmpBenchmark.newCmpEq_o1_o1      thrpt    5  138.952 ± 0.065  ops/us
NewAcmpBenchmark.newCmpEq_o1_o2      thrpt    5  131.435 ± 0.119  ops/us
NewAcmpBenchmark.newCmpEq_o2_o1      thrpt    5  131.307 ± 0.140  ops/us

As expected, your version is better if a != b. In the a == b case, the current implementation is
better. If we assume that != is more common than == (as in the newCmpEqAll case), your version
wins overall.

Which acmp microbenchmarks are you using? I couldn't find any in the set of benchmarks that you've
pushed.

Thanks,
Tobias
