[Exp] First prototype of new acmp bytecode
Tobias Hartmann
tobias.hartmann at oracle.com
Fri Mar 9 10:02:27 UTC 2018
Hi,
On 08.03.2018 17:39, Tobias Hartmann wrote:
> -XX:-TieredCompilation -XX:-UseNewAcmp
> Benchmark                                Mode  Cnt    Score   Error   Units
> NewAcmpBenchmark.newCmp                 thrpt  200  108.911 ± 0.086  ops/us
> NewAcmpBenchmark.newCmpDoubleNull       thrpt  200   88.206 ± 4.792  ops/us
> NewAcmpBenchmark.newCmpDoubleNullFalse  thrpt  200   72.742 ± 7.563  ops/us
> NewAcmpBenchmark.newCmpField            thrpt  200  107.090 ± 0.083  ops/us
> NewAcmpBenchmark.oldCmp                 thrpt  200  114.466 ± 0.077  ops/us
>
> -XX:-TieredCompilation -XX:+UseNewAcmp -XX:ValueBasedClasses=compiler/valhalla/valuetypes/MyValue
> Benchmark                                Mode  Cnt    Score   Error   Units
> NewAcmpBenchmark.newCmp                 thrpt  200  101.480 ± 0.260  ops/us
> NewAcmpBenchmark.newCmpDoubleNull       thrpt  200   90.429 ± 4.741  ops/us
> NewAcmpBenchmark.newCmpDoubleNullFalse  thrpt  200   81.230 ± 4.115  ops/us
> NewAcmpBenchmark.newCmpField            thrpt  200  102.224 ± 0.019  ops/us
> NewAcmpBenchmark.oldCmp                 thrpt  200  114.336 ± 0.239  ops/us
>
> In the worst case, if we need to emit the new acmp and the first operand is not null, there is a
> performance impact of 6.80% (see newCmp).
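For reference, the 6.80% follows directly from the newCmp rows above: (108.911 - 101.480) / 108.911 ≈ 6.8%.

The actual NewAcmpBenchmark source is not shown here, but the kind of comparison being measured looks roughly like this (hypothetical sketch, class and field names are mine):

    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.Scope;
    import org.openjdk.jmh.annotations.State;

    @State(Scope.Thread)
    public class AcmpSketch {
        public Object a = new Object();  // both operands non-null, so the worst case above applies
        public Object b = new Object();

        @Benchmark
        public boolean cmp() {
            return a == b;  // reference comparison (if_acmp*); with -XX:+UseNewAcmp this takes the new path
        }
    }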
>
> However, in many cases we can use static type information to optimize. For example, if we know that
> one operand is a value type, we can emit a "double null check". This causes the performance impact
> to disappear into the noise (see newCmpDoubleNull). If we additionally know that one operand is
> always non-null, we can emit a static false. This improves performance by ~11% (with high error)
> compared to the old acmp baseline (see newCmpDoubleNullFalse).
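If I read the "double null check" right, a comparison where one operand is statically known to be a value type can be lowered to something like this (my sketch of the reduction, not the actual C2 IR; it assumes the prototype's rule that an acmp involving a non-null value operand is always false):

    class AcmpLowering {
        // a's static type is the value-based class, b is an arbitrary reference.
        // A non-null value operand always compares false, so the general case
        // collapses to two null checks:
        static boolean acmpWithValueOperand(Object a, Object b) {
            return a == null && b == null;  // the "double null check"
        }
        // If a is additionally known to be non-null, the whole expression folds
        // to a constant false (the newCmpDoubleNullFalse case).
    }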
>
> There is one pitfall. If we compare two object fields, C2 optimizes old acmp to directly compare the
> narrow oops (no need to decode). With the new acmp, we need to decode the oop because we use derived
> oops for perturbation. Surprisingly, the newCmpField benchmark shows that the regression is even
> lower than in the newCmp case (4.5%). That's probably because the comparison is always false and
> therefore the CPU's branch prediction works better, mitigating the performance impact of the
> additional instructions.
I've executed an additional run with -XX:-UseCompressedOops and the performance results are still
the same. That means the overhead of decoding the oop is not measurable (in this microbenchmark).
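For context, the field case is a comparison of the shape below (a hypothetical stand-in for newCmpField, names are mine). With compressed oops, the old acmp can compare the two 32-bit narrow oops loaded from the fields directly, while the new acmp has to decode them to full oops first because the perturbation is applied to a derived oop:

    class Holder {
        Object f;  // stored as a 32-bit narrow oop when compressed oops are enabled
    }

    class FieldCmp {
        static boolean cmpFields(Holder x, Holder y) {
            return x.f == y.f;  // old acmp: direct narrow-oop compare; new acmp: must decode first
        }
    }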
> The last benchmark (oldCmp) verifies that if C2 is able to determine that one operand is not a value
> type, we can use the old acmp and performance is equal to the baseline.
>
> I will re-run the tests with type speculation enabled to see how much of a difference that makes.
I did, and the results are rather surprising (as so often happens with these microbenchmarks). While the
baseline run has exactly the same results as without type speculation (see above):
-XX:-TieredCompilation -XX:-UseNewAcmp -XX:TypeProfileLevel=222
Benchmark                                Mode  Cnt    Score   Error   Units
NewAcmpBenchmark.newCmp                 thrpt  200  108.911 ± 0.086  ops/us
NewAcmpBenchmark.newCmpDoubleNull       thrpt  200   88.206 ± 4.792  ops/us
NewAcmpBenchmark.newCmpDoubleNullFalse  thrpt  200   72.742 ± 7.563  ops/us
NewAcmpBenchmark.newCmpField            thrpt  200  107.041 ± 0.142  ops/us
NewAcmpBenchmark.oldCmp                 thrpt  200  114.466 ± 0.077  ops/us
... the performance of the patched run is much worse:
-XX:-TieredCompilation -XX:+UseNewAcmp -XX:TypeProfileLevel=222
-XX:ValueBasedClasses=compiler/valhalla/valuetypes/MyValue
Benchmark                                Mode  Cnt    Score   Error   Units
NewAcmpBenchmark.newCmp                 thrpt  200   95.040 ± 0.060  ops/us
NewAcmpBenchmark.newCmpDoubleNull       thrpt  200   84.122 ± 4.117  ops/us
NewAcmpBenchmark.newCmpDoubleNullFalse  thrpt  200   89.854 ± 5.014  ops/us
NewAcmpBenchmark.newCmpField            thrpt  200  102.218 ± 0.020  ops/us
NewAcmpBenchmark.oldCmp                 thrpt  200  114.456 ± 0.077  ops/us
The problem is that with type speculation, C2 adds an uncommon trap (Java call), and although that
trap is never taken, we now need stack banging at method entry (see Compile::need_stack_bang()):
0x00007f62f8ea5f40: mov %eax,-0x14000(%rsp)
I've verified that disabling stack banging brings back performance (almost - it seems like method
size also has an impact), but the implicit null check does not improve performance in this case.
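For reference, the bang offset in the instruction above works out to

    0x14000 = 81920 bytes = 20 * 4096-byte pages

which, if I read it correctly, is the shadow zone the entry bang has to probe below %rsp once the
method contains such a call.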
Thanks,
Tobias