[Exp] First prototype of new acmp bytecode
Tobias Hartmann
tobias.hartmann at oracle.com
Fri Mar 9 10:02:27 UTC 2018
Hi,
On 08.03.2018 17:39, Tobias Hartmann wrote:
> -XX:-TieredCompilation -XX:-UseNewAcmp
> Benchmark                                Mode  Cnt    Score   Error   Units
> NewAcmpBenchmark.newCmp                 thrpt  200  108.911 ± 0.086  ops/us
> NewAcmpBenchmark.newCmpDoubleNull       thrpt  200   88.206 ± 4.792  ops/us
> NewAcmpBenchmark.newCmpDoubleNullFalse  thrpt  200   72.742 ± 7.563  ops/us
> NewAcmpBenchmark.newCmpField            thrpt  200  107.090 ± 0.083  ops/us
> NewAcmpBenchmark.oldCmp                 thrpt  200  114.466 ± 0.077  ops/us
>
> -XX:-TieredCompilation -XX:+UseNewAcmp -XX:ValueBasedClasses=compiler/valhalla/valuetypes/MyValue
> Benchmark                                Mode  Cnt    Score   Error   Units
> NewAcmpBenchmark.newCmp                 thrpt  200  101.480 ± 0.260  ops/us
> NewAcmpBenchmark.newCmpDoubleNull       thrpt  200   90.429 ± 4.741  ops/us
> NewAcmpBenchmark.newCmpDoubleNullFalse  thrpt  200   81.230 ± 4.115  ops/us
> NewAcmpBenchmark.newCmpField            thrpt  200  102.224 ± 0.019  ops/us
> NewAcmpBenchmark.oldCmp                 thrpt  200  114.336 ± 0.239  ops/us
>
> In the worst case, if we need to emit the new acmp and the first operand is not null, there is a
> performance impact of 6.80% (see newCmp).
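For reference, the 6.80% follows directly from the newCmp rows above: (108.911 - 101.480) / 108.911 ≈ 6.8%.

The actual NewAcmpBenchmark source is not shown here, but the kind of comparison being measured looks roughly like this (hypothetical sketch, class and field names are mine):

    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.Scope;
    import org.openjdk.jmh.annotations.State;

    @State(Scope.Thread)
    public class AcmpSketch {
        public Object a = new Object();  // both operands non-null, so the worst case above applies
        public Object b = new Object();

        @Benchmark
        public boolean cmp() {
            return a == b;  // reference comparison (if_acmp*); with -XX:+UseNewAcmp this takes the new path
        }
    }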
>
> However, in many cases we can use static type information to optimize. For example, if we know that
> one operand is a value type, we can emit a "double null check". This causes the performance impact
> to disappear into the noise (see newCmpDoubleNull). If we additionally know that one operand is
> always non-null, we can emit a static false. This improves performance by ~11% (with high error)
> compared to the old acmp baseline (see newCmpDoubleNullFalse).
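If I read the "double null check" right, a comparison where one operand is statically known to be a value type can be lowered to something like this (my sketch of the reduction, not the actual C2 IR; it assumes the prototype's rule that an acmp involving a non-null value operand is always false):

    class AcmpLowering {
        // a's static type is the value-based class, b is an arbitrary reference.
        // A non-null value operand always compares false, so the general case
        // collapses to two null checks:
        static boolean acmpWithValueOperand(Object a, Object b) {
            return a == null && b == null;  // the "double null check"
        }
        // If a is additionally known to be non-null, the whole expression folds
        // to a constant false (the newCmpDoubleNullFalse case).
    }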
>
> There is one pitfall. If we compare two object fields, C2 optimizes old acmp to directly compare the
> narrow oops (no need to decode). With the new acmp, we need to decode the oop because we use derived
> oops for perturbation. Surprisingly, the newCmpField benchmark shows that the regression is even
> lower than in the newCmp case (4.5%). That's probably because the comparison is always false and
> therefore the CPU's branch prediction works better, mitigating the performance impact of the
> additional instructions.
I've executed an additional run with -XX:-UseCompressedOops and the performance results are still
the same. That means the overhead of decoding the oop is not measurable (in this microbenchmark).
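For context, the field case is a comparison of the shape below (a hypothetical stand-in for newCmpField, names are mine). With compressed oops, the old acmp can compare the two 32-bit narrow oops loaded from the fields directly, while the new acmp has to decode them to full oops first because the perturbation is applied to a derived oop:

    class Holder {
        Object f;  // stored as a 32-bit narrow oop when compressed oops are enabled
    }

    class FieldCmp {
        static boolean cmpFields(Holder x, Holder y) {
            return x.f == y.f;  // old acmp: direct narrow-oop compare; new acmp: must decode first
        }
    }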
> The last benchmark (oldCmp) verifies that if C2 is able to determine that one operand is not a value
> type, we can use the old acmp and performance is equal to the baseline.
>
> I will re-run the tests with type speculation enabled to see how much of a difference that makes.
I did, and the results are rather surprising (as so often happens with these microbenchmarks). While the
baseline run has exactly the same results as without type speculation (see above):
-XX:-TieredCompilation -XX:-UseNewAcmp -XX:TypeProfileLevel=222
Benchmark                                Mode  Cnt    Score   Error   Units
NewAcmpBenchmark.newCmp                 thrpt  200  108.911 ± 0.086  ops/us
NewAcmpBenchmark.newCmpDoubleNull       thrpt  200   88.206 ± 4.792  ops/us
NewAcmpBenchmark.newCmpDoubleNullFalse  thrpt  200   72.742 ± 7.563  ops/us
NewAcmpBenchmark.newCmpField            thrpt  200  107.041 ± 0.142  ops/us
NewAcmpBenchmark.oldCmp                 thrpt  200  114.466 ± 0.077  ops/us
... the performance of the patched run is much worse:
-XX:-TieredCompilation -XX:+UseNewAcmp -XX:TypeProfileLevel=222
-XX:ValueBasedClasses=compiler/valhalla/valuetypes/MyValue
Benchmark                                Mode  Cnt    Score   Error   Units
NewAcmpBenchmark.newCmp                 thrpt  200   95.040 ± 0.060  ops/us
NewAcmpBenchmark.newCmpDoubleNull       thrpt  200   84.122 ± 4.117  ops/us
NewAcmpBenchmark.newCmpDoubleNullFalse  thrpt  200   89.854 ± 5.014  ops/us
NewAcmpBenchmark.newCmpField            thrpt  200  102.218 ± 0.020  ops/us
NewAcmpBenchmark.oldCmp                 thrpt  200  114.456 ± 0.077  ops/us
The problem is that with type speculation, C2 adds an uncommon trap (Java call), and although that
trap is never taken, we now need stack banging at method entry (see Compile::need_stack_bang()):
0x00007f62f8ea5f40: mov %eax,-0x14000(%rsp)
I've verified that disabling stack banging brings back performance (almost - it seems like method
size also has an impact), but the implicit null check does not improve performance in this case.
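For reference, the bang offset in the instruction above works out to

    0x14000 = 81920 bytes = 20 * 4096-byte pages

which, if I read it correctly, is the shadow zone the entry bang has to probe below %rsp once the
method contains such a call.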
Thanks,
Tobias