optimizing acmp in L-World

Fri Aug 31 13:04:14 UTC 2018

Hi Sergey,

here's a new webrev that uses the always-locked mark word pattern and completely removes the klass
pointer alignment code:
http://cr.openjdk.java.net/~thartmann/valhalla/lworld/acmp_optimization/webrev.01/

Old acmp is now the default. The perturbation scheme can be enabled by -XX:+UsePointerPerturbation.

I've filed an enhancement for this and will follow up with an RFE once the numbers look good:
https://bugs.openjdk.java.net/browse/JDK-8210260

On 29.08.2018 22:30, Sergey Kuksenko wrote:
> I've written 'acmp' microbenchmarks and put them into the repository. You can find them in the
> 'oracle.micro.valhalla.baseline.acmp' package. I think that is quite a representative set. The key
> idea is to compare a set of references inside a loop. There are two versions: one where the
> comparison is used for branching (if-condition) and one where it is used as a boolean value. The
> benchmarks also have a parameter that controls the percentage of equal/not-equal values.

Great, thank you!
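
For readers without access to that package, the benchmark shape described above can be sketched
roughly as follows. This is a plain-Java approximation without the JMH harness, not the code from
the repository; the class name AcmpSketch, the method names and the seed are made up for
illustration. Whether the JIT really emits a branch for the first variant is up to the compiler.

```java
import java.util.Random;

public class AcmpSketch {
    // Builds two reference arrays; about eqPercent percent of the index
    // positions hold the same object on both sides.
    static Object[][] makePairs(int size, int eqPercent, long seed) {
        Random rnd = new Random(seed);
        Object[] left = new Object[size];
        Object[] right = new Object[size];
        for (int i = 0; i < size; i++) {
            left[i] = new Object();
            right[i] = (rnd.nextInt(100) < eqPercent) ? left[i] : new Object();
        }
        return new Object[][] { left, right };
    }

    // Variant 1: the comparison is used as an if-condition.
    static int countEqualBranch(Object[] left, Object[] right) {
        int count = 0;
        for (int i = 0; i < left.length; i++) {
            if (left[i] == right[i]) {
                count++;
            }
        }
        return count;
    }

    // Variant 2: the comparison result is consumed as a boolean value.
    static boolean[] equalAsValue(Object[] left, Object[] right) {
        boolean[] out = new boolean[left.length];
        for (int i = 0; i < left.length; i++) {
            out[i] = left[i] == right[i];
        }
        return out;
    }

    public static void main(String[] args) {
        for (int eq : new int[] { 0, 50, 100 }) {
            Object[][] p = makePairs(1_000, eq, 42L);
            System.out.println("eq=" + eq + " equal pairs: " + countEqualBranch(p[0], p[1]));
        }
    }
}
```

In a real measurement the two variants would be separate JMH benchmark methods with the equality
percentage as a parameter, as in the quoted "-p eq=100" option.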

> Two issues related to -XX:+UseOldCmp were found:
> 1. Operation '!=' when 100% values are equal causes JVM crash (JMH options to reproduce:
> "IsCmp.isNotCmpBranch   -p eq=100")

Thanks, I was able to reproduce this, fixed the problem and added a regression test case.

> 2. -XX:+UseOldCmp and PrintAssembly cause a JVM crash (particularly when I tried to use the
> "-prof perfasm" JMH option)

I was not able to reproduce this with the latest webrev. I've tried running all acmp benchmarks with
-prof perfasm. Which benchmark/settings triggered this for you?

> Non-isolated benchmarks (when 'acmp' code is inlined) show that -XX:+UseOldCmp is always faster than
> -XX:-UseOldCmp.
> I've attached 4 charts showing this ('==' and '!=' operations, branching and boolean value)

Okay, thanks for the details. Could you re-evaluate this with the new webrev?

> There is also another consideration related to CompressedOops. Before Valhalla, reference
> comparison didn't need to uncompress oops, because comparing the compressed oops is enough.
> 
> -XX:-UseOldCmp uncompresses both references. That increases the 'acmp' overhead.
> -XX:+UseOldCmp may uncompress only one reference, and only when the compressed oops are equal.

Yes, that's true.
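
To make the difference in decode counts concrete, here is a toy model in plain Java. It is not
HotSpot code: the heap base, the shift, the address set used as a value-type check, and the way
the pointer is perturbed are all assumptions made for illustration. It also assumes that acmp
yields false when the operand is a value type. Only the number of decodes per comparison reflects
what is described above.

```java
import java.util.HashSet;
import java.util.Set;

public class NarrowCompareModel {
    static final long HEAP_BASE = 0x8_0000_0000L;  // hypothetical heap base
    static final int SHIFT = 3;                    // assumes 8-byte object alignment
    // Hypothetical stand-in for "this address holds a value type instance".
    static final Set<Long> VALUE_ADDRESSES = new HashSet<>();
    static int decodes = 0;                        // counts simulated decode operations

    // Simulated uncompress of a narrow (32-bit) reference.
    static long decode(int narrow) {
        decodes++;
        return HEAP_BASE + ((narrow & 0xFFFF_FFFFL) << SHIFT);
    }

    static boolean isValue(long address) {
        return VALUE_ADDRESSES.contains(address);
    }

    // Old-style model: compare narrow refs first, decode one side only if they match.
    static boolean oldAcmp(int a, int b) {
        if (a != b) {
            return false;
        }
        return !isValue(decode(a));
    }

    // Perturbation-style model (assumed): decode both sides, then set a low bit
    // on a value-type pointer so it can never equal an aligned address.
    static boolean newAcmp(int a, int b) {
        long pa = decode(a);
        long pb = decode(b);
        if (isValue(pa)) {
            pa |= 1;
        }
        return pa == pb;
    }

    public static void main(String[] args) {
        VALUE_ADDRESSES.add(HEAP_BASE + (7L << SHIFT)); // narrow ref 7 models a value object
        decodes = 0;
        oldAcmp(1, 2);
        System.out.println("old, different refs: " + decodes + " decodes");
        decodes = 0;
        oldAcmp(1, 1);
        System.out.println("old, same ref: " + decodes + " decodes");
        decodes = 0;
        newAcmp(1, 2);
        System.out.println("new, different refs: " + decodes + " decodes");
    }
}
```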

>> I have to say that according to my measurements -XX:+UseOldAcmp shows better performance in all
>> cases, even when comparing the same references. I managed to construct and reproduce a corner
>> case where the same refs have better performance with -XX:-UseOldAcmp. The reason is that the
>> code generated with -XX:-UseOldAcmp uses a conditional move into the result register, whereas
>> -XX:+UseOldAcmp generates branches. So if -XX:-UseOldAcmp generates branches, it will be slower,
>> or if -XX:+UseOldAcmp doesn't generate branches, it will be faster.
>> I think a benchmark measuring isolated acmp performance is not relevant to usages of acmp in
>> applications. By isolated I mean a non-inlined method with the result in a register. Right now I
>> am trying to make a non-isolated benchmark where acmp is used for a condition and the
>> corresponding branching.

Yes, acmp used in isolation should not be a common case. If it's hot, C2 will inline.

>> I've found yet another benchmarking pitfall here. Typically JMH executes all subbenchmarks in
>> separate VMs, which means that when measuring o1==o1 we have only that branch in the profile. If
>> you want to measure full acmp performance (full meaning that all acmp branches are in the
>> profile), you have to use yet another JMH option, "-wm BULK", which provides bulk warmup of all
>> combinations before measurement.

Yes, but that depends on what you want to measure. We should also have benchmarks for the case where
C2 cuts off branches because profile information suggests that they are never taken.

Thanks,
Tobias