optimizing acmp in L-World
Sergey Kuksenko
sergey.kuksenko at oracle.com
Wed Aug 29 20:30:09 UTC 2018
Hi Tobias,
I've written 'acmp' microbenchmarks and pushed them to the repository. You can
find them in the 'oracle.micro.valhalla.baseline.acmp' package. I think it is
a fairly representative set. The key idea is to compare a set of references
inside a loop. There are two versions: one where the comparison is used for
branching (an if-condition) and one where the comparison is used as a boolean
value. The benchmarks also have a parameter that controls the percentage of
equal/not-equal values.
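To illustrate the two benchmark shapes described above, here is a minimal plain-Java sketch. It is not the code from the 'oracle.micro.valhalla.baseline.acmp' package and it leaves out the JMH harness; the class name AcmpShapes, its methods, and the eqPercent parameter are hypothetical names chosen for this illustration.

```java
import java.util.Random;

// Illustrative sketch only; all names here are assumptions, not the real benchmarks.
final class AcmpShapes {

    // Builds two arrays in which roughly eqPercent% of the slots hold the same reference.
    static Object[][] pairs(int size, int eqPercent, long seed) {
        Random rnd = new Random(seed);
        Object[] a = new Object[size];
        Object[] b = new Object[size];
        for (int i = 0; i < size; i++) {
            a[i] = new Object();
            b[i] = rnd.nextInt(100) < eqPercent ? a[i] : new Object();
        }
        return new Object[][] { a, b };
    }

    // Shape 1: the comparison result drives a branch.
    static int cmpBranch(Object[] a, Object[] b) {
        int hits = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] == b[i]) {
                hits++;
            }
        }
        return hits;
    }

    // Shape 2: the comparison result is stored as a boolean value.
    static void cmpValue(Object[] a, Object[] b, boolean[] out) {
        for (int i = 0; i < a.length; i++) {
            out[i] = (a[i] == b[i]);
        }
    }
}
```

In a real JMH benchmark the two arrays would be built in a setup method and the percentage would be a benchmark parameter, so that the loop bodies above are the only code being measured.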
Two issues related to -XX:+UseOldAcmp were found:
1. The '!=' operation crashes the JVM when 100% of the values are equal (JMH
options to reproduce: "IsCmp.isNotCmpBranch -p eq=100")
2. -XX:+UseOldAcmp combined with PrintAssembly crashes the JVM (in particular
when I tried to use the "-prof perfasm" JMH option)
Non-isolated benchmarks (where the 'acmp' code is inlined) show that
-XX:+UseOldAcmp is always faster than -XX:-UseOldAcmp.
I've attached 4 charts showing this (the '==' and '!=' operations, each used
for branching and as a boolean value).
There is also another consideration, related to CompressedOops.
Before Valhalla, reference comparison did not need to uncompress
oops, because comparing the compressed oops is enough.
-XX:-UseOldAcmp uncompresses both references, which increases
the 'acmp' overhead.
-XX:+UseOldAcmp may uncompress only one reference, and only when
the compressed oops are equal.
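The difference in decode work can be sketched with a small Java model. This is not HotSpot code. It assumes a zero heap base with a 3-bit shift, stands in for the header check with a set of addresses, and assumes (as the perturbation approach discussed in this thread implies) that a comparison involving a value type must come out false. The class NarrowAcmpModel and all of its members are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of the two comparison shapes; all names are assumptions for illustration.
final class NarrowAcmpModel {
    static final int SHIFT = 3;                        // assumed 8-byte alignment, zero heap base
    final Set<Long> valueAddresses = new HashSet<>();  // stands in for the header check
    int decodes = 0;                                   // number of simulated uncompress operations

    static int encode(long address) {
        return (int) (address >>> SHIFT);
    }

    long decode(int narrow) {
        decodes++;
        return (narrow & 0xFFFFFFFFL) << SHIFT;
    }

    boolean isValue(long address) {
        return valueAddresses.contains(address);
    }

    // Compare the compressed oops first; decode one reference only when they are equal.
    boolean cmpNarrowFirst(int a, int b) {
        if (a != b) {
            return false;
        }
        return !isValue(decode(a));
    }

    // Decode both references, perturb the first if it is a value, then compare.
    boolean cmpDecodeBoth(int a, int b) {
        long pa = decode(a);
        long pb = decode(b);
        if (isValue(pa)) {
            pa -= 1; // an unaligned address can never equal an aligned one
        }
        return pa == pb;
    }
}
```

Both methods return the same answers in this model, but the first performs no decode at all when the compressed oops differ and one decode otherwise, while the second always performs two.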
On 08/27/2018 11:23 AM, Sergey Kuksenko wrote:
> Hi Tobias,
>
> I've checked your benchmark and UseOldAcmp option.
>
> I have to say that, according to my measurements, -XX:+UseOldAcmp shows
> better performance in all cases, even when comparing the same references. I
> managed to construct and reproduce a corner case where comparing the same
> refs has better performance with -XX:-UseOldAcmp. The reason is that the
> code generated with -XX:-UseOldAcmp uses a conditional move into the result
> register, whereas -XX:+UseOldAcmp generates branches. So if
> -XX:-UseOldAcmp generated branches it would be slower, and if
> -XX:+UseOldAcmp did not generate branches it would be faster.
> I think a benchmark measuring isolated acmp performance is not relevant
> to how acmp is used in applications. By isolated I mean a non-inlined
> method with the result in a register. Right now I am trying to make a
> non-isolated benchmark where acmp is used as a condition with the
> corresponding branching.
>
> I've found yet another benchmarking pitfall here. Typically JMH
> executes all sub-benchmarks in separate VMs, which means that when measuring
> o1==o1 we have only that branch in the profile. If you want to
> measure full acmp performance (full meaning that all acmp branches are
> in the profile), you have to use another JMH option, "-wm BULK",
> which provides bulk warmup of all combinations before measurement.
>
> As for markOopDesc::always_locked_pattern = 0x405, just tell me when
> it is ready and I'll check it. But in any case its cost should be
> higher than that of 1 bit in the class word.
>
> On 08/24/2018 07:05 AM, Tobias Hartmann wrote:
>> Hi Sergey,
>>
>> On 22.08.2018 01:21, Sergey Kuksenko wrote:
>>> I'd rather suggest in case of branching to replace
>>>
>>> mov 0x8(%rcx),%r11d
>>> mov %r11,%r10
>>> shr $0x3,%r10
>>> test $0x1,%r10
>>> jne...
>>>
>>> with
>>>
>>> test $0x8,0x8(%rcx)
>>> jne...
>>>
>>> It will save registers, which may be noticeable in highly inlined code.
>> Yes but that's only possible if we are testing against a single bit. That's
>> the case with the current klass pointer alignment trick but that will go
>> away. In the future we will use a special bit pattern in the mark word
>> (markOopDesc::always_locked_pattern = 0x405) which has multiple bits set.
>> To test for a value type, we then need to do something like this:
>>
>> mov $0x405,%r10d
>> and (%rcx),%r10
>> cmp $0x405,%r10
>> je -> is_value
>>
>> Or this one:
>>
>> mov $0xffffffffffffffff,%r10
>> xor (%rcx),%r10
>> test $0x405,%r10
>> je -> is_value
>>
>> I'm currently working on updating the patch to use the mark word but I need
>> more time. I've noticed that the perturbing approach is much more difficult
>> in that case. The best instruction sequence I could come up with is
>> something like:
>>
>> mov (%rcx),%r10
>> mov $0x405,%r11d
>> andn %r11,%r10,%r10 // r10 = 0 for values, > 0 for others
>> dec %r10 // r10 < 0 for values, >= 0 for others
>> sar $0x3f,%r10
>> add %r10,%rcx
>>
>> Or this one:
>>
>> movabs $0x7ffffffffffffbfb,%r10 // MAX_LONG - 0x405 + 1
>> mov $0x405,%r11d
>> and (%rcx),%r11
>> add %r10,%r11 // This will overflow to MIN_LONG for values
>> sar $0x3f,%r11
>> add %r11,%rcx
>>
>> One could also use the set* instructions but we currently cannot easily
>> lower these from C2 IR:
>>
>> mov (%rcx), %rdx
>> xorl %eax, %eax
>> andl $0x405, %edx
>> cmpq $0x405, %rdx
>> sete %al
>> add %rcx, %rax
>>
>> I've also checked whether the BMI instructions (other than "andn") would
>> help but I don't think so.
>>
>> Looks like this complexity is an argument against the perturbation
>> approach. I'll nevertheless implement both approaches and report back once
>> I have a working version.
>>
>>> I just want to give you a couple of pieces of advice about microbenchmarks.
>>> - A double DONT_INLINE is not required; having it only on "cmpEq" is enough
>>> and you'll get less noisy invocation overhead.
>>> - When dealing with small operations it's better to switch the benchmark
>>> mode to average time in nanoseconds, which allows you to quickly notice
>>> if something is wrong.
>> Okay, thanks. I'll update my benchmark accordingly.
>>
>>> I am still thinking about how to make a representative set of acmp micros
>>> that properly covers the inlined case as the most important one.
>>> I'll push them when they are ready.
>> Sounds good!
>>
>> I'm using the following test cases for correctness testing of the
>> acmp-specific C2 optimizations:
>> http://hg.openjdk.java.net/valhalla/valhalla/file/d8a6985f0b99/test/hotspot/jtreg/compiler/valhalla/valuetypes/TestNewAcmp.java
>>
>>
>> Best regards,
>> Tobias
>
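The arithmetic behind the instruction sequences quoted above can be checked with a small Java model. It uses only the constants given in the thread (0x405 and 0x7ffffffffffffbfb). The class and method names are hypothetical, and a Java long stands in for the 64-bit mark word, so this is a sketch of the arithmetic, not of what C2 emits.

```java
// Models the quoted instruction sequences on a 64-bit mark word; names are assumptions.
final class ValuePatternCheck {
    static final long PATTERN = 0x405L; // markOopDesc::always_locked_pattern from the thread

    // and / cmp: all pattern bits must be set.
    static boolean viaAndCmp(long mark) {
        return (mark & PATTERN) == PATTERN;
    }

    // xor with all ones / test: no pattern bit may be missing.
    static boolean viaXorTest(long mark) {
        return ((mark ^ -1L) & PATTERN) == 0;
    }

    // andn / dec / sar: yields -1 for values and 0 for others.
    static long maskViaAndn(long mark) {
        long r = ~mark & PATTERN; // 0 for values, > 0 for others
        r -= 1;                   // < 0 for values, >= 0 for others
        return r >> 63;           // arithmetic shift spreads the sign bit
    }

    // and / add / sar: the add overflows to MIN_LONG only for values.
    static long maskViaOverflow(long mark) {
        long r = (mark & PATTERN) + 0x7ffffffffffffbfbL; // MAX_LONG - 0x405 + 1
        return r >> 63;
    }
}
```

The two mask variants produce the value that the quoted sequences add to the pointer: -1 for a value type, which perturbs the address, and 0 otherwise, which leaves it unchanged.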