optimizing acmp in L-World
Sergey Kuksenko
sergey.kuksenko at oracle.com
Wed Aug 29 20:39:35 UTC 2018
Oops.
The attachments were cut.
I put them here: http://cr.openjdk.java.net/~skuksenko/valhalla/acmp/
On 08/29/2018 01:30 PM, Sergey Kuksenko wrote:
> Hi Tobias,
>
> I've written 'acmp' microbenchmarks and pushed them to the repository.
> You can find them in the 'oracle.micro.valhalla.baseline.acmp' package.
> I think it is quite a representative set. The key idea is to compare a
> set of references inside a loop. There are two versions: one where the
> comparison is used for branching (an if-condition) and one where the
> comparison is used as a boolean value. The benchmarks also have a
> parameter that controls the percentage of equal/not-equal values.
>
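To make the two benchmark shapes described above concrete, here is a minimal plain-Java sketch. It is not the code from the 'oracle.micro.valhalla.baseline.acmp' package; the class name `AcmpShapes`, the helper `makePairs` and the parameter name `eqPercent` are hypothetical, and the JMH harness is left out on purpose.

```java
import java.util.Random;

// Hypothetical illustration only; not the actual benchmark source.
class AcmpShapes {
    // Builds two arrays in which roughly eqPercent percent of the positions
    // hold the same reference in both arrays (assumed parameter semantics).
    static Object[][] makePairs(int size, int eqPercent, long seed) {
        Random rnd = new Random(seed);
        Object[] a = new Object[size];
        Object[] b = new Object[size];
        for (int i = 0; i < size; i++) {
            a[i] = new Object();
            b[i] = (rnd.nextInt(100) < eqPercent) ? a[i] : new Object();
        }
        return new Object[][] { a, b };
    }

    // Shape 1: the comparison result feeds an if-condition (branching).
    static int countEqualBranch(Object[] a, Object[] b) {
        int count = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] == b[i]) {
                count++;
            }
        }
        return count;
    }

    // Shape 2: the comparison result is consumed as a boolean value.
    static boolean[] equalAsValue(Object[] a, Object[] b) {
        boolean[] out = new boolean[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = (a[i] == b[i]);
        }
        return out;
    }
}
```

With `eqPercent` set to 100 every pair is equal, and with 0 no pair is equal, which corresponds to the two extremes of the equal/not-equal parameter.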
> Two issues related to -XX:+UseOldCmp were found:
> 1. The '!=' operation crashes the JVM when 100% of the values are
> equal (JMH options to reproduce: "IsCmp.isNotCmpBranch -p eq=100")
> 2. -XX:+UseOldCmp combined with PrintAssembly crashes the JVM (in
> particular when I tried to use the "-prof perfasm" JMH option)
>
> Non-isolated benchmarks (when 'acmp' code is inlined) show that
> -XX:+UseOldCmp is always faster than -XX:-UseOldCmp.
> I've attached 4 charts showing this ('==' and '!=' operations,
> branching and boolean value)
>
> There is also another consideration related to CompressedOops.
> Before Valhalla, reference comparison did not need to uncompress
> oops, because comparing the compressed oops was enough.
>
> -XX:-UseOldCmp uncompresses both references. That increases the
> 'acmp' overhead.
>
> -XX:+UseOldCmp may uncompress only one reference, and only when the
> compressed oops are equal.
>
>
>
>
> On 08/27/2018 11:23 AM, Sergey Kuksenko wrote:
>> Hi Tobias,
>>
>> I've checked your benchmark and the UseOldAcmp option.
>>
>> I have to say that according to my measurements -XX:+UseOldAcmp shows
>> better performance in all cases, even when comparing the same
>> references. I managed to construct and reproduce a corner case where
>> comparing the same refs performs better with -XX:-UseOldAcmp. The
>> reason is that the code generated with -XX:-UseOldAcmp uses a
>> conditional move into the result register, while -XX:+UseOldAcmp
>> generates branches. So if -XX:-UseOldAcmp generates branches, it will
>> be slower, and if -XX:+UseOldAcmp doesn't generate branches, it will
>> be faster.
>> I think a benchmark measuring isolated acmp performance is not relevant
>> to how acmp is used in applications. By isolated I mean a non-inlined
>> method with the result in a register. Right now I am trying to make a
>> non-isolated benchmark where acmp is used as a condition with the
>> corresponding branching.
>>
>> I've found yet another benchmarking pitfall here. Typically JMH
>> executes all sub-benchmarks in separate VMs, which means that when
>> measuring o1==o1 we have only that branch in the profile. If you want
>> to measure full acmp performance (full meaning that all acmp branches
>> are in the profile), you have to use yet another JMH option, "-wm
>> BULK", which provides bulk warmup of all combinations before measurement.
>>
>> As for markOopDesc::always_locked_pattern = 0x405, just tell me when
>> it's ready and I'll check it. But in any case the cost of this should
>> be higher than that of 1 bit in the class word.
>>
>> On 08/24/2018 07:05 AM, Tobias Hartmann wrote:
>>> Hi Sergey,
>>>
>>> On 22.08.2018 01:21, Sergey Kuksenko wrote:
>>>> I'd rather suggest in case of branching to replace
>>>>
>>>> mov 0x8(%rcx),%r11d
>>>> mov %r11,%r10
>>>> shr $0x3,%r10
>>>> test $0x1,%r10
>>>> jne...
>>>>
>>>> with
>>>>
>>>> test 0x8(%rcx),$0x8
>>>> jne...
>>>>
>>>> It will save registers, which may be noticeable in highly inlined code.
>>> Yes but that's only possible if we are testing against a single bit.
>>> That's the case with the current klass pointer alignment trick but
>>> that will go away. In the future we will use a special bit pattern in
>>> the mark word (markOopDesc::always_locked_pattern = 0x405) which has
>>> multiple bits set. To test for a value type, we then need to do
>>> something like this:
>>>
>>> mov $0x405,%r10d
>>> and (%rcx),%r10
>>> cmp $0x405,%r10
>>> je -> is_value
>>>
>>> Or this one:
>>>
>>> mov $0xffffffffffffffff,%r10
>>> xor (%rcx),%r10
>>> test $0x405,%r10
>>> je -> is_value
>>>
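The two instruction sequences above compute the same predicate. As a rough model only, here it is in Java arithmetic on a 64-bit header word. The class and method names are hypothetical, and the value 0x405 is taken from the message as quoted.

```java
// Hypothetical model of the two checks; not HotSpot code.
final class ValueBitTest {
    // Multi-bit pattern quoted in the message (always_locked_pattern = 0x405).
    static final long ALWAYS_LOCKED_PATTERN = 0x405L;

    // Models the mov/and/cmp/je variant: mask the header word and compare
    // the result with the full pattern.
    static boolean isValueMaskCompare(long markWord) {
        return (markWord & ALWAYS_LOCKED_PATTERN) == ALWAYS_LOCKED_PATTERN;
    }

    // Models the mov/xor/test/je variant: invert the header word, then check
    // that none of the pattern bits is set in the inverted word.
    static boolean isValueInvertTest(long markWord) {
        return ((markWord ^ -1L) & ALWAYS_LOCKED_PATTERN) == 0;
    }
}
```

Both methods return true only when every bit of the pattern is set in the header word, which is why a single-bit test instruction is no longer sufficient.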
>>> I'm currently working on updating the patch to use the mark word but
>>> I need more time. I've noticed that the perturbing approach is much
>>> more difficult in that case. The best instruction sequence I could
>>> come up with is something like:
>>>
>>> mov (%rcx),%r10
>>> mov $0x405,%r11d
>>> andn %r11,%r10,%r10 // r10 = 0 for values, > 0 for others
>>> dec %r10 // r10 < 0 for values, >= 0 for others
>>> sar $0x3f,%r10
>>> add %r10,%rcx
>>>
>>> Or this one:
>>>
>>> movabs $0x7ffffffffffffbfb,%r10 // MAX_LONG - 0x405 + 1
>>> mov $0x405,%r11d
>>> and (%rcx),%r11
>>> add %r10,%r11 // This will overflow to MIN_LONG for values
>>> sar $0x3f,%r11
>>> add %r11,%rcx
>>>
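The arithmetic behind the two branch-free sequences above can be checked with a small Java model using 64-bit two's-complement longs. This is a sketch under the assumption that `%rcx` holds the object address and `(%rcx)` the header word; the class and method names are hypothetical.

```java
// Hypothetical model of the branch-free perturbation; not HotSpot code.
final class PerturbSketch {
    static final long PATTERN = 0x405L;

    // Models andn / dec / sar: yields -1 when all pattern bits are set, else 0.
    static long maskAndn(long mark) {
        long t = ~mark & PATTERN; // 0 for values, > 0 for others
        t = t - 1;                // -1 for values, >= 0 for others
        return t >> 63;           // arithmetic shift spreads the sign bit
    }

    // Models and / add / sar with the constant MAX_LONG - 0x405 + 1.
    static long maskAddOverflow(long mark) {
        long t = mark & PATTERN;      // at most 0x405
        t = t + 0x7ffffffffffffbfbL;  // wraps to Long.MIN_VALUE only for 0x405
        return t >> 63;
    }

    // Models the final add: the address changes by -1 only for values.
    static long perturb(long address, long mark) {
        return address + maskAndn(mark);
    }
}
```

In this model both mask functions agree for every header word, so the choice between the two sequences comes down to instruction availability (`andn` is a BMI1 instruction) and register pressure rather than semantics.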
>>> One could also use the set* instructions but we currently cannot
>>> easily lower these from C2 IR:
>>>
>>> mov (%rcx), %rdx
>>> xorl %eax, %eax
>>> andl $0x405, %edx
>>> cmpq $0x405, %rdx
>>> sete %al
>>> add %rcx, %rax
>>>
>>> I've also checked if the BMI instructions (other than "andn") would
>>> help but I don't think so.
>>>
>>> Looks like this complexity is an argument against the perturbation
>>> approach. I'll nevertheless implement both approaches and report back
>>> once I have a working version.
>>>
>>>> I just want to give you a couple of pieces of advice about
>>>> microbenchmarks.
>>>> - The double DONT_INLINE is not required; having it only on "cmpEq"
>>>> is enough, and you'll get less noisy invocation overhead.
>>>> - When dealing with small operations it's better to switch the
>>>> benchmark mode to average time in nanoseconds; that makes it quick
>>>> to notice if something is wrong.
>>> Okay, thanks. I'll update my benchmark accordingly.
>>>
>>>> I am still thinking about how to make a representative set of acmp
>>>> micros that properly covers the inlined case as the most important
>>>> one.
>>>> I'll push them when they are ready.
>>> Sounds good!
>>>
>>> I'm using the following test cases for correctness testing of the
>>> acmp specific C2 optimizations:
>>> http://hg.openjdk.java.net/valhalla/valhalla/file/d8a6985f0b99/test/hotspot/jtreg/compiler/valhalla/valuetypes/TestNewAcmp.java
>>>
>>>
>>> Best regards,
>>> Tobias
>>
>