optimizing acmp in L-World

Sergey Kuksenko sergey.kuksenko at oracle.com
Wed Aug 29 20:39:35 UTC 2018


Oops.

The attachments were cut.

I've put them here: http://cr.openjdk.java.net/~skuksenko/valhalla/acmp/


On 08/29/2018 01:30 PM, Sergey Kuksenko wrote:
> Hi Tobias,
>
> I've written the 'acmp' microbenchmarks and put them into the repository. You can find 
> them in the 'oracle.micro.valhalla.baseline.acmp' package. I think that is 
> a fairly representative set. The key idea is to compare a set of 
> references inside a loop. There are two versions: one where the comparison is 
> used for branching (if-condition) and one where it is used as a boolean 
> value. The benchmarks also have a parameter which controls the 
> percentage of equal/not-equal values.
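>
> To give a rough idea of the shape, a sketch of such a benchmark could look like 
> the following (this is only an illustration, not the actual code in the repository; 
> the class and field names and the way the 'eq' parameter is applied are made up):
>
> import java.util.Random;
> import org.openjdk.jmh.annotations.*;
>
> @State(Scope.Thread)
> public class AcmpSketch {
>     @Param({"0", "50", "100"})
>     int eq;                        // percentage of equal pairs
>
>     Object[] left, right;
>
>     @Setup
>     public void setup() {
>         Random rnd = new Random(42);
>         left = new Object[1024];
>         right = new Object[1024];
>         for (int i = 0; i < left.length; i++) {
>             left[i] = new Object();
>             // 'eq' percent of the pairs reference the same object
>             right[i] = (rnd.nextInt(100) < eq) ? left[i] : new Object();
>         }
>     }
>
>     @Benchmark
>     public int branching() {       // comparison drives an if-condition
>         int hits = 0;
>         for (int i = 0; i < left.length; i++) {
>             if (left[i] == right[i]) hits++;
>         }
>         return hits;
>     }
>
>     @Benchmark
>     public boolean asBoolean() {   // comparison used as a boolean value
>         boolean acc = false;
>         for (int i = 0; i < left.length; i++) {
>             acc ^= (left[i] == right[i]);
>         }
>         return acc;
>     }
> }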
>
> Two issues related to -XX:+UseOldCmp were found:
> 1. The '!=' operation with 100% equal values causes a JVM crash (JMH 
> options to reproduce: "IsCmp.isNotCmpBranch   -p eq=100")
> 2. -XX:+UseOldCmp together with PrintAssembly causes a JVM crash (in particular 
> when I tried to use the "-prof perfasm" JMH option)
>
> Non-isolated benchmarks (where the 'acmp' code is inlined) show that 
> -XX:+UseOldCmp is always faster than -XX:-UseOldCmp.
> I've attached 4 charts showing this ('==' and '!=' operations, with both 
> branching and boolean-value usage).
>
> There is also another consideration, related to CompressedOops. 
> Before Valhalla, reference comparison didn't need to decompress 
> oops, because comparing the compressed oops is enough.
>
> -XX:-UseOldCmp decompresses both references. That increases the 
> 'acmp' overhead.
>
> -XX:+UseOldCmp may decompress only one reference, and only in the case 
> when the compressed oops are equal.
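>
> In Java-like pseudocode the check order that -XX:+UseOldCmp allows could look 
> roughly like this (only a sketch of the idea, assuming the prototype semantics 
> where acmp on a value type yields false; decompress() and isValueType() are 
> hypothetical stand-ins for the JIT-internal operations, not real APIs):
>
> // sketch only: illustrates the check order, not actual VM code
> static boolean acmpOldStyle(int narrowA, int narrowB) {
>     if (narrowA != narrowB) {
>         return false;                    // fast path: compressed oops differ, nothing decompressed
>     }
>     Object a = decompress(narrowA);      // only one reference is decompressed,
>     return !isValueType(a);              // and only when the compressed oops are equal
> }
>
> // hypothetical stand-ins so the sketch compiles
> static Object decompress(int narrowOop) { throw new UnsupportedOperationException(); }
> static boolean isValueType(Object o)    { throw new UnsupportedOperationException(); }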
>
>
>
>
> On 08/27/2018 11:23 AM, Sergey Kuksenko wrote:
>> Hi Tobias,
>>
>> I've checked your benchmark and the UseOldAcmp option.
>>
>> I have to say that according to my measurements -XX:+UseOldAcmp shows 
>> better performance in all cases, even when comparing the same references. 
>> I managed to construct and reproduce a corner case where comparing the 
>> same refs performs better with -XX:-UseOldAcmp. The reason is that the 
>> code generated with -XX:-UseOldAcmp uses a conditional move into the 
>> result register, while -XX:+UseOldAcmp generates branches. So if 
>> -XX:-UseOldAcmp generated branches it would be slower, and if 
>> -XX:+UseOldAcmp didn't generate branches it would be faster.
>> I think a benchmark measuring isolated acmp performance is not representative 
>> of how acmp is used in applications. By isolated I mean a non-inlined 
>> method with the result in a register. Right now I am trying to make a 
>> non-isolated benchmark where acmp is used for a condition and the 
>> corresponding branching.
>>
>> I've found yet another benchmarking pitfall here. Typically JMH 
>> executes all sub-benchmarks in separate VMs, which means that when measuring 
>> o1==o1 we have only that branch in the profile. If you want to 
>> measure full acmp performance - full meaning that all acmp branches 
>> are in the profile - you have to use yet another JMH option, "-wm 
>> BULK", which provides bulk warmup of all combinations before measurement.
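>>
>> For completeness, the same warmup mode can also be selected programmatically 
>> through the JMH runner API (untested sketch; "IsCmp" as the include pattern 
>> is just an example):
>>
>> import org.openjdk.jmh.runner.Runner;
>> import org.openjdk.jmh.runner.options.Options;
>> import org.openjdk.jmh.runner.options.OptionsBuilder;
>> import org.openjdk.jmh.runner.options.WarmupMode;
>>
>> public class RunBulk {
>>     public static void main(String[] args) throws Exception {
>>         Options opt = new OptionsBuilder()
>>                 .include("IsCmp")                // the acmp benchmarks
>>                 .warmupMode(WarmupMode.BULK)     // warm up all benchmarks before measuring any of them
>>                 .build();
>>         new Runner(opt).run();
>>     }
>> }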
>>
>> As for markOopDesc::always_locked_pattern = 0x405 - just tell me when 
>> it's ready and I'll check it. But in any case the cost of this should be 
>> higher than testing 1 bit in the klass word.
>>
>> On 08/24/2018 07:05 AM, Tobias Hartmann wrote:
>>> Hi Sergey,
>>>
>>> On 22.08.2018 01:21, Sergey Kuksenko wrote:
>>>> I'd rather suggest in case of branching to replace
>>>>
>>>>   mov    0x8(%rcx),%r11d
>>>>   mov    %r11,%r10
>>>>   shr    $0x3,%r10
>>>>   test   $0x1,%r10
>>>>   jne...
>>>>
>>>> with
>>>>
>>>>   test   $0x8,0x8(%rcx)
>>>>   jne...
>>>>
>>>> It will save registers, which may be noticeable in highly inlined code.
>>> Yes but that's only possible if we are testing against a single bit. 
>>> That's the case with the
>>> current klass pointer alignment trick but that will go away. In the 
>>> future we will use a special bit
>>> pattern in the mark word (markOopDesc::always_locked_pattern = 
>>> 0x405) which has multiple bits set.
>>> To test for a value type, we then need to do something like this:
>>>
>>>    mov    $0x405,%r10d
>>>    and    (%rcx),%r10
>>>    cmp    $0x405,%r10
>>>    je     -> is_value
>>>
>>> Or this one:
>>>
>>>    mov    $0xffffffffffffffff,%r10
>>>    xor    (%rcx),%r10
>>>    test   $0x405,%r10
>>>    je     -> is_value
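>>>
>>> In plain Java both sequences implement the same check (just a sketch for 
>>> clarity, using the always_locked_pattern value from above):
>>>
>>>    // mark word test equivalent to the assembly above
>>>    static boolean isAlwaysLocked(long markWord) {
>>>        final long ALWAYS_LOCKED_PATTERN = 0x405;
>>>        // value type iff every bit of the pattern is set in the mark word
>>>        return (markWord & ALWAYS_LOCKED_PATTERN) == ALWAYS_LOCKED_PATTERN;
>>>    }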
>>>
>>> I'm currently working on updating the patch to use the mark word but 
>>> I need more time. I've noticed
>>> that the perturbing approach is much more difficult in that case. 
>>> The best instruction sequence I
>>> could come up with is something like:
>>>
>>>    mov    (%rcx),%r10
>>>    mov    $0x405,%r11d
>>>    andn   %r11,%r10,%r10             // r10 = 0 for values, > 0 for others
>>>    dec    %r10                       // r10 < 0 for values, >= 0 for others
>>>    sar    $0x3f,%r10
>>>    add    %r10,%rcx
>>>
>>> Or this one:
>>>
>>>    movabs $0x7ffffffffffffbfb,%r10   // MAX_LONG - 0x405 + 1
>>>    mov    $0x405,%r11d
>>>    and    (%rcx),%r11
>>>    add    %r10,%r11                  // This will overflow to MIN_LONG for values
>>>    sar    $0x3f,%r11
>>>    add    %r11,%rcx
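>>>
>>> If I read the sequences correctly, in Java-ish pseudocode the perturbation 
>>> boils down to the following (sketch only; 'ptr' stands for the raw oop value 
>>> that is later compared against the other operand):
>>>
>>>    // perturb one operand so that the plain pointer compare that follows
>>>    // always reports "not equal" when the operand is a value type
>>>    static long perturb(long ptr, long markWord) {
>>>        long mask = ((markWord & 0x405) == 0x405) ? -1L : 0L;  // all ones for values, 0 otherwise
>>>        return ptr + mask;                                     // ptr - 1 for values, unchanged otherwise
>>>    }
>>>
>>> Since oops are aligned, ptr - 1 can never equal another valid oop, so the 
>>> compare fails for value types no matter what the other operand is.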
>>>
>>> One could also use the set* instructions, but we currently cannot 
>>> easily lower these from C2 IR:
>>>
>>>    mov   (%rcx), %rdx
>>>    xorl  %eax, %eax
>>>    andl  $0x405, %edx
>>>    cmpq  $0x405, %rdx
>>>    sete  %al
>>>    add   %rcx, %rax
>>>
>>> I've also checked whether the BMI instructions (other than "andn") would 
>>> help, but I don't think so.
>>>
>>> Looks like this complexity is an argument against the perturbation 
>>> approach. I'll nevertheless
>>> implement both approaches and report back once I have a working 
>>> version.
>>>
>>>> I just want to give you a couple of pieces of advice about microbenchmarks.
>>>> - double DONT_INLINE is not required; having it only on "cmpEq" is 
>>>> enough, and you'll get less noisy invocation overhead.
>>>> - when dealing with small operations it's better to switch the benchmark 
>>>> mode to average time in nanoseconds - that allows you to quickly notice 
>>>> if something is wrong (see the sketch below).
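>>>>
>>>> For example, something like this on the benchmark class (illustrative only; 
>>>> the class and method names are not your actual benchmark):
>>>>
>>>> import java.util.concurrent.TimeUnit;
>>>> import org.openjdk.jmh.annotations.*;
>>>>
>>>> @BenchmarkMode(Mode.AverageTime)         // report average time per operation
>>>> @OutputTimeUnit(TimeUnit.NANOSECONDS)    // ... in nanoseconds
>>>> @State(Scope.Thread)
>>>> public class CmpBench {
>>>>     Object o1 = new Object(), o2 = new Object();
>>>>
>>>>     @CompilerControl(CompilerControl.Mode.DONT_INLINE)   // only on cmpEq, not on the callers
>>>>     boolean cmpEq(Object a, Object b) {
>>>>         return a == b;
>>>>     }
>>>>
>>>>     @Benchmark
>>>>     public boolean same()      { return cmpEq(o1, o1); }
>>>>
>>>>     @Benchmark
>>>>     public boolean different() { return cmpEq(o1, o2); }
>>>> }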
>>> Okay, thanks. I'll update my benchmark accordingly.
>>>
>>>> I am still thinking about how to make a representative set of acmp micros, 
>>>> properly covering the inlined case as the most important one.
>>>> I'll push them when they are ready.
>>> Sounds good!
>>>
>>> I'm using the following test cases for correctness testing of the 
>>> acmp specific C2 optimizations:
>>> http://hg.openjdk.java.net/valhalla/valhalla/file/d8a6985f0b99/test/hotspot/jtreg/compiler/valhalla/valuetypes/TestNewAcmp.java 
>>>
>>>
>>> Best regards,
>>> Tobias
>>
>



