optimizing acmp in L-World

Wed Aug 29 20:30:09 UTC 2018

Hi Tobias,

I've written a set of 'acmp' microbenchmarks and pushed them to the
repository. You can find them in the 'oracle.micro.valhalla.baseline.acmp'
package. I think the set is quite representative. The key idea is to
compare a set of references inside a loop. There are two versions: one
where the comparison is used for branching (an if-condition) and one
where the comparison result is used as a boolean value. The benchmarks
also have a parameter that controls the percentage of equal/not-equal
values.
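
To illustrate the two shapes described above, here is a minimal plain-Java
sketch. It is not the code from the repository and does not use JMH; the
class name, the helper makeData and the eqPercent parameter are made up
for illustration only.

```java
import java.util.Random;

public class AcmpLoopSketch {
    // Hypothetical helper: builds two arrays in which about eqPercent% of
    // the positions hold the same reference on both sides.
    static Object[][] makeData(int size, int eqPercent, long seed) {
        Random rnd = new Random(seed);
        Object[] left = new Object[size];
        Object[] right = new Object[size];
        for (int i = 0; i < size; i++) {
            left[i] = new Object();
            right[i] = rnd.nextInt(100) < eqPercent ? left[i] : new Object();
        }
        return new Object[][] { left, right };
    }

    // Version 1: the comparison result drives a branch.
    static int cmpBranch(Object[] a, Object[] b) {
        int sum = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] == b[i]) {
                sum += 1;
            } else {
                sum += 2;
            }
        }
        return sum;
    }

    // Version 2: the comparison result is consumed as a boolean value.
    static void cmpValue(Object[] a, Object[] b, boolean[] out) {
        for (int i = 0; i < a.length; i++) {
            out[i] = (a[i] == b[i]);
        }
    }

    public static void main(String[] args) {
        Object[][] d = makeData(1000, 50, 42L);
        boolean[] out = new boolean[1000];
        cmpValue(d[0], d[1], out);
        System.out.println("branch sum = " + cmpBranch(d[0], d[1]));
    }
}
```

With eqPercent set to 100 every pair is the same reference, and with 0
every pair is distinct, which corresponds to the two extremes of the
benchmark parameter.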

I found two issues related to -XX:+UseOldCmp:
1. The '!=' operation crashes the JVM when 100% of the values are equal
(JMH options to reproduce: "IsCmp.isNotCmpBranch   -p eq=100")
2. -XX:+UseOldCmp combined with PrintAssembly crashes the JVM (in
particular when I tried to use the "-prof perfasm" JMH option)

Non-isolated benchmarks (where the 'acmp' code is inlined) show that
-XX:+UseOldCmp is always faster than -XX:-UseOldCmp.
I've attached 4 charts showing this ('==' and '!=' operations, each used
for branching and as a boolean value).

There is also another consideration related to CompressedOops.
Before Valhalla, reference comparison did not need to uncompress
oops, because comparing the compressed oops is enough.

-XX:-UseOldCmp uncompresses both references, which increases the
'acmp' overhead.

-XX:+UseOldCmp may uncompress only one reference, and only when
the compressed oops are equal.
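
The difference in decode work can be modelled with a toy sketch. This is
not HotSpot code: the array-based "heap", the zero heap base, the 3-bit
shift and the rule that acmp yields false for a value type are all
assumptions made for illustration, and the names acmpOld/acmpNew are
hypothetical.

```java
public class CompressedAcmpSketch {
    static final int SHIFT = 3;   // assumed 8-byte object alignment
    static final long BASE = 0L;  // assumed zero heap base

    // Counts how many narrow references were uncompressed.
    static int decodes = 0;

    // Toy heap indexed by narrow reference: index 0 is null,
    // index 2 is the only value-type instance (assumption).
    static final boolean[] IS_VALUE = { false, false, true, false };

    static long decode(int narrow) {
        decodes++;
        return BASE + ((long) narrow << SHIFT);
    }

    static boolean isValueAt(long addr) {
        return IS_VALUE[(int) ((addr - BASE) >>> SHIFT)];
    }

    // Models the old-style check: compare the compressed references
    // first, and only when they are equal (and non-null) uncompress
    // one of them to look at the header.
    static boolean acmpOld(int a, int b) {
        if (a != b) return false;
        if (a == 0) return true;
        return !isValueAt(decode(a));
    }

    // Models the perturbation-style check: both references are
    // uncompressed, the first is perturbed if it is a value type,
    // and the full addresses are compared.
    static boolean acmpNew(int a, int b) {
        long pa = decode(a);
        long pb = decode(b);
        if (a != 0 && isValueAt(pa)) {
            pa -= 1; // perturbed address can never equal a real one
        }
        return pa == pb;
    }

    public static void main(String[] args) {
        acmpOld(1, 3);
        System.out.println("decodes after old, different refs: " + decodes);
        acmpNew(1, 3);
        System.out.println("decodes after new, different refs: " + decodes);
    }
}
```

In this model the old-style check does no decode at all when the
compressed references differ, while the perturbation-style check always
pays for two.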

On 08/27/2018 11:23 AM, Sergey Kuksenko wrote:
> Hi Tobias,
>
> I've checked your benchmark and the UseOldAcmp option.
>
> I have to say that according to my measurements -XX:+UseOldAcmp shows
> better performance in all cases, even when comparing the same
> references. I managed to construct and reproduce a corner case where
> comparing the same refs is faster with -XX:-UseOldAcmp. The reason is
> that the -XX:-UseOldAcmp generated code uses a conditional move into
> the result register, while -XX:+UseOldAcmp generates branches. So if
> -XX:-UseOldAcmp generated branches it would be slower, and if
> -XX:+UseOldAcmp did not generate branches it would be faster.
> I think a benchmark measuring isolated acmp performance is not relevant
> to the way acmp is used in applications. By isolated I mean a
> non-inlined method with the result in a register. Right now I am trying
> to make a non-isolated benchmark where acmp is used as a condition with
> the corresponding branching.
>
> I've found yet another benchmarking pitfall here. Typically JMH
> executes all sub-benchmarks in separate VMs, which means that when
> measuring o1==o1 we have only that branch in the profile. If you want
> to measure full acmp performance (full meaning that all acmp branches
> are in the profile), you have to use another JMH option, "-wm BULK",
> which provides bulk warmup of all combinations before measurement.
>
> As for markOopDesc::always_locked_pattern = 0x405, just tell me when
> it is ready and I'll check it. But in any case its cost should be
> higher than that of 1 bit in the class word.
>
> On 08/24/2018 07:05 AM, Tobias Hartmann wrote:
>> Hi Sergey,
>>
>> On 22.08.2018 01:21, Sergey Kuksenko wrote:
>>> In the branching case I'd rather suggest replacing
>>>
>>>   mov    0x8(%rcx),%r11d
>>>   mov    %r11,%r10
>>>   shr    $0x3,%r10
>>>   test   $0x1,%r10
>>>   jne...
>>>
>>> with
>>>
>>>   test    $0x8,0x8(%rcx)
>>>   jne...
>>>
>>> It saves registers, which may matter in heavily inlined code.
>> Yes, but that's only possible if we are testing against a single bit.
>> That's the case with the current klass pointer alignment trick but that
>> will go away. In the future we will use a special bit pattern in the
>> mark word (markOopDesc::always_locked_pattern = 0x405) which has
>> multiple bits set. To test for a value type, we then need to do
>> something like this:
>>
>>    mov    $0x405,%r10d
>>    and    (%rcx),%r10
>>    cmp    $0x405,%r10
>>    je     -> is_value
>>
>> Or this one:
>>
>>    mov    $0xffffffffffffffff,%r10
>>    xor    (%rcx),%r10
>>    test   $0x405,%r10
>>    je     -> is_value
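
The two multi-bit tests above can be checked against each other with a
small Java sketch. The class and method names are hypothetical; only the
0x405 pattern comes from the thread. It also shows why a single 'test'
against the mask is not enough once several bits are involved: 'test'
only reports whether any masked bit is set.

```java
public class ValuePatternSketch {
    static final long PATTERN = 0x405L;

    // Mirrors: and (%rcx),%r10 ; cmp $0x405,%r10 ; je is_value
    static boolean isValueAndCmp(long mark) {
        return (mark & PATTERN) == PATTERN;
    }

    // Mirrors: mov $-1,%r10 ; xor (%rcx),%r10 ; test $0x405,%r10 ; je is_value
    static boolean isValueXorTest(long mark) {
        long inverted = mark ^ -1L;        // same as ~mark
        return (inverted & PATTERN) == 0;  // zero iff all pattern bits were set
    }

    // What a single 'test $0x405,(%rcx)' would report: any bit set.
    static boolean anyPatternBit(long mark) {
        return (mark & PATTERN) != 0;
    }

    public static void main(String[] args) {
        // Exhaustively compare the two full-pattern tests on the low 12 bits.
        for (long mark = 0; mark <= 0xFFFL; mark++) {
            if (isValueAndCmp(mark) != isValueXorTest(mark)) {
                throw new AssertionError("mismatch at " + Long.toHexString(mark));
            }
        }
        // 0x001 has one pattern bit set, so the single-bit style test
        // fires although the full pattern is absent.
        System.out.println(anyPatternBit(0x001L) + " " + isValueAndCmp(0x001L));
    }
}
```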
>>
>> I'm currently working on updating the patch to use the mark word but
>> I need more time. I've noticed that the perturbing approach is much
>> more difficult in that case. The best instruction sequence I could come
>> up with is something like:
>>
>>    mov    (%rcx),%r10
>>    mov    $0x405,%r11d
>>    andn   %r11,%r10,%r10             // r10 = 0 for values, > 0 for others
>>    dec    %r10                       // r10 < 0 for values, >= 0 for others
>>    sar    $0x3f,%r10
>>    add    %r10,%rcx
>>
>> Or this one:
>>
>>    movabs $0x7ffffffffffffbfb,%r10   // MAX_LONG - 0x405 + 1
>>    mov    $0x405,%r11d
>>    and    (%rcx),%r11
>>    add    %r10,%r11                  // This will overflow to MIN_LONG for values
>>    sar    $0x3f,%r11
>>    add    %r11,%rcx
>>
>> One could also use the set* instructions but we currently cannot
>> easily lower these from C2 IR:
>>
>>    mov   (%rcx), %rdx
>>    xorl  %eax, %eax
>>    andl  $0x405, %edx
>>    cmpq  $0x405, %rdx
>>    sete  %al
>>    add   %rcx, %rax
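
The arithmetic behind the three branch-free sequences above can be
replayed in Java, since 'long' addition wraps on overflow and '>>' is an
arithmetic shift like 'sar'. This is only a sketch of the bit tricks, not
C2 code; the class and method names are hypothetical, and 'addr' and
'mark' stand in for %rcx and the word loaded from (%rcx).

```java
public class MarkWordPerturbSketch {
    static final long PATTERN = 0x405L;

    // andn / dec / sar / add variant.
    static long perturbAndn(long addr, long mark) {
        long r10 = ~mark & PATTERN; // 0 for values, > 0 for others
        r10 -= 1;                   // -1 for values, >= 0 for others
        r10 >>= 63;                 // -1 for values, 0 for others
        return addr + r10;
    }

    // and / add-with-overflow / sar / add variant.
    static long perturbAddOverflow(long addr, long mark) {
        long r10 = 0x7ffffffffffffbfbL; // MAX_LONG - 0x405 + 1
        long r11 = mark & PATTERN;
        r11 += r10;                     // wraps to MIN_LONG only for values
        r11 >>= 63;                     // -1 for values, 0 for others
        return addr + r11;
    }

    // set* variant: adds 1 instead of subtracting 1.
    static long perturbSete(long addr, long mark) {
        long rax = ((mark & PATTERN) == PATTERN) ? 1L : 0L;
        return addr + rax;
    }

    public static void main(String[] args) {
        long addr = 0x1000L;
        long[] marks = { 0x405L, 0x001L, 0x404L, 0x0L };
        for (long mark : marks) {
            System.out.println(Long.toHexString(mark) + ": "
                + Long.toHexString(perturbAndn(addr, mark)) + " "
                + Long.toHexString(perturbAddOverflow(addr, mark)) + " "
                + Long.toHexString(perturbSete(addr, mark)));
        }
    }
}
```

In all three variants the address is left unchanged unless every bit of
the pattern is set, and a changed address can no longer compare equal to
an aligned object address.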
>>
>> I've also checked whether the BMI instructions (other than "andn")
>> would help, but I don't think so.
>>
>> It looks like this complexity is an argument against the perturbation
>> approach. I'll nevertheless implement both approaches and report back
>> once I have a working version.
>>
>>> I just want to give you a couple of pieces of advice about
>>> microbenchmarks.
>>> - A double DONT_INLINE is not required; having it only on "cmpEq" is
>>> enough, and the invocation overhead will be less noisy.
>>> - When dealing with small operations it's better to switch the
>>> benchmark mode to average time in nanoseconds; that lets you quickly
>>> notice if something is wrong.
>> Okay, thanks. I'll update my benchmark accordingly.
>>
>>> I am still thinking about how to make a representative set of acmp
>>> micros that properly covers the inlined case as the most important one.
>>> I'll push them when they are ready.
>> Sounds good!
>>
>> I'm using the following test cases for correctness testing of the
>> acmp-specific C2 optimizations:
>> http://hg.openjdk.java.net/valhalla/valhalla/file/d8a6985f0b99/test/hotspot/jtreg/compiler/valhalla/valuetypes/TestNewAcmp.java 
>>
>>
>> Best regards,
>> Tobias
>