optimizing acmp in L-World

Mon Aug 27 18:23:56 UTC 2018

Hi Tobias,

I've checked your benchmark and the UseOldAcmp option.

I have to say that, according to my measurements, -XX:+UseOldAcmp shows 
better performance in all cases, even when comparing the same references. 
I did manage to construct and reproduce a corner case where comparing the 
same refs performs better with -XX:-UseOldAcmp. The reason is that the 
code generated with -XX:-UseOldAcmp uses a conditional move into the 
result register, while -XX:+UseOldAcmp generates branches. So if 
-XX:-UseOldAcmp were to generate branches it would be slower, and if 
-XX:+UseOldAcmp did not generate branches it would be faster.
I think a benchmark that measures isolated acmp performance is not 
representative of how acmp is used in applications. By "isolated" I mean 
a non-inlined method that returns the result in a register. Right now I 
am trying to write a non-isolated benchmark where acmp is used as a 
condition with the corresponding branching.
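
To make the distinction concrete, here is a rough sketch of the two 
shapes in plain Java. This is not the benchmark itself; the class and 
method names (other than "cmpEq") are made up for illustration, and the 
JMH annotations are left out so that it runs on its own.

```java
// Illustrative sketch only: names are hypothetical, no JMH involved.
class AcmpShapes {
    // "Isolated" shape: the comparison result is materialized as a value.
    // In a benchmark this method would additionally be kept out of line.
    static boolean cmpEq(Object a, Object b) {
        return a == b;
    }

    // "Non-isolated" shape: the comparison feeds a branch directly.
    static int countEqual(Object[] xs, Object[] ys) {
        int n = 0;
        for (int i = 0; i < xs.length; i++) {
            if (xs[i] == ys[i]) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        Object a = new Object();
        Object b = new Object();
        Object[] xs = {a, b, a, null};
        Object[] ys = {a, a, b, null};
        System.out.println(cmpEq(a, a));        // true
        System.out.println(countEqual(xs, ys)); // 2
    }
}
```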

I've found yet another benchmarking pitfall here. Typically JMH executes 
each sub-benchmark in a separate VM, which means that when measuring 
o1==o1 we have only that branch in the profile. If you want to measure 
full acmp performance, where "full" means that all acmp branches are in 
the profile, you have to use another JMH option, "-wm BULK", which 
warms up all combinations in bulk before measurement.

As for markOopDesc::always_locked_pattern = 0x405: just tell me when 
it's ready and I'll check it. In any case, the cost of this should be 
higher than that of 1 bit in the class word.
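
As a rough illustration of why I expect the multi-bit pattern to cost 
more, here is a small Java model of the checks (a model of the logic 
only, not of what C2 emits). The class and method names are made up; 
only the value 0x405 comes from this thread. With a single bit, "mask 
is non-zero" is the whole test. With several bits, the same test gives 
a false positive for a partial match, so a compare (or an inverted 
test) is needed on top of the mask.

```java
// Hypothetical model, not HotSpot code: names are illustrative only.
class MarkWordChecks {
    // Value quoted in this thread for markOopDesc::always_locked_pattern.
    static final long ALWAYS_LOCKED_PATTERN = 0x405L;

    // Single-bit style check: any non-zero mask result means "set".
    static boolean hasBit(long word, long bit) {
        return (word & bit) != 0;
    }

    // Multi-bit check, and/cmp form: all pattern bits must be set.
    static boolean isValueAndCmp(long mark) {
        return (mark & ALWAYS_LOCKED_PATTERN) == ALWAYS_LOCKED_PATTERN;
    }

    // Multi-bit check, xor/test form: no pattern bit may be clear.
    static boolean isValueXorTest(long mark) {
        return ((mark ^ -1L) & ALWAYS_LOCKED_PATTERN) == 0;
    }

    public static void main(String[] args) {
        long partial = 0x401L; // only some of the pattern bits are set
        System.out.println(hasBit(partial, ALWAYS_LOCKED_PATTERN)); // true (false positive)
        System.out.println(isValueAndCmp(partial));                 // false
        System.out.println(isValueXorTest(0x405L));                 // true
    }
}
```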

On 08/24/2018 07:05 AM, Tobias Hartmann wrote:
> Hi Sergey,
>
> On 22.08.2018 01:21, Sergey Kuksenko wrote:
>> I'd rather suggest in case of branching to replace
>>
>>   mov    0x8(%rcx),%r11d
>>   mov    %r11,%r10
>>   shr    $0x3,%r10
>>   test   $0x1,%r10
>>   jne...
>>
>> with
>>
>>   testl   $0x8,0x8(%rcx)
>>   jne...
>>
>> This saves registers, which may be noticeable in highly inlined code.
> Yes but that's only possible if we are testing against a single bit. That's the case with the
> current klass pointer alignment trick but that will go away. In the future we will use a special bit
> pattern in the mark word (markOopDesc::always_locked_pattern = 0x405) which has multiple bits set.
> To test for a value type, we then need to do something like this:
>
>    mov    $0x405,%r10d
>    and    (%rcx),%r10
>    cmp    $0x405,%r10
>    je     -> is_value
>
> Or this one:
>
>    mov    $0xffffffffffffffff,%r10
>    xor    (%rcx),%r10
>    test   $0x405,%r10
>    je     -> is_value
>
> I'm currently working on updating the patch to use the mark word but I need more time. I've noticed
> that the perturbing approach is much more difficult in that case. The best instruction sequence I
> could come up with is something like:
>
>    mov    (%rcx),%r10
>    mov    $0x405,%r11d
>    andn   %r11,%r10,%r10             // r10 = 0 for values, > 0 for others
>    dec    %r10                       // r10 < 0 for values, >= 0 for others
>    sar    $0x3f,%r10
>    add    %r10,%rcx
>
> Or this one:
>
>    movabs $0x7ffffffffffffbfb,%r10   // MAX_LONG - 0x405 + 1
>    mov    $0x405,%r11d
>    and    (%rcx),%r11
>    add    %r10,%r11                  // This will overflow to MIN_LONG for values
>    sar    $0x3f,%r11
>    add    %r11,%rcx
>
> One could also use the set* instructions but we can currently not easily lower these from C2 IR:
>
>    mov   (%rcx), %rdx
>    xorl  %eax, %eax
>    andl  $0x405, %edx
>    cmpq  $0x405, %rdx
>    sete  %al
>    add   %rcx, %rax
>
> I've also checked whether the BMI instructions (other than "andn") would help, but I don't think so.
>
> Looks like this complexity is an argument against the perturbation approach. I'll nevertheless
> implement both approaches and report back once I have a working version.
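
For reference, the branch-free sequences quoted above can be modelled 
with plain 64-bit arithmetic in Java. This is only a sketch of the 
arithmetic, not of the generated code, and the class and method names 
are made up. It assumes, as in the quoted listings, that the mask is 
-1 for values and 0 otherwise and is then added to the address. Note 
that the sete variant as quoted adds +1 rather than -1.

```java
// Hypothetical model of the quoted instruction sequences; names are illustrative.
class PerturbModel {
    static final long PATTERN = 0x405L; // always_locked_pattern from this thread

    // andn / dec / sar sequence.
    static long maskAndn(long mark) {
        long r10 = ~mark & PATTERN; // andn: 0 for values, > 0 for others
        r10 -= 1;                   // dec: -1 for values, >= 0 for others
        return r10 >> 63;           // sar $0x3f: -1 for values, 0 for others
    }

    // and / add-with-overflow / sar sequence.
    static long maskAddOverflow(long mark) {
        long r11 = mark & PATTERN;
        r11 += Long.MAX_VALUE - PATTERN + 1; // 0x7ffffffffffffbfb, wraps to MIN_VALUE for values
        return r11 >> 63;                    // -1 for values, 0 for others
    }

    // sete-based sequence: 1 for values, 0 for others.
    static long flagSete(long mark) {
        return (mark & PATTERN) == PATTERN ? 1L : 0L;
    }

    // Perturbed address: unchanged for non-values, shifted by -1 for values.
    static long perturb(long address, long mark) {
        return address + maskAndn(mark);
    }

    public static void main(String[] args) {
        System.out.println(Long.toHexString(perturb(0x1000L, 0x405L))); // fff
        System.out.println(Long.toHexString(perturb(0x1000L, 0x001L))); // 1000
    }
}
```
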
>
>> I just want to give you a couple of pieces of advice about microbenchmarks.
>> - A double DONT_INLINE is not required; having it only on "cmpEq" is enough, and you'll get less noisy
>> invocation overhead.
>> - When dealing with small operations it's better to switch the benchmark mode to average time in
>> nanoseconds; that lets you quickly notice if something is wrong.
> Okay, thanks. I'll update my benchmark accordingly.
>
>> I am still thinking about how to make a representative set of acmp micros that properly covers the
>> inlined case, which is the most important one.
>> I'll push them when they are ready.
> Sounds good!
>
> I'm using the following test cases for correctness testing of the acmp specific C2 optimizations:
> http://hg.openjdk.java.net/valhalla/valhalla/file/d8a6985f0b99/test/hotspot/jtreg/compiler/valhalla/valuetypes/TestNewAcmp.java
>
> Best regards,
> Tobias