optimizing acmp in L-World
Sergey Kuksenko
sergey.kuksenko at oracle.com
Mon Aug 27 18:23:56 UTC 2018
Hi Tobias,
I've checked your benchmark and UseOldAcmp option.
I have to say that according to my measurements -XX:+UseOldAcmp shows
better performance in all cases, even when comparing the same references.
I did manage to construct a reproducible corner case where comparing the
same refs performs better with -XX:-UseOldAcmp. The reason is that the
code generated with -XX:-UseOldAcmp uses a conditional move into the
result register, while -XX:+UseOldAcmp generates branches. So if
-XX:-UseOldAcmp generated branches it would be slower, and if
-XX:+UseOldAcmp did not generate branches it would be faster.
I think a benchmark measuring isolated acmp performance is not relevant
to how acmp is used in applications. By isolated I mean a non-inlined
method with the result in a register. Right now I am trying to make a
non-isolated benchmark where acmp is used as a condition with the
corresponding branching.
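
To illustrate the difference, here is a minimal plain-Java sketch of a
non-isolated use: the comparison feeds a branch inside a loop instead of
being returned as a boolean from a non-inlined method. The class and
method names are hypothetical and this is not the benchmark I am working
on; a real measurement would still need a harness such as JMH around it.

```java
public class AcmpInBranch {
    // Hypothetical kernel: the reference comparison is consumed by a
    // branch, so the JIT sees acmp as a condition, not as a value.
    static int countSame(Object[] left, Object[] right) {
        int same = 0;
        for (int i = 0; i < left.length; i++) {
            if (left[i] == right[i]) { // acmp used as a branch condition
                same++;
            }
        }
        return same;
    }

    public static void main(String[] args) {
        Object a = new Object();
        Object b = new Object();
        Object[] left = {a, b, a, null};
        Object[] right = {a, a, b, null};
        // Only the first and the last positions hold identical references.
        System.out.println(countSame(left, right));
    }
}
```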
I've found yet another benchmarking pitfall here. Typically JMH executes
each sub-benchmark in a separate VM, which means that when measuring
o1==o1 we have only that branch in the profile. If you want to measure
full acmp performance, where full means that all acmp branches are in
the profile, you have to use another JMH option, "-wm BULK", which
provides bulk warmup of all combinations before measurement.
As for markOopDesc::always_locked_pattern = 0x405, just tell me when
it is ready and I'll check it. But in any case the cost of this should
be higher than that of 1 bit in the class word.
On 08/24/2018 07:05 AM, Tobias Hartmann wrote:
> Hi Sergey,
>
> On 22.08.2018 01:21, Sergey Kuksenko wrote:
>> I'd rather suggest in case of branching to replace
>>
>> mov 0x8(%rcx),%r11d
>> mov %r11,%r10
>> shr $0x3,%r10
>> test $0x1,%r10
>> jne...
>>
>> with
>>
>> test $0x8,0x8(%rcx)
>> jne...
>>
>> It will save registers, which may be noticeable in highly inlined code.
> Yes but that's only possible if we are testing against a single bit. That's the case with the
> current klass pointer alignment trick but that will go away. In the future we will use a special bit
> pattern in the mark word (markOopDesc::always_locked_pattern = 0x405) which has multiple bits set.
> To test for a value type, we then need to do something like this:
>
> mov $0x405,%r10d
> and (%rcx),%r10
> cmp $0x405,%r10
> je -> is_value
>
> Or this one:
>
> mov $0xffffffffffffffff,%r10
> xor (%rcx),%r10
> test $0x405,%r10
> je -> is_value
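
For readers who prefer Java to assembly, the two multi-bit tests above
can be modelled roughly as follows. This is only a sketch: the mark word
is represented as a plain long, the class and method names are
hypothetical, and only the 0x405 pattern comes from the mail. The last
two methods model the single-bit case, where shift-then-test and a
direct test of bit 3 agree.

```java
public class MarkWordTest {
    // Bit pattern from the mail (markOopDesc::always_locked_pattern).
    static final long PATTERN = 0x405L;

    // Models "and; cmp; je": mask the word, compare with the pattern.
    static boolean isValueAndCmp(long mark) {
        return (mark & PATTERN) == PATTERN;
    }

    // Models "xor with all ones; test; je": invert the word and check
    // that no pattern bit is left set.
    static boolean isValueXorTest(long mark) {
        return ((mark ^ -1L) & PATTERN) == 0;
    }

    // Single-bit case: shift right by 3 and test the lowest bit ...
    static boolean bit3ShiftTest(int word) {
        return ((word >>> 3) & 1) != 0;
    }

    // ... which is the same as testing bit 3 directly.
    static boolean bit3DirectTest(int word) {
        return (word & 0x8) != 0;
    }

    public static void main(String[] args) {
        long[] marks = {0L, 0x1L, 0x5L, 0x400L, 0x405L, 0x7ffL, -1L};
        for (long m : marks) {
            System.out.println(Long.toHexString(m) + " "
                    + isValueAndCmp(m) + " " + isValueXorTest(m));
        }
    }
}
```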
>
> I'm currently working on updating the patch to use the mark word but I need more time. I've noticed
> that the perturbing approach is much more difficult in that case. The best instruction sequence I
> could come up with is something like:
>
> mov (%rcx),%r10
> mov $0x405,%r11d
> andn %r11,%r10,%r10 // r10 = 0 for values, > 0 for others
> dec %r10 // r10 < 0 for values, >= 0 for others
> sar $0x3f,%r10
> add %r10,%rcx
>
> Or this one:
>
> movabs $0x7ffffffffffffbfb,%r10 // MAX_LONG - 0x405 + 1
> mov $0x405,%r11d
> and (%rcx),%r11
> add %r10,%r11 // This will overflow to MIN_LONG for values
> sar $0x3f,%r11
> add %r11,%rcx
>
> One could also use the set* instructions but we can currently not easily lower these from C2 IR:
>
> mov (%rcx), %rdx
> xorl %eax, %eax
> andl $0x405, %edx
> cmpq $0x405, %rdx
> sete %al
> add %rcx, %rax
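
The arithmetic behind the three branch-free sequences above can be
checked with a small Java model. It is a sketch under assumptions: the
mark word is a plain long, the names are hypothetical, and each method
returns only the value that the assembly adds to the pointer (-1 or 0
for the first two variants, 1 or 0 for the set* variant).

```java
public class PerturbSketch {
    // Bit pattern from the mail (markOopDesc::always_locked_pattern).
    static final long PATTERN = 0x405L;

    // Models "andn; dec; sar $0x3f".
    static long deltaAndn(long mark) {
        long t = ~mark & PATTERN; // 0 for values, > 0 for others
        t -= 1;                   // -1 for values, >= 0 for others
        return t >> 63;           // -1 for values, 0 for others
    }

    // Models "and; add bias; sar $0x3f".
    static long deltaOverflow(long mark) {
        long bias = Long.MAX_VALUE - PATTERN + 1; // 0x7ffffffffffffbfb
        long t = (mark & PATTERN) + bias; // wraps to MIN_VALUE for values
        return t >> 63;                   // -1 for values, 0 for others
    }

    // Models "and; cmp; sete": 1 for values, 0 for others.
    static long deltaSete(long mark) {
        return (mark & PATTERN) == PATTERN ? 1L : 0L;
    }

    public static void main(String[] args) {
        long[] marks = {0x405L, 0x1L};
        for (long m : marks) {
            System.out.println(Long.toHexString(m) + " " + deltaAndn(m)
                    + " " + deltaOverflow(m) + " " + deltaSete(m));
        }
    }
}
```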
>
> I've also checked whether the BMI instructions (other than "andn") would help but I don't think so.
>
> Looks like this complexity is an argument against the perturbation approach. I'll nevertheless
> implement both approaches and report back once I have a working version.
>
>> I just want to give you a couple of pieces of advice about microbenchmarks.
>> - The double DONT_INLINE is not required; having it only on "cmpEq" is enough and you'll get less
>> noisy invocation overhead.
>> - When dealing with small operations it's better to switch the benchmark mode to average time in
>> nanoseconds; that lets you quickly notice if something is wrong.
> Okay, thanks. I'll update my benchmark accordingly.
>
>> I am still thinking about how to make a representative set of acmp micros that properly covers the
>> inlined case as the most important one.
>> I'll push them when they are ready.
> Sounds good!
>
> I'm using the following test cases for correctness testing of the acmp specific C2 optimizations:
> http://hg.openjdk.java.net/valhalla/valhalla/file/d8a6985f0b99/test/hotspot/jtreg/compiler/valhalla/valuetypes/TestNewAcmp.java
>
> Best regards,
> Tobias