optimizing acmp in L-World
Tobias Hartmann
tobias.hartmann at oracle.com
Fri Aug 24 14:05:26 UTC 2018
Hi Sergey,
On 22.08.2018 01:21, Sergey Kuksenko wrote:
> I'd rather suggest in case of branching to replace
>
> mov 0x8(%rcx),%r11d
> mov %r11,%r10
> shr $0x3,%r10
> test $0x1,%r10
> jne...
>
> with
>
> test $0x8,0x8(%rcx)
> jne...
>
> It will save registers, which may be noticeable in highly inlined code.
Yes, but that's only possible if we are testing a single bit. That's the case with the current
klass pointer alignment trick, but that will go away. In the future, we will use a special bit
pattern in the mark word (markOopDesc::always_locked_pattern = 0x405), which has multiple bits set.
To test for a value type, we then need to do something like this:
mov $0x405,%r10d
and (%rcx),%r10
cmp $0x405,%r10
je -> is_value
Or this one:
mov $0xffffffffffffffff,%r10
xor (%rcx),%r10
test $0x405,%r10
je -> is_value
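For readers who prefer to check the bit logic outside of assembly, here is a minimal Java sketch of the two tests above. It only models the arithmetic on a 64-bit mark word value; the class and method names are hypothetical and not part of the patch, and the sample mark values in the usage note are made up.

```java
// Sketch only: models the two mark-word tests with plain 64-bit arithmetic.
final class ValueMarkTest {
    // Bit pattern quoted in the mail (markOopDesc::always_locked_pattern).
    static final long ALWAYS_LOCKED_PATTERN = 0x405L;

    // First sequence (mov / and / cmp / je): all pattern bits must be set.
    static boolean isValueAndCmp(long mark) {
        return (mark & ALWAYS_LOCKED_PATTERN) == ALWAYS_LOCKED_PATTERN;
    }

    // Second sequence (xor with all ones / test / je): after inverting the
    // mark word, none of the pattern bits may remain set.
    static boolean isValueXorTest(long mark) {
        long inverted = mark ^ 0xffffffffffffffffL;
        return (inverted & ALWAYS_LOCKED_PATTERN) == 0;
    }

    // A single-bit test (here bit 2, chosen for illustration) accepts mark
    // words that carry only part of the pattern.
    static boolean singleBitTest(long mark) {
        return (mark & 0x4L) != 0;
    }
}
```

With a hypothetical mark word of 0x5 (bits 0 and 2 set, bit 10 clear), singleBitTest returns true while both pattern tests return false, which is why a single `test` against memory is not sufficient once the pattern has several bits.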
I'm currently working on updating the patch to use the mark word, but I need more time. I've noticed
that the perturbing approach is much more difficult in that case. The best instruction sequence I
could come up with is something like:
mov (%rcx),%r10
mov $0x405,%r11d
andn %r11,%r10,%r10 // r10 = 0 for values, > 0 for others
dec %r10 // r10 < 0 for values, >= 0 for others
sar $0x3f,%r10
add %r10,%rcx
Or this one:
movabs $0x7ffffffffffffbfb,%r10 // MAX_LONG - 0x405 + 1
mov $0x405,%r11d
and (%rcx),%r11
add %r10,%r11 // This will overflow to MIN_LONG for values
sar $0x3f,%r11
add %r11,%rcx
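The two branch-free sequences above can be modeled in Java to check the arithmetic. This is a sketch under assumptions: the names are hypothetical, `mark` stands for the 64-bit word loaded from (%rcx), and `address` stands for the pointer held in %rcx. Both variants compute a mask that is -1 for values and 0 otherwise, and then add it to the pointer.

```java
// Sketch only: models the andn/dec/sar and and/add/sar sequences.
final class PerturbSketch {
    static final long PATTERN = 0x405L;

    // andn / dec / sar variant.
    static long maskAndn(long mark) {
        long r10 = ~mark & PATTERN; // andn: 0 only if all pattern bits are set
        r10 -= 1;                   // dec: -1 for values, >= 0 for others
        return r10 >> 63;           // sar $0x3f: -1 for values, 0 for others
    }

    // and / add / sar variant, relying on signed overflow.
    static long maskOverflow(long mark) {
        long bias = Long.MAX_VALUE - PATTERN + 1; // 0x7ffffffffffffbfb
        long r11 = mark & PATTERN;                // at most 0x405
        r11 += bias;                              // wraps to Long.MIN_VALUE only for values
        return r11 >> 63;                         // -1 for values, 0 for others
    }

    // Final add: the pointer is decremented by one for values only.
    static long perturb(long address, long mark) {
        return address + maskAndn(mark);
    }
}
```

For example, with a made-up address of 0x1000, a mark word of 0x405 yields 0xfff, while a mark word of 0x1 leaves the address unchanged.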
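One could also use the set* instructions, but we currently cannot easily generate them from C2 IR:
mov (%rcx), %rdx // load the mark word
xorl %eax, %eax // rax = 0
andl $0x405, %edx // keep only the pattern bits
cmpq $0x405, %rdx
sete %al // rax = 1 for values, 0 for others
add %rcx, %rax // rax = rcx + 1 for values, rcx for others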
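As a rough model of the set* variant, the following Java sketch mirrors the sequence step by step. The names are hypothetical, and `address` and `mark` again stand for the pointer in %rcx and the word loaded from it. Note that this variant perturbs the pointer by +1 rather than -1.

```java
// Sketch only: models the xorl / andl / cmpq / sete / add sequence.
final class SetccSketch {
    static final long PATTERN = 0x405L;

    static long perturb(long address, long mark) {
        long rdx = mark & PATTERN;             // andl $0x405, %edx
        long rax = (rdx == PATTERN) ? 1L : 0L; // xorl + cmpq + sete
        return address + rax;                  // add %rcx, %rax
    }
}
```

With a made-up address of 0x1000, a mark word of 0x405 gives 0x1001, and a mark word of 0x5 gives 0x1000.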
I've also checked whether the BMI instructions (other than "andn") would help, but I don't think so.
Looks like this complexity is an argument against the perturbation approach. I'll nevertheless
implement both approaches and report back once I have a working version.
> I just want to give you a couple of pieces of advice about microbenchmarks.
> - a double DONT_INLINE is not required; having it only on "cmpEq" is enough and you'll get less
> noisy invocation overhead.
> - when dealing with small operations, it's better to switch the benchmark mode to average time in
> nanoseconds; that allows you to quickly notice if something is wrong.
Okay, thanks. I'll update my benchmark accordingly.
> I am still thinking about how to make a representative set of acmp micros that properly covers
> the inlined case, as it is the most important.
> I'll push them when they are ready.
Sounds good!
I'm using the following test cases for correctness testing of the acmp-specific C2 optimizations:
http://hg.openjdk.java.net/valhalla/valhalla/file/d8a6985f0b99/test/hotspot/jtreg/compiler/valhalla/valuetypes/TestNewAcmp.java
Best regards,
Tobias