optimizing acmp in L-World

Fri Aug 24 14:05:26 UTC 2018

Hi Sergey,

On 22.08.2018 01:21, Sergey Kuksenko wrote:
> I'd rather suggest in case of branching to replace
> 
>  mov    0x8(%rcx),%r11d
>  mov    %r11,%r10
>  shr    $0x3,%r10
>  test   $0x1,%r10
>  jne...
> 
> with
> 
>  test    0x8(%rcx),$0x8
>  jne...
> 
> It will save registers which may be visible on highly inlined code.

Yes, but that's only possible if we are testing against a single bit. That's the case with the
current klass pointer alignment trick, but that will go away. In the future, we will use a special
bit pattern in the mark word (markOopDesc::always_locked_pattern = 0x405), which has multiple bits
set. To test for a value type, we then need to do something like this:

  mov    $0x405,%r10d
  and    (%rcx),%r10
  cmp    $0x405,%r10
  je     -> is_value

Or this one:

  mov    $0xffffffffffffffff,%r10
  xor    (%rcx),%r10
  test   $0x405,%r10
  je     -> is_value
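
To make the intent of the two sequences explicit, here is a rough Java model of them. This is only
a sketch, not HotSpot code: the class and method names are hypothetical, the mark word is modeled
as a plain long, and only the 0x405 pattern is taken from above. Both variants should agree for
every mark word:

```java
final class MarkWordCheck {
    // Pattern from the mail (markOopDesc::always_locked_pattern); all other names are hypothetical.
    static final long ALWAYS_LOCKED_PATTERN = 0x405L;

    // Models: mov $0x405,%r10d / and (%rcx),%r10 / cmp $0x405,%r10 / je
    static boolean isValueMaskCompare(long mark) {
        return (mark & ALWAYS_LOCKED_PATTERN) == ALWAYS_LOCKED_PATTERN;
    }

    // Models: mov $-1,%r10 / xor (%rcx),%r10 / test $0x405,%r10 / je
    // After inverting the mark word, all pattern bits must be zero.
    static boolean isValueInvertTest(long mark) {
        return ((mark ^ -1L) & ALWAYS_LOCKED_PATTERN) == 0;
    }

    public static void main(String[] args) {
        // Sample mark words chosen for illustration only.
        long[] marks = {0x405L, 0x5L, 0x401L, 0x7ffL};
        for (long m : marks) {
            System.out.printf("mark=0x%x maskCompare=%b invertTest=%b%n",
                              m, isValueMaskCompare(m), isValueInvertTest(m));
        }
    }
}
```

A mark word like 0x5 has two of the three pattern bits set, which is why a single-bit test is no
longer sufficient.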

I'm currently working on updating the patch to use the mark word, but I need more time. I've
noticed that the perturbation approach is much harder to implement in that case. The best
instruction sequence I could come up with is something like:

  mov    (%rcx),%r10
  mov    $0x405,%r11d
  andn   %r11,%r10,%r10             // r10 = 0 for values, > 0 for others
  dec    %r10                       // r10 < 0 for values, >= 0 for others
  sar    $0x3f,%r10
  add    %r10,%rcx

Or this one:

  movabs $0x7ffffffffffffbfb,%r10   // MAX_LONG - 0x405 + 1
  mov    $0x405,%r11d
  and    (%rcx),%r11
  add    %r10,%r11                  // This will overflow to MIN_LONG for values
  sar    $0x3f,%r11
  add    %r11,%rcx
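
The arithmetic in both branch-free sequences can be checked with a small Java model. Again, this is
only a sketch under assumptions: the names are hypothetical, the mark word and the oop address are
modeled as plain longs, and I assume the goal is to subtract 1 from the address for values only, so
that a later pointer comparison fails.

```java
final class AcmpPerturb {
    // Pattern from the mail; all other names are hypothetical.
    static final long PATTERN = 0x405L;

    // Models: andn / dec / sar $0x3f / add
    static long perturbAndn(long mark, long addr) {
        long t = ~mark & PATTERN; // andn: 0 for values, > 0 for others
        t -= 1;                   // dec:  -1 for values, >= 0 for others
        t >>= 63;                 // sar:  -1 for values, 0 for others
        return addr + t;
    }

    // Models: movabs / and / add (signed overflow) / sar $0x3f / add
    static long perturbOverflow(long mark, long addr) {
        long bias = Long.MAX_VALUE - PATTERN + 1; // 0x7ffffffffffffbfb
        long t = mark & PATTERN;  // equals PATTERN only for values
        t += bias;                // wraps to Long.MIN_VALUE only for values
        t >>= 63;                 // -1 for values, 0 for others
        return addr + t;
    }

    public static void main(String[] args) {
        long addr = 0x1000L; // illustrative address
        System.out.printf("value:     0x%x 0x%x%n",
                          perturbAndn(0x405L, addr), perturbOverflow(0x405L, addr));
        System.out.printf("non-value: 0x%x 0x%x%n",
                          perturbAndn(0x5L, addr), perturbOverflow(0x5L, addr));
    }
}
```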

One could also use the set* instructions, but currently we cannot easily emit these from C2 IR:

  mov   (%rcx), %rdx
  xorl  %eax, %eax
  andl  $0x405, %edx
  cmpq  $0x405, %rdx
  sete  %al
  add   %rcx, %rax
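
In Java terms, the set* variant boils down to adding a 0/1 flag to the address. This is a
hypothetical model for illustration only, with the mark word and address as plain longs; note that
it perturbs by +1 rather than -1:

```java
final class AcmpSetcc {
    // Pattern from the mail; all other names are hypothetical.
    static final long PATTERN = 0x405L;

    // Models: and $0x405 / cmp $0x405 / sete %al / add %rcx,%rax
    static long perturbSete(long mark, long addr) {
        long flag = ((mark & PATTERN) == PATTERN) ? 1L : 0L; // sete: 1 for values, 0 for others
        return addr + flag;
    }

    public static void main(String[] args) {
        long addr = 0x1000L; // illustrative address
        System.out.printf("value: 0x%x, non-value: 0x%x%n",
                          perturbSete(0x405L, addr), perturbSete(0x5L, addr));
    }
}
```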

I've also checked whether the BMI instructions (other than "andn") would help, but I don't think
they do.

It looks like this complexity is an argument against the perturbation approach. Nevertheless, I'll
implement both approaches and report back once I have a working version.

> I just want to give you couple advices about microbenchmarks.
> - double DON_INLINE is not required, having only on "cmpEq" is enough and you'll get less noisy
> invocations overhead.
> - dealing with small operations it's better to switch to benchmark mode to average time in
> nanoseconds - that allows to quickly notice if something wrong.

Okay, thanks. I'll update my benchmark accordingly.

> I am still thinking how to make representative set of acmp micros, to properly cover inlined case as
> the most important.
> I'll push them when they will be ready.

Sounds good!

I'm using the following test cases for correctness testing of the acmp-specific C2 optimizations:
http://hg.openjdk.java.net/valhalla/valhalla/file/d8a6985f0b99/test/hotspot/jtreg/compiler/valhalla/valuetypes/TestNewAcmp.java

Best regards,
Tobias