Unsafe.{get,put}-X-Unaligned; Efficient array comparison intrinsics

Tue Feb 24 14:47:22 UTC 2015

On Feb 24, 2015, at 2:48 PM, Andrew Haley <aph at redhat.com> wrote:
>> 
>> I am all for keeping more code in Java if we can. I don't know enough about assembler-based optimizations to determine if it might be possible to do better on certain CPU architectures.
> 
> Me either, but I have tested this on the architectures I have, and I
> suspect that C2 optimization is good enough.  And we'd have to write
> assembly code for machines we haven't got; something for the future, I
> think.
> 

Agreed.

>> One advantage, AFAIU, to intrinsics is they are not subject to the vagaries of inlining thresholds. It's important that the loops operating over the arrays to be compiled efficiently otherwise performance can drop off the cliff if thresholds are reached within the loop. Perhaps these methods are small enough it is not an issue? and also perhaps that is not a sufficient argument to justify the cost of an intrinsic (and we should be really tweaking the inlining mechanism)?
> 
> Maybe so.  There are essentially two ways to do this: new compiler
> node types which punt everything to the back end (and therefore
> require back-end authors to write them) or generic expanders, which is
> how many of the existing intrinsics are done.  Generic C2 code would,
> I suspect, be worse than writing this in Java bacause it would be
> lacking profile data.
> 
>> With that in mind is there any need to intrinsify the new methods at all given those new Java methods can defer to the older ones based on a constant check? Also should that anyway be done for the interpreter?
>> 
>> 
>> private static final boolean IS_UNALIGNED = theUnsafe.unalignedAccess();
>> 
>> public void putIntUnaligned(Object o, long offset, int x) { if (IS_UNALIGNED || (offset & 3) == 0) { putInt(o, offset, x); } else if (byteOrder == BIG_ENDIAN) { putIntB(o, offset, x); } else { putIntL(o, offset, x); } }
> 
> Yes.  It certainly could be done like this but I think C1 doesn't do
> the optimization to remove the IS_UNALIGNED test, so we'd still want
> the C1 builtins.

Hmm.... if i run the following code on my Mac:

import java.nio.ByteOrder;

public class Y {

    static final boolean X = ByteOrder.nativeOrder() == ByteOrder.LITTLE_ENDIAN;     // Just get a constant from somewhere...

    public static void main(String[] args) {
        Y y = new Y();

        int s = 0;
        for (int i = 0; i < 1000000; i++) {
            s += y.x(i);
        }
        System.out.println(s);
    }

    public int x(int i) {
        if (X || (i & 3) == 0) {
            return i + i;
        }
        else {
            return Integer.sum(i, i);
        }
    }
}

With the options:

  -XX:TieredStopAtLevel=1 -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly

Then Y.x is compiled to:

  0x000000011155ea80: mov    %eax,-0x14000(%rsp)
  0x000000011155ea87: push   %rbp
  0x000000011155ea88: sub    $0x30,%rsp         ;*getstatic X
                                                ; - Y::x at 0 (line 43)

  0x000000011155ea8c: mov    %rdx,%rax
  0x000000011155ea8f: add    %edx,%eax
  0x000000011155ea91: add    $0x30,%rsp
  0x000000011155ea95: pop    %rbp
  0x000000011155ea96: test   %eax,-0x62ca9c(%rip)        # 0x0000000110f32000
                                                ;   {poll_return}
  0x000000011155ea9c: retq   

For tiered compilation i do see code generated with a test, jump and de-opt:

  0x00000001112fe020: mov    %eax,-0x14000(%rsp)
  0x00000001112fe027: push   %rbp
  0x00000001112fe028: sub    $0x30,%rsp
  0x00000001112fe02c: mov    $0x125633860,%rax  ;   {metadata(method data for {method} {0x0000000125633508} 'x' '(I)I' in 'Y')}
  0x00000001112fe036: mov    0xdc(%rax),%edi
  0x00000001112fe03c: add    $0x8,%edi
  0x00000001112fe03f: mov    %edi,0xdc(%rax)
  0x00000001112fe045: mov    $0x125633508,%rax  ;   {metadata({method} {0x0000000125633508} 'x' '(I)I' in 'Y')}
  0x00000001112fe04f: and    $0x1ff8,%edi
  0x00000001112fe055: cmp    $0x0,%edi
  0x00000001112fe058: je     0x00000001112fe07f  ;*getstatic X
                                                ; - Y::x at 0 (line 43)

  0x00000001112fe05e: mov    $0x125633860,%rax  ;   {metadata(method data for {method} {0x0000000125633508} 'x' '(I)I' in 'Y')}
  0x00000001112fe068: incl   0x118(%rax)        ;*ifne
                                                ; - Y::x at 3 (line 43)

  0x00000001112fe06e: mov    %rdx,%rax
  0x00000001112fe071: add    %edx,%eax
  0x00000001112fe073: add    $0x30,%rsp
  0x00000001112fe077: pop    %rbp
  0x00000001112fe078: test   %eax,-0x62f07e(%rip)        # 0x0000000110ccf000
                                                ;   {poll_return}
  0x00000001112fe07e: retq   
  0x00000001112fe07f: mov    %rax,0x8(%rsp)
  0x00000001112fe084: movq   $0xffffffffffffffff,(%rsp)
  0x00000001112fe08c: callq  0x0000000110e63c60  ; OopMap{rsi=Oop off=145}
                                                ;*synchronization entry
                                                ; - Y::x at -1 (line 43)
                                                ;   {runtime_call}

But the method gets compiled later on to the shape of the former code e.g.:

    471   16       3       Y::x (22 bytes)
    472   17       4       Y::x (22 bytes)
    472   16       3       Y::x (22 bytes)   made not entrant

>  Perhaps we could do without the C2 builtins but they
> cost very little, they save C2 a fair amount of work, and they remove
> the vagaries of inlining.

Tis true, the changes are small.

>  I take your point about the interpreter,
> though.
> 
>> I see you optimized the unaligned getLong by reading two aligned longs and then bit twiddled. It seems harder to optimize the putLong by straddling an aligned putInt with one to three required putByte.
> 
> Sure, that's always a possibility.  I have code to do it but it was
> all getting rather complicated for my taste.

I thought it might.

> 
>>> Also, these methods have the additional benefit that they are always atomic as long as the data are naturally aligned.
>> 
>> We should probably document that in general access is not guaranteed to be atomic and an implementation detail that it currently is when naturally so.
> 
> I think that's a good idea.  The jcstress tests already come up with a
> warning that the implementation is not atomic; this is not required,
> but a high-quality implementation will be.
> 
>>> This does result in rather a lot of code for the methods for all sizes and endiannesses, but none of it will be used on machines with unaligned hardware support except in the interpreter.  (Perhaps the interpreter too could have intrinsics?)
>>> 
>>> I have changed HeapByteBuffer to use these methods, with a major performance improvement.  I've also provided Unsafe methods to query endianness and alignment support.
>> 
>> If we expose the endianness query via a new method in unsafe we should reuse that in java.nio.Bits and get rid of the associated static code block.
> 
> Sure, I already did that.
> 

Locally i guess? (just in case i missed something in the current webrev).

Paul.