Unsafe.{get,put}-X-Unaligned; Efficient array comparison intrinsics

Tue Feb 24 13:48:06 UTC 2015

Hi,

On 02/24/2015 11:16 AM, Paul Sandoz wrote:

> This looks like a good start.

Good, thanks.

> On Feb 23, 2015, at 7:13 PM, Andrew Haley <aph at redhat.com> wrote:
> 
>> I've been kicking around a few ideas for Unsafe access methods for unaligned access to byte arrays and buffers in order to provide "whatever second-best mechanism the platform offers".  These would provide the base for fast lexicographic array comparisons, etc.
>> 
>> https://bugs.openjdk.java.net/browse/JDK-8044082
>> 
>> If the platform supports unaligned memory accesses, the implementation of {get,put}-X-Unaligned is obvious and trivial for both C1 and C2. It gets interesting when we want to provide efficient unaligned methods on machines with no hardware support.
>> 
>> We could provide compiler intrinsics which do what we need on such machines.  However, I think this wouldn't deliver the best results. From the experiments I've done, the best implementation is to write the access methods in Java and allow HotSpot to optimize them.  While this seemed a bit counter-intuitive to me, it's best because C2 has profile data that it can work on.  In many cases I suspect that data read and written from a byte array will be aligned for their type and C2 will take advantage of this, relegating the misaligned access to an out-of-line code path as appropriate.
> 
> I am all for keeping more code in Java if we can. I don't know enough about assembler-based optimizations to determine if it might be possible to do better on certain CPU architectures.

Me neither, but I have tested this on the architectures I have, and I
suspect that C2 optimization is good enough.  And we'd have to write
assembly code for machines we haven't got; something for the future, I
think.

> One advantage, AFAIU, to intrinsics is they are not subject to the vagaries of inlining thresholds. It's important that the loops operating over the arrays be compiled efficiently, otherwise performance can drop off the cliff if thresholds are reached within the loop. Perhaps these methods are small enough that it is not an issue? And perhaps that is not a sufficient argument to justify the cost of an intrinsic (and we should really be tweaking the inlining mechanism)?

Maybe so.  There are essentially two ways to do this: new compiler
node types which punt everything to the back end (and therefore
require back-end authors to write them) or generic expanders, which is
how many of the existing intrinsics are done.  Generic C2 code would,
I suspect, be worse than writing this in Java because it would be
lacking profile data.

> With that in mind is there any need to intrinsify the new methods at all given those new Java methods can defer to the older ones based on a constant check? Also should that anyway be done for the interpreter?
> 
> 
> private static final boolean IS_UNALIGNED = theUnsafe.unalignedAccess();
> 
> public void putIntUnaligned(Object o, long offset, int x) {
>     if (IS_UNALIGNED || (offset & 3) == 0) {
>         putInt(o, offset, x);
>     } else if (byteOrder == BIG_ENDIAN) {
>         putIntB(o, offset, x);
>     } else {
>         putIntL(o, offset, x);
>     }
> }

Yes.  It certainly could be done like this but I think C1 doesn't do
the optimization to remove the IS_UNALIGNED test, so we'd still want
the C1 builtins.  Perhaps we could do without the C2 builtins but they
cost very little, they save C2 a fair amount of work, and they remove
the vagaries of inlining.  I take your point about the interpreter,
though.
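
For illustration only, here is a minimal sketch of what the byte-wise
fallback halves of such a method might look like.  It works on a plain
byte[] rather than through Unsafe, and the class name and the explicit
ByteOrder parameter are assumptions made for this example; they are not
taken from the patch under discussion.

```java
import java.nio.ByteOrder;

// Hypothetical stand-in for the slow path: stores an int one byte at a
// time, so it never performs a misaligned multi-byte access.
final class UnalignedPutSketch {
    // Big-endian store: most significant byte at the lowest index.
    static void putIntB(byte[] a, int off, int x) {
        a[off]     = (byte) (x >>> 24);
        a[off + 1] = (byte) (x >>> 16);
        a[off + 2] = (byte) (x >>> 8);
        a[off + 3] = (byte) x;
    }

    // Little-endian store: least significant byte at the lowest index.
    static void putIntL(byte[] a, int off, int x) {
        a[off]     = (byte) x;
        a[off + 1] = (byte) (x >>> 8);
        a[off + 2] = (byte) (x >>> 16);
        a[off + 3] = (byte) (x >>> 24);
    }

    // Dispatch on byte order, mirroring the else-branches quoted above.
    static void putIntUnaligned(byte[] a, int off, int x, ByteOrder order) {
        if (order == ByteOrder.BIG_ENDIAN) {
            putIntB(a, off, x);
        } else {
            putIntL(a, off, x);
        }
    }
}
```

The aligned fast path is deliberately left out of the sketch, since a
byte[] cannot model a single wide store.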

> I see you optimized the unaligned getLong by reading two aligned longs and then bit twiddled. It seems harder to optimize the putLong by straddling an aligned putInt with one to three required putByte.

Sure, that's always a possibility.  I have code to do it but it was
all getting rather complicated for my taste.
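
The getLong trick mentioned in the quoted text can be sketched roughly
as follows.  This is an assumed, simplified model rather than the code
from the patch: memory is represented as a long[] of aligned 64-bit
words in little-endian layout, and the class and method names are
hypothetical.

```java
// Sketch: read a 64-bit value at an arbitrary byte offset using only
// aligned 64-bit loads, then combine the two halves with shifts.
final class UnalignedGetLongSketch {
    // words[i] models the aligned word at byte address 8*i; byte k of
    // memory is bits 8*(k%8) .. 8*(k%8)+7 of words[k/8] (little-endian).
    static long getLongUnaligned(long[] words, int byteOffset) {
        int idx = byteOffset >>> 3;          // index of the first aligned word
        int shift = (byteOffset & 7) * 8;    // misalignment in bits
        if (shift == 0) {
            // Naturally aligned: a single load, no second word is touched.
            return words[idx];
        }
        long lo = words[idx] >>> shift;            // upper bytes of the first word
        long hi = words[idx + 1] << (64 - shift);  // lower bytes of the second word
        return lo | hi;
    }
}
```

The aligned case has to be tested separately, both to avoid reading
past the end of the data and because a shift by 64 is a no-op in Java.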

>> Also, these methods have the additional benefit that they are always atomic as long as the data are naturally aligned.
> 
> We should probably document that in general access is not guaranteed to be atomic, and that it is an implementation detail that access is currently atomic when the data are naturally aligned.

I think that's a good idea.  The jcstress tests already come up with a
warning that the implementation is not atomic; atomicity is not
required, but a high-quality implementation will provide it.

>> This does result in rather a lot of code for the methods for all sizes and endiannesses, but none of it will be used on machines with unaligned hardware support except in the interpreter.  (Perhaps the interpreter too could have intrinsics?)
>> 
>> I have changed HeapByteBuffer to use these methods, with a major performance improvement.  I've also provided Unsafe methods to query endianness and alignment support.
> 
> If we expose the endianness query via a new method in unsafe we should reuse that in java.nio.Bits and get rid of the associated static code block.

Sure, I already did that.

Thanks,
Andrew.