Unsafe.{get,put}-X-Unaligned; Efficient array comparison intrinsics

Tue Feb 24 23:18:23 UTC 2015

On Feb 23, 2015, at 10:13 AM, Andrew Haley <aph at redhat.com> wrote:
> 
> I've been kicking around a few ideas for Unsafe access methods for
> unaligned access to byte arrays and buffers in order to provide
> "whatever second-best mechanism the platform offers".  These would
> provide the base for fast lexicographic array comparisons, etc.
> 
> https://bugs.openjdk.java.net/browse/JDK-8044082

Bravo.  This will help Panama's native data access also.

> If the platform supports unaligned memory accesses, the implementation
> of {get,put}-X-Unaligned is obvious and trivial for both C1 and C2.
> It gets interesting when we want to provide efficient unaligned
> methods on machines with no hardware support.

The HotSpot JVM has a repertoire of internal C++ methods for
performing unaligned access.  They will sometimes be better
tuned than the byte-marshalling idiom you propose in Unsafe.
An optimizing JIT will not reliably recognize and compress the
byte-marshalling idiom back down to a hardware intrinsic.
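
For concreteness, here is a minimal sketch of what "byte-marshalling
idiom" means above.  The class and method names (ByteMarshal, getIntLE,
putIntLE) are hypothetical and are not taken from the webrev; the
sketch works on a plain byte[] rather than through Unsafe.

```java
// Hypothetical illustration only; not the code from the proposal.
final class ByteMarshal {
    // Assemble a little-endian int from four single-byte loads.
    // Works at any offset because no load is wider than one byte.
    static int getIntLE(byte[] a, int off) {
        return  (a[off]     & 0xFF)
             | ((a[off + 1] & 0xFF) << 8)
             | ((a[off + 2] & 0xFF) << 16)
             | ((a[off + 3] & 0xFF) << 24);
    }

    // Scatter an int into four single-byte stores, least significant first.
    static void putIntLE(byte[] a, int off, int v) {
        a[off]     = (byte) v;
        a[off + 1] = (byte) (v >>> 8);
        a[off + 2] = (byte) (v >>> 16);
        a[off + 3] = (byte) (v >>> 24);
    }
}
```

The concern is that a JIT sees four loads, three shifts and three ORs
here, and has to pattern-match all of that to get back to one
instruction on hardware that could have done it directly.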

And it is not just a question of deferring to the C++ version
of the same byte-marshalling code.  Some platforms choose
to test p&7 for several cases, not just zero and non-zero.
Also, although it is not common, some chips support misaligned
memory operations as distinct instructions. MIPS and Alpha
come to mind, and SPARC has VIS instructions to manage
vector loops on misaligned data.  A rare example in x86 is
MOVDQA vs. MOVDQU, although alignment should be
ignored for nearly all x86 ops.

My bottom line:  I think we should use the internal HotSpot
API bytes.hpp by surfacing relevant parts of it up into Unsafe.
(One thing feels wrong about bytes.hpp:  It insists that big-endian
is the norm for unaligned access.  This is simply a legacy of its
origin, for classfile and bytecode parsing.)

Looking at the various platform code for bytes.hpp, I suppose
one could argue that there are just two cases of interest:
A single instruction (x86/arm) and a complicated 4-way
switch (sparc/ppc).  Those could be represented in Java
code as you suggest, at the cost of some duplication.
For special platforms which use a third idiom, the intrinsics
could be adjusted.
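
A rough Java sketch of the 4-way switch, under some loud assumptions:
the array index stands in for the machine address p, ByteBuffer
accessors stand in for naturally aligned hardware loads, and the names
(UnalignedSwitch, getLongLE) are hypothetical.  It is not a
transcription of any platform's bytes.hpp.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration of dispatching on the low bits of p.
final class UnalignedSwitch {
    // Read a little-endian long at index p using the widest access
    // that is naturally aligned for p.
    static long getLongLE(byte[] a, int p) {
        ByteBuffer b = ByteBuffer.wrap(a).order(ByteOrder.LITTLE_ENDIAN);
        if ((p & 7) == 0) {
            // 8-byte aligned: one long access.
            return b.getLong(p);
        }
        if ((p & 3) == 0) {
            // 4-byte aligned: two int accesses.
            return (b.getInt(p) & 0xFFFFFFFFL)
                 | ((long) b.getInt(p + 4) << 32);
        }
        if ((p & 1) == 0) {
            // 2-byte aligned: four short accesses.
            return  (b.getShort(p)     & 0xFFFFL)
                 | ((b.getShort(p + 2) & 0xFFFFL) << 16)
                 | ((b.getShort(p + 4) & 0xFFFFL) << 32)
                 | ((long) b.getShort(p + 6) << 48);
        }
        // Odd address: eight byte accesses, most significant byte first.
        long v = 0;
        for (int i = 7; i >= 0; i--) {
            v = (v << 8) | (b.get(p + i) & 0xFFL);
        }
        return v;
    }
}
```

Each branch only issues accesses that are aligned for their own width,
so a 2-byte or 4-byte element inside the long is never split across
two loads.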

Distinct big and little endian access modes are also available
on some machines, such as SPARC (ASI_LITTLE, etc.).
But I do *not* believe that this capability should be surfaced
as distinct intrinsics in Unsafe.  Many CPUs and source bases
deal with endian-matching simply by adjoining a separate
"byte swap" operation to the memory access.  In Java,
this is already an intrinsic, {Long,Integer,...}.reverseBytes.
And SPARC already optimizes some compositions of
byte reversal and memory access to its special ASI_LITTLE
instructions.

My second bottom line:  Don't multiply endian options.
Use reverseBytes calls instead.

Suggestion:  Have getIntUnaligned take an optional boolean
parameter, which is "bigEndian" (since that's relatively exceptional).
An extra line of code can conditionally swap the bytes, taking
both the flag and the platform into account.  Default value of the
boolean is whatever is natural to the platform.  If you specifically
want Java's big-endian order, you specify true, etc.
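
A sketch of how that might look, with assumptions called out: Java has
no optional parameters, so the default is modeled as an overload; the
class name EndianFlag and the helper getIntNative are hypothetical; and
a ByteBuffer in native order stands in for the platform's unaligned
load.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration of a single accessor plus an endian flag.
final class EndianFlag {
    static final boolean NATIVE_BIG =
        ByteOrder.nativeOrder() == ByteOrder.BIG_ENDIAN;

    // Stand-in for the native-order unaligned load.
    static int getIntNative(byte[] a, int off) {
        return ByteBuffer.wrap(a).order(ByteOrder.nativeOrder()).getInt(off);
    }

    // "Default" form: whatever byte order is natural to the platform.
    static int getIntUnaligned(byte[] a, int off) {
        return getIntNative(a, off);
    }

    // Explicit form: the extra line swaps bytes only when the requested
    // order differs from the platform's order.
    static int getIntUnaligned(byte[] a, int off, boolean bigEndian) {
        int v = getIntNative(a, off);
        return (bigEndian == NATIVE_BIG) ? v : Integer.reverseBytes(v);
    }
}
```

With bytes 12 34 56 78 at the given offset, passing true yields
0x12345678 and passing false yields 0x78563412 on any platform; only
the two-argument form varies with the host.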

> We could provide compiler intrinsics which do what we need on such
> machines.  However, I think this wouldn't deliver the best results.
> From the experiments I've done, the best implementation is to write
> the access methods in Java and allow HotSpot to optimize them.  While
> this seemed a bit counter-intuitive to me, it's best because C2 has
> profile data that it can work on.  In many cases I suspect that data
> read and written from a byte array will be aligned for their type and
> C2 will take advantage of this, relegating the misaligned access to an
> out-of-line code path as appropriate.  Also, these methods have the
> additional benefit that they are always atomic as long as the data are
> naturally aligned.

As coded in your proposal, they do not preserve partial atomicity.
That's (probably) why platforms distinguish more cases among
p&7 values.

Partial atomicity is important when you are vectorizing a loop
over an array of multi-byte elements (such as Java chars).
You don't want racy reads to pick up torn array elements.

(BTW, a TO-DO item for the optimizing JIT is to track low-bits of
pointers and other values; see JDK-8001436 for nice details.
This would allow the JIT to eliminate some tests of p&7.)

> This does result in rather a lot of code for the methods for all sizes
> and endiannesses, but none of it will be used on machines with
> unaligned hardware support except in the interpreter.  (Perhaps the
> interpreter too could have intrinsics?)

Well, it does for the aligned guys.  Adding unaligned versions doubles
the set (once again), but I don't see any other effective way of getting
to the platform-specific logic (bytes.hpp) mentioned above.

> I have changed HeapByteBuffer to use these methods, with a major
> performance improvement.  I've also provided Unsafe methods to query
> endianness and alignment support.
> 
> Webrevs at http://cr.openjdk.java.net/~aph/unaligned.hotspot.1/
> http://cr.openjdk.java.net/~aph/unaligned.jdk.1/