[foreign-memaccess] musing on the memory access API

Fri Jan 8 11:18:01 UTC 2021

Hi Lee

<snip>

> This relates to an issue that I'm concerned about, but perhaps only
> because I don't fully understand why "memory access var handles *need*
> exact type information to be *fast*". Or is this just a convention?

With VarHandles, there are a couple of things you have to do for them to
be fast (by fast I mean for them to "inline" like other methods do):

1) the VarHandle should be stored in a static final field (i.e. a VM
constant)
2) the VarHandle invocation should respect the types with which the VH 
was created

In practical terms, when you do a VH::get, it's as if the VM spun a
new method just for that on the fly (of course it's more complex than
that); the spun method has a signature. If the VM has guarantees that
the parameters you are calling this method with are the same as those of
the spun method, then a bunch of checks are removed (the dreaded
asType() adaptations), and you are in performance nirvana :-)
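
To make the two rules concrete, here is a minimal sketch. It uses the
stock JDK MethodHandles.byteArrayViewVarHandle (a var handle over a
plain byte[], available since Java 9) as a stand-in for a memory access
var handle; the class and method names are made up for illustration.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

final class FastVarHandle {
    // Rule 1: keep the VarHandle in a static final field, so the JIT can
    // treat it as a constant and inline the access.
    static final VarHandle INT_LE =
            MethodHandles.byteArrayViewVarHandle(int[].class, ByteOrder.LITTLE_ENDIAN);

    // Rule 2: call it with exactly the types it was created with:
    // coordinates (byte[], int) and an int result. The (int) cast fixes the
    // call site's return type; without it the call site would return Object,
    // forcing an asType() adaptation (boxing) on every call.
    static int readInt(byte[] bytes, int offset) {
        return (int) INT_LE.get(bytes, offset);
    }

    public static void main(String[] args) {
        byte[] bytes = {1, 0, 0, 0};
        System.out.println(readInt(bytes, 0)); // prints 1
    }
}
```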

> At the CPU level, it ingests chunks of bytes and then extracts whatever
> type is specified by the assembly instruction, whether it be a 32-bit
> integer (signed or unsigned), short, long, float, double or whatever.
> I would like the ability to create a MemorySegment allocated as bytes,
> load it with longs (for speed) and then read it with a MemoryLayout that
> describes some complex multi-type data structure (because I know what
> the bytes represent!). In other words, MemorySegment should act like a
> blob of bytes, and reading and writing from it should behave like a
> /C union/ overlaid with a /C struct/. I realize this violates the Java
> principles of strict typing, but if we really are interested in speed,
> we need this ability (even if you force us to declare it as /unsafe/).
> I'm sure you have thought about this, but I'm not yet sure if this is a
> reality in the code.
>
> This already appears in Java in a few very limited cases. E.g., I can
> view a /double/ as raw bits, perform operations on the raw bits as a
> long, and convert it back to a double. We have some math routines that
> take advantage of this. What is unfortunate is the lack of being able to
> convert a double (or long, etc.) into bytes and back at an intrinsic
> level, which should be very fast.
>
> I looked at your spliterator and it is not clear how I would use it to
> view the same blob of bytes with two different layouts. I must be
> missing something :(.

What you describe already flows naturally from the design of the API -
since a memory segment has no concept of type or endianness, it is up to
the client doing the dereference to pick the right type and endianness
primitives. Consider this:

MemorySegment ms = MemorySegment.allocateNative(8);
byte b = MemoryAccess.getByteAtOffset(ms, 0);
short s = MemoryAccess.getShortAtOffset(ms, 0);
float f = MemoryAccess.getFloatAtOffset(ms, 0, ByteOrder.BIG_ENDIAN);
...

In other words, when you dereference a segment, the accessor (or var
handle) being used determines (a) which carrier type is used for the
dereference and (b) which endianness is applied.
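
Lee's "blob of bytes" use case can be sketched with the same stock JDK
byte-array view handles (again a stand-in, not the memory access API
itself): write the blob with a single long, then read the same bytes
back as short, int and float with different byte orders. The handle,
not the array, carries the type and the endianness; Overlay is a
made-up name.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

final class Overlay {
    // One handle per (carrier type, endianness) pair; the byte[] has neither.
    static final VarHandle LONG_LE =
            MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.LITTLE_ENDIAN);
    static final VarHandle INT_LE =
            MethodHandles.byteArrayViewVarHandle(int[].class, ByteOrder.LITTLE_ENDIAN);
    static final VarHandle INT_BE =
            MethodHandles.byteArrayViewVarHandle(int[].class, ByteOrder.BIG_ENDIAN);
    static final VarHandle SHORT_LE =
            MethodHandles.byteArrayViewVarHandle(short[].class, ByteOrder.LITTLE_ENDIAN);
    static final VarHandle FLOAT_LE =
            MethodHandles.byteArrayViewVarHandle(float[].class, ByteOrder.LITTLE_ENDIAN);

    public static void main(String[] args) {
        byte[] blob = new byte[8];
        // Fill the blob with one long write: bytes become 01 02 03 04 05 06 07 08.
        LONG_LE.set(blob, 0, 0x0807060504030201L);

        // Read the same bytes back with different types and byte orders.
        byte b = blob[0];                          // 0x01
        short s = (short) SHORT_LE.get(blob, 0);   // 0x0201
        int iBe = (int) INT_BE.get(blob, 0);       // 0x01020304
        int iLe = (int) INT_LE.get(blob, 4);       // 0x08070605
        float f = (float) FLOAT_LE.get(blob, 0);   // same 32 bits as INT_LE at 0

        System.out.printf("%x %x %x %x%n", b, s, iBe, iLe); // prints 1 201 1020304 8070605
        System.out.println(Float.floatToRawIntBits(f) == (int) INT_LE.get(blob, 0)); // prints true
    }
}
```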

To simplify things (and avoid mistakes), the API comes with the concept 
of "memory layout" - that is, you can define, programmatically, the 
layout of the memory you want to dereference - e.g.

var layout = MemoryLayout.ofStruct(
      MemoryLayouts.JAVA_INT.withName("a"),
      MemoryLayout.ofPaddingBits(16),
      MemoryLayouts.JAVA_SHORT.withName("b"));

And then get a VarHandle pointing at "b":

VarHandle bHandle = layout.varHandle(short.class,
        PathElement.groupElement("b"));

This will create a var handle with the right offset (and dynamic stride,
if the accessed element is inside an array) and the correct
endianness/type. This avoids the need for manual offset computation
and/or remembering which endianness should be used to access a given
field. In other words, access becomes more declarative - you just create
a bunch of VarHandles, one for each layout element you want to access.
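
To see what the layout-derived handle saves you from, here is the same
access done by hand on a plain byte[], hard-coding the offsets the
layout above implies. This is only an illustration, not what
layout.varHandle generates internally; StructOffsets and its accessors
are hypothetical names, and native byte order is assumed to match the
JAVA_INT/JAVA_SHORT layouts.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

final class StructOffsets {
    // Hand-computed equivalent of the struct layout:
    //   int "a" at byte 0, 16 bits (2 bytes) of padding, short "b" at byte 6.
    static final int A_OFFSET = 0;
    static final int B_OFFSET = 6;
    static final int STRUCT_SIZE = 8; // 4 + 2 + 2 bytes

    // Assumption: native order, as for the JAVA_INT/JAVA_SHORT layouts.
    static final VarHandle INT =
            MethodHandles.byteArrayViewVarHandle(int[].class, ByteOrder.nativeOrder());
    static final VarHandle SHORT =
            MethodHandles.byteArrayViewVarHandle(short[].class, ByteOrder.nativeOrder());

    // For the i-th struct in an array, the offset is i * STRUCT_SIZE
    // (the dynamic stride) plus the field's fixed offset.
    static int getA(byte[] mem, int i) {
        return (int) INT.get(mem, i * STRUCT_SIZE + A_OFFSET);
    }

    static void setA(byte[] mem, int i, int value) {
        INT.set(mem, i * STRUCT_SIZE + A_OFFSET, value);
    }

    static short getB(byte[] mem, int i) {
        return (short) SHORT.get(mem, i * STRUCT_SIZE + B_OFFSET);
    }

    static void setB(byte[] mem, int i, short value) {
        SHORT.set(mem, i * STRUCT_SIZE + B_OFFSET, value);
    }

    public static void main(String[] args) {
        byte[] mem = new byte[3 * STRUCT_SIZE]; // room for three structs
        setA(mem, 2, 42);
        setB(mem, 2, (short) 7);
        System.out.println(getA(mem, 2) + " " + getB(mem, 2)); // prints 42 7
    }
}
```

Every offset here is a hand-maintained constant; getting one wrong
silently reads the wrong bytes, which is exactly the class of mistake
the layout API is meant to rule out.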

>
>     * I believe/hope that the main gripes you had with the byte buffer API
>     (which seem to be endianness related) are gone with the memory access
>     API. There we made the decision of leaving endianness outside of the
>     MemorySegment - i.e. endianness is a property of the VarHandle doing
>     the access, not a property of the segment per se. I believe this
>     decision paid off (e.g. our segments are completely orthogonal w.r.t.
>     layout decisions), and it avoids a lot of confusion as to "what's the
>     default" etc.
>
>
> I have a number of gripes about the ByteBuffer.
>
> 1. The most serious issue is the handling of endianness.
> First, the default is BigEndian, which today makes no sense as nearly all
> CPUs are LE. And, some byte compression algorithms only work with a given
> endianness. Perhaps I could live with this, but if I am interested in
> performance I would like to match my CPU, so I dutifully set endianness
> to LE.
>
>     ByteBuffer bb = ByteBuffer.allocate(16);
>
>     bb.order(ByteOrder.LITTLE_ENDIAN);
>
> Later, suppose I need to do any one of the following common operations:
> slice(), duplicate() or asReadOnlyBuffer().
>
> *The ByteBuffer silently reverts back to BigEndian!*
>
> So the engineer must magically know to always reset the desired
> endianness after every one of those common operations. And, by the way,
> this is not documented in the Javadocs anywhere I could find.
>
> This is the cause of many difficult to find bugs! In fact we have cases
> where in large segments of data that have been stored into historical
> archives, the same segment will have different parts of it encoded with
> LE and other parts in BE! This is a maintenance nightmare.
>
> This bug is easy to find in the ByteBuffer source code. The calls to
> slice(), duplicate() and asReadOnlyBuffer() return a new ByteBuffer
> without copying over the current state of Endianness.

I found this issue annoying too - and encountered it many times when
writing tests against the memory access API and comparing results with
ByteBuffer. As for the BIG_ENDIAN default, I guess any default is going
to be doomed one way or another here (although, yes, LITTLE_ENDIAN is way
more common) - but I think having endianness explicit in the dereference
is even better :-)

(Our static accessors assume native endianness if no endianness argument
is specified - which can be a source of subtle issues, but we tried to
strike a balance between generality and usability.)
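
A small demonstration of the pitfall Lee describes, and of two ways
around it: re-applying the order after each derived view, or putting the
endianness into the access itself via the stock JDK
MethodHandles.byteBufferViewVarHandle (available since Java 9), which
ignores the buffer's own order(). SliceOrder is an illustrative name.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class SliceOrder {
    // Endianness carried by the handle; the buffer's order() is not consulted.
    static final VarHandle INT_LE =
            MethodHandles.byteBufferViewVarHandle(int[].class, ByteOrder.LITTLE_ENDIAN);

    public static void main(String[] args) {
        ByteBuffer bb = ByteBuffer.allocate(16).order(ByteOrder.LITTLE_ENDIAN);
        bb.putInt(0, 1);                      // bytes 01 00 00 00

        ByteBuffer slice = bb.slice();        // derived view: order is BIG_ENDIAN again
        System.out.println(slice.order());    // prints BIG_ENDIAN
        System.out.println(slice.getInt(0));  // prints 16777216 (0x01000000)

        // Option 1: re-apply the order after every slice()/duplicate()/asReadOnlyBuffer().
        ByteBuffer fixed = bb.duplicate().order(bb.order());
        System.out.println(fixed.getInt(0));  // prints 1

        // Option 2: make endianness part of the access, not of the buffer.
        ByteBuffer ro = bb.asReadOnlyBuffer(); // also BIG_ENDIAN
        System.out.println((int) INT_LE.get(ro, 0)); // prints 1
    }
}
```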

>
> This is why in our Memory Package implementation we made endianness
> immutable once it is chosen, and all equivalent calls to slice(),
> duplicate(), etc. retain the state of endianness.
>
> 2. ByteBuffer array handling is clumsy. It was designed strictly from an
> IO streaming use-case with no alternative for absolute addressing like
> the single primitive methods. The BB API is
>
>     ByteBuffer put(<type>[] src, int srcOffset, int length);
>
>
> Our use case has the need to put or get an array at an absolute offset
> from the beginning of the buffer. For example,
>
>     ByteBuffer put(long bufferOffset, <type>[] src, int srcOffset, int length);
>
>
> Attempting to replicate this method with the current BB API requires:
>
>   * Saving the current setting of position and limit (if used)
>   * Setting the position, computing and perhaps checking the limit
>   * executing the put() above,
>   * restoring position and limit.
>
> This is a real PITA, and could be so easily solved with a few easy to add
> methods.

I think absolute bulk methods are now part of the BB API (they were
added in Java 13) - I can definitely see a put method which takes an
absolute index:

https://docs.oracle.com/en/java/javase/14/docs/api/java.base/java/nio/ByteBuffer.html#put(int,byte%5B%5D,int,int)
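
For comparison, here is the save/move/restore dance Lee describes next
to the absolute bulk put(int index, byte[] src, int offset, int length)
that ByteBuffer gained in Java 13. putAtLegacy is a hypothetical helper
written only for this comparison.

```java
import java.nio.ByteBuffer;

final class AbsolutePut {
    // Pre-Java 13 emulation of an absolute bulk put: save the position,
    // move it, do the relative put, restore it. Not thread-safe, and any
    // mark greater than `index` is silently discarded by position(index).
    static void putAtLegacy(ByteBuffer bb, int index, byte[] src, int off, int len) {
        int saved = bb.position();
        bb.position(index);
        bb.put(src, off, len);
        bb.position(saved);
    }

    public static void main(String[] args) {
        byte[] src = {10, 20, 30, 40};
        ByteBuffer a = ByteBuffer.allocate(8);
        ByteBuffer b = ByteBuffer.allocate(8);

        putAtLegacy(a, 3, src, 1, 2); // writes 20, 30 at indices 3 and 4
        b.put(3, src, 1, 2);          // Java 13+: same effect, position untouched

        System.out.println(a.equals(b) + " " + a.position() + " " + b.position()); // prints true 0 0
    }
}
```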

> 3. There is no method that allows a high-performance (system level)
> copy of a region of one ByteBuffer to another ByteBuffer without going
> through the heap. This is so easy to do with Unsafe, I hope you have
> the ability to do this with MemorySegments. What we need is something
> like
>
>     static void copy(MemorySegment src, long srcOffsetBytes,
>             MemorySegment dst, long dstOffsetBytes, long lengthBytes)
>

We have something like that in the MemorySegment API - namely 
MemorySegment::copyFrom.

You take two segments, slice them as required, then call copyFrom to 
(bulk) copy contents of one slice into the other slice. No intermediate 
copy required.

That said, I believe here the BB API also caught up recently:

https://github.com/openjdk/jdk/commit/a50fdd54

You can see a new method to copy a ByteBuffer into another within given
region boundaries.
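
The "slice, then bulk copy" recipe can be mimicked with plain
ByteBuffers, as a rough analogue of MemorySegment::copyFrom (not the
segment API itself). copy is a hypothetical helper modeled on Lee's
proposed signature, with int offsets where he asked for longs; it relies
on ByteBuffer.slice(int index, int length), available since Java 13.

```java
import java.nio.ByteBuffer;

final class RegionCopy {
    // Slice a view of each region, then bulk-copy one view into the other.
    // No intermediate Java array is involved, and neither buffer's
    // position, limit or mark is touched.
    static void copy(ByteBuffer src, int srcOffset, ByteBuffer dst, int dstOffset, int length) {
        dst.slice(dstOffset, length).put(src.slice(srcOffset, length));
    }

    public static void main(String[] args) {
        ByteBuffer src = ByteBuffer.allocateDirect(16);
        ByteBuffer dst = ByteBuffer.allocateDirect(16);
        for (int i = 0; i < 16; i++) {
            src.put(i, (byte) i);
        }
        copy(src, 4, dst, 8, 4); // dst[8..11] = src[4..7]
        System.out.println(dst.get(8) + " " + dst.get(11) + " " + dst.position()); // prints 4 7 0
    }
}
```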

>
> Since there are no Java arrays involved, the length could be a long.
> Under the covers, you could easily go parallel with multiple threads if
> the size is big.

Long vs. int is the real problem here, and it cannot be addressed by the
BB API. I believe the other issues you list under (3) have more to do
with the Java 8 version of the BB API.
>
> 4. The handling of the positional values is also clumsy IMHO where, for
> example, the mark is silently invalidated. Agreed, this is documented,
> but remembering the rules under which the positionals are silently
> changed can be difficult unless you do it all the time. I designed a
> different positional system
> <https://datasketches.apache.org/api/memory/snapshot/apidocs/index.html>
> (see BaseBuffer) where there is no need to invalidate them.

My personal feeling on this is that the ByteBuffer position/limit/mark
machinery is specifically designed to simplify IO operations, encoders
and the like. Carrying it around for pure off-heap usage is always going
to be problematic: users have to remember which API (relative or
absolute) to use, otherwise one client might affect the others. It also
makes the implementation more stateful, and less likely to optimize well
(at least on paper - in practice, from what we've seen, the performance
of BB often matches that of Unsafe access, especially after Java 14).
The MemorySegment API adopts a "less is better" approach here, as there
is no relative positioning in that API - the only "mutable" bit of state
in a memory segment is the bit that tells you whether the segment is
alive or not - everything else is a constant.
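
A small sketch of the statefulness problem with ByteBuffer: two versions
of the same hypothetical "client", one using relative gets (which moves
the position seen by every other user of the buffer) and one using
absolute gets, plus the mark invalidation Lee mentions. All names are
illustrative.

```java
import java.nio.ByteBuffer;
import java.nio.InvalidMarkException;

final class SharedPosition {
    // Relative gets: silently advances the shared buffer's position.
    static int sumRelative(ByteBuffer bb, int count) {
        int sum = 0;
        for (int i = 0; i < count; i++) {
            sum += bb.getInt();
        }
        return sum;
    }

    // Absolute gets: no shared state is touched.
    static int sumAbsolute(ByteBuffer bb, int offset, int count) {
        int sum = 0;
        for (int i = 0; i < count; i++) {
            sum += bb.getInt(offset + i * Integer.BYTES);
        }
        return sum;
    }

    public static void main(String[] args) {
        ByteBuffer bb = ByteBuffer.allocate(16);
        for (int i = 0; i < 4; i++) {
            bb.putInt(i * Integer.BYTES, i + 1); // ints 1 2 3 4
        }

        System.out.println(sumAbsolute(bb, 0, 2) + " " + bb.position()); // prints 3 0
        System.out.println(sumRelative(bb, 2) + " " + bb.position());    // prints 3 8

        bb.mark();      // mark = 8
        bb.position(4); // 4 < 8, so the mark is silently discarded
        try {
            bb.reset();
        } catch (InvalidMarkException e) {
            System.out.println("mark was discarded");
        }
    }
}
```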

Maurizio

>
> I hope you find this of interest.
>
> Cheers,
>
> Lee.