Part 1 Proposal for JDK-8264594

Thu May 20 09:00:27 UTC 2021

On 20/05/2021 02:34, leerho wrote:
> Maurizio,
> Sorry about my radio silence, I was pulled onto another urgent 
> project, but now I'm back :)
> From your email above:
>
>     E.g. let's keep the MemorySegment API thin, by adding only one method:
>
>     /copyFrom(MemorySegment, MemoryLayout elemLayout, ByteOrder order)/
>
>     Which generalizes the existing copyFrom, allowing for swap copy.
>
>
> I am a little unclear how the user would use this to copy some 
> primitive array to/from a MemorySegment.  Let me try:

There were two parts to my proposal :-)

1. add a copyFrom method which works from segment <-> segment but also 
takes an endianness/swappiness
2. add a MemoryCopy (parallel with MemoryAccess) full of static methods 
to copy things from java arrays to segments and back

You need (1) to implement (2). (1) is a primitive, (2) is a class with 
user convenience static methods.

Maurizio

>
> For example: to copy a double array into a MemorySegment would require:
>
> //Given:
> int arrLen = N;
> double[] srcArr = new double[arrLen];
> ResourceScope scope = ...
> MemorySegment dstSeg = ... big enough, but we need to load at a dstOffset.
>
> //prepare the source segment:
> MemoryLayout doubleLayout = valueLayout(64, ByteOrder.nativeOrder);
> MemoryLayout seqLayout = sequenceLayout(arrLen, doubleLayout);
> MemorySegment srcSeg = MemorySegment.ofArray(srcArr);
>
> //prepare the destination segment:
> dstSegSlice = dstSeg.asSlice(dstOffset, arrLen*8);
> dstSeg.copyFrom(srcSeg, seqLayout, ByteOrder.nativeOrder);
>
> Whew!   This can't be right. I must be doing something wrong.  How did 
> you intend for this new method to be used?
>
> Cheers,
>
> Lee.
>
>
>
>
>
>
>
> On Thu, Apr 29, 2021 at 3:23 AM Maurizio Cimadamore 
> <maurizio.cimadamore at oracle.com 
> <mailto:maurizio.cimadamore at oracle.com>> wrote:
>
>
>     On 29/04/2021 05:49, leerho wrote:
>>     Maurizio,
>>
>>     First, I would like to clear up some misunderstandings.
>>
>>     1. Re overloading with/without ByteOrder:
>>
>>         My feeling is that having a "more friendly" overload for what
>>         is a pretty advanced method anyway feels some sort of middle
>>         ground that would double the size of the API w/o really
>>         making it more usable (I don't thing users like you would be
>>         bothered by an extra `nativeOrder()` suffix). At least that's
>>         what I was assuming.
>>
>>
>>     I was trying to make the distinction between 2 vs 4 methods,
>>     assuming Object array), and 14 vs 28 methods assuming individual
>>     XYZArrays. The second case is clearly a more egregious impact on
>>     the API so I would definitely agree that not overloading the
>>     ByteOrder makes sense. *And yes, I can live with that :)*
>>
>>     The 2 vs 4 methods refers to the next item.
>     Yeah - I realized later that you were considering different cost
>     vs. benefit ratios in the two proposals - in one case it seems to
>     cost more than the other, I agree. (I also agree, in general, that
>     if we could provide a ByteOrder-less version, it would be nice to
>     have).
>>
>>     2. Use of Object array:
>>
>>         Accepting an Object parameter is inconsistent with e.g. the
>>         MemorySegment.of(...) factories, which take sharp array
>>         types. In principle those factories could be compressed down
>>         into one (in fact the API started off that way, but we went
>>         back for performance issues).
>>
>>     That is why I posed the question wrt use of Object array:
>>
>>         Concern: Performance.  I would think we would be moving work
>>         that could be done at compile time (e.g., what type of array
>>         is it? ) to run time.   Or would Hotspot figure this out ?
>>
>>
>>     I wasn't sure, which is why I asked. And you indicated that
>>     performance with an Object array is an issue *and I believe you* :)
>
>     There is a difference between MemorySegment.of(Object) and
>     MemorySegment.copyInto(Object). In the first operation we are
>     "creating" a new segment; I have observed that, in general, it is
>     better to have creation of new instances as a transparent process
>     as possible: if different methods create different concrete Java
>     instances (for instance MemorySegment::ofArray(int[]) creates a
>     different object than MemorySegment::ofArray(double[])) the code
>     tends to be more "transparent" to the optimizing compiler, and
>     typically that works better than having a single factory which
>     does a bunch of stuff/control flow and might return different
>     objects (which will necessarily create some pollution). I believe
>     the case of memory copy is not dissimilar from what
>     MemoryAccess.getInt does: note that you just pass a `memory
>     segment` - we don't have overloads for "heap segment", "mapped
>     segment" and "native segment" and yet C2 is able to figure that
>     out (because we turned enhanced profiling on for these methods). A
>     similar trick, I'm sure could be done here, if we were willing to.
>     Of course we need benchmarks to test and validate the approach.
>
>>
>>     3. Native implementations:
>>
>>         You seem to assume that "native implementation" == better
>>         performance.
>>
>>
>>     No, I don't really.  My only comment about native methods is that
>>     I don't know where to find the C++ code so I can't figure out
>>     what is going on under the covers.  Additionally, as you pointed
>>     out, the C2 hotspot compiler has additional profiling it can take
>>     advantage of ... and I even know less about that! *So I'll leave
>>     it to you what methods should be made native*.
>
>     So, there are two routines for copying memory: System::arrayCopy
>     and Unsafe::copyMemory. Performance-wise they behave the same - I
>     believe one minor difference is that System.arrayCopy inserts some
>     GC safe point so that, in the middle of a big copy, it can still
>     pause for GC - Unsafe::copyMemory (the one we use) doesn't do
>     that. They are both native routines, and are both "intrinsified"
>     by the VM - which means that when they are called enough time, the
>     VM replaces them with bits of machine code generated by the
>     optimizing compiler. This is where information about types becomes
>     crucial: if, when the intrinsification occurs, C2 can prove that
>     e.g. you are copying from off-heap into a int[], then a sharper
>     copy can be performed. If the types are opaque (so C2 doesn't know
>     whether they are on/off heap), then performance suffers. So, as
>     long as the types are kept "sharp", this is typically not an issue.
>
>     There's also another important consideration: copyMemory (or
>     System::arrayCopy) is, relatively speaking, a "slow" operation
>     (since it can touch many addresses). In my experience it is way
>     less likely to observe performance issues with this routines than
>     it is with a single-shot memory access (which boils down to a
>     single MOV, so every extra overhead counts big time!).
>
>
>>
>>     *Comment*: In our Apache DataSketches library we have implemented
>>     all of our sketching algorithms in both C++ and in Java.  And we
>>     have experienced a number of cases where our highly-tuned java
>>     implementation handily beats a casually constructed C++
>>     implementation!   We have found that there are similar issues in
>>     both languages.  If the developer tends to over-rely on the
>>     general-purpose STL librarys in C++ it will impact performance in
>>     the same way that over-reliance on the Java Collections library
>>     will.  But a well-tuned C++ implementation will generally beat a
>>     well-tuned Java implementation, but not by very much!  And with
>>     this Panama module, I am hoping we can make the difference even less.
>>
>>     I think we can agree that the above 3 issues can be put aside and
>>     don't need further discussion..
>     ok
>>
>>     ****
>>     Back to the method design:
>>
>>         I think if we use an instance method we should just drop the
>>         length parameter from the method and infer the length from
>>         the receiver segment.
>>
>>
>>     This does present some usability concerns as I will highlight next.
>>
>>     Thoughts on your 2 X 2 table:
>>     I think we don't need to focus on the two cases where either the
>>     segment or the array does not exist.
>
>     Sure - I was showing them to demonstrate the emerging symmetry/pattern
>
>
>>     These can already be handled with the current API.  Now for the
>>     other two cases:
>>
>>     1. From array to existing segment, you wrote:
>>
>>         segment [.asSlice(...)] .copyFrom(array, index, order)
>>
>>
>>     My translation:
>>
>>         dstSegment.asSlice(long dstOffsetBytes, long
>>         dstLengthBytes).copyFromArray(XYZ[] srcArray, int srcIndex,
>>         ByteOrder order);
>>
>>       * The user generally has better knowledge of the array
>>         coordinates (start, length). Here the user would have to
>>         translate from array srcElements to dstLengthBytes.  I don't
>>         consider this friendly.
>>       * I don't think you intended to overload the existing
>>         copyFrom(segment), thus, copyFromArray(...).
>>       * The new methods /copyFromArray(XYZ[] srcArray, srcIndex,
>>         order) are /not as general purpose as they would it be if it
>>         included the array length argument, but leads to the problem
>>         of dual specification of the length both in bytes and array
>>         elements.
>>
>     I think of these three I can see the point of the first bullet -
>     but it feels like having offset/length on the copyFrom feels
>     slightly redundant (since asSlice does kind of the same thing, and
>     that's what we tell people to do for segment->segment copyFrom).
>     Anyway, point taken.
>>     2. From segment to existing array, you wrote:
>>
>>         segment [.asSlice(...)] .toXYZArray(array, index, order)
>>
>>
>>     My translation:
>>
>>         srcSegment.asSlice(long dstOffsetBytes, long
>>         dstLengthBytes).copyToArray(XYZ[] dstArray, int dstIndex,
>>         ByteOrder order)
>>
>>       * Again, the user would have to translate from his dstElements
>>         to dstLengthBytes.
>>       * Again, the new methods /copyToArray(XYZ[] dstArray, dstIndex,
>>         order)/ are not as general purpose as mentioned above. So the
>>         above comment about the copyFromArray(...) apply here as well.
>>
>>     My conclusion, so far, is that splitting our task into 7 new
>>     copyFromArray(...) methods and 7 new copyToArray(...) methods
>>     that must operate in a chain, provides a clumsy solution.
>
>     I think "clumsy" here might be a bit of a glass-half-empty
>     comment. And some points as to why the glass is perceived to be
>     half empty are probably well taken (e.g. translation of
>     coordinates) but I think there is a general "I don't want to
>     slice" feel to all this, which sounds like jumping into premature
>     optimization. I'd like to make sure we resist that temptation, no
>     matter what we end up doing in the end.
>
>
>
>>
>>     *So what is our task?*
>>
>>     What we want our method(s) to accomplish, *at the lowest level,*
>>     is actually very simple.  We want to copy /lengthBytes/ from one
>>     blob of bytes to another blob of bytes, each with a specific
>>     offset.  As a diagram, it looks like this:
>>     ArrayCopyForMemorySegment.png
>>     Let's forget about arrays of XYZ for a moment.  We could
>>     accomplish this with a simple method like :
>>
>>         void copy(source, srcOffsetBytes, destination,
>>         dstOffsetBytes, lengthBytes);
>>
>>
>>     The beauty of the unsafe methods is that under the covers,
>>     everything is bytes.  And if you /unsafe.putInt//(offset, int)/
>>     into memory, unsafe views it as just 4 bytes and it forgets that
>>     it ever was an integer.
>>
>>     A MemorySegment is just a blob of bytes somewhere in memory.  So
>>     if we just had
>>
>>         void copy(srcSegment, srcOffsetBytes, dstSegment,
>>         dstOffsetBytes, lengthBytes);
>>
>>
>>     *This would already be a powerful method on its own!  (And we
>>     don't have this!)*
>
>     Ok, now the thing I was afraid of (premature optimization) is
>     sneaking in full - it's not true we don't have support for the
>     above - we do, but not in the way you described:
>
>     sourceSegment.slice(....).copyFrom(destSegment.slice(...))
>
>     Combining copyFrom and asSlice you can obtain whatever combination
>     of copying you can obtain with low level Unsafe::copyMemory.
>
>     We're not talking about missing primitives here (it's all in
>     there). And, simplicity is often in the eye of the beholder: you
>     seem to imply that a method taking 5 (!!) arguments is better than
>     having a 1-arg method and moving the slicing closer to where it
>     belongs. I think I object to that. In the slicing version there's
>     no ambiguity as to what parameters refer to.
>
>     The main reason as to why I would see someone prefer the 5
>     argument version is, again, some notion that slicing costs too much.
>
>
>     Anyway - copying segment to segment is a completely different
>     problems from copying segments into an existing array. I believe
>     the former is addressed by the API (and I don't see a significant
>     margin of improvements there). In the latter case, I agree there
>     is too much convoluted-ness going on, with creating a temp segment
>     for holding heap memory and what's not.
>
>
>>     In addition, we would like to enhance this method so that either
>>     the source or the destination could be one of the 7 primitive
>>     Java arrays.
>>
>>     The task for the method that needs to do this is trivial: To
>>     convert from array units to bytes is a simple left shift. To
>>     convert from bytes to array units is a simple right-shift
>>     instruction.  Both of these are single-cycle CPU instructions.
>>     The /copy/ should be a low-level bulk copy operation. This method
>>     should be as fast as a comparable operation in C/C++.  I don't
>>     see any reason why this couldn't be implemented as a native
>>     method.  In fact, the lowest level of this operation could
>>     probably be accomplished with a few machine instructions.
>>     (However, a really powerful bulk array copy that takes advantage
>>     of cache-line sizes and non-aligned offsets is more complex; but
>>     this kind of copy method should already exist in the JVM.)
>>
>>     For arrays, the simplest solution, IMHO, would be 7 copyToArray()
>>     methods and 7 copyFromArray() methods structured as follows:
>>
>>         void copyFromArray(XYZ[] srcArray, int srcStartIndex, int
>>         srcElements,
>>             MemorySegment dstSegment, long dstOffsetBytes, ByteOrder
>>         order);
>>
>>     and
>>
>>         void copyToArray(MemorySegment srcSegment, long srcOffsetBytes,
>>             XYZ[] dstArray, int dstStartIndex, int dstElements,
>>         ByteOrder order) 
>>
>>       * The simple translation of array units to bytes and the
>>         reverse is handled by the method.
>>         Thus, the user does not need to bother doing this, which
>>         helps eliminate human error.
>>       * There is no duplication of the length specification, it is
>>         only specified once.
>>       * The method names are self explanatory including the relevant
>>         units, where relevant
>>       * As these methods are self contained and with no escapes. It
>>         could easily be a static method.
>>       * These methods solve the complete problem. No need to chain
>>         multiple methods together.
>>       * These methods could be part of MemorySegment, MemoryAccess,
>>         or its own class.
>>
>     If this path is preferred, and we also join in with Radoslaw
>     comment, I think I'd like to add all the required overloads
>     (including ByteOrder-less ones) into a new class: MemoryCopy.
>
>     E.g. let's keep the MemorySegment API thin, by adding only one method:
>
>     copyFrom(MemorySegment, MemoryLayout elemLayout, ByteOrder order)
>
>     Which generalizes the existing copyFrom, allowing for swap copy.
>
>     And then let's build on this, and provide several static methods
>     on the side, like we do with MemoryAccess.
>
>     Maurizio
>
>
>>     Cheers, and thank you for your patience!
>>
>>     Lee.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>     On Wed, Apr 28, 2021 at 2:49 AM Maurizio Cimadamore
>>     <maurizio.cimadamore at oracle.com
>>     <mailto:maurizio.cimadamore at oracle.com>> wrote:
>>
>>         Hi Lee
>>         IMHO some of the disadvantages you mention about 2 are a bit
>>         accidental - and not dissimilar to what happens here:
>>
>>         https://docs.oracle.com/en/java/javase/16/docs/api/java.base/java/util/Collection.html#toArray(T%5B%5D)
>>         <https://docs.oracle.com/en/java/javase/16/docs/api/java.base/java/util/Collection.html#toArray(T%5B%5D)>
>>
>>         But I can understand that this argument might not be
>>         persuading you.
>>
>>         For endianness, I think that there is a difference between:
>>
>>         * I have a segment and I want to read an int
>>         * I have a segment and I want to copy parts of it in an
>>         _existing_ array
>>
>>         The former is super common, the latter feels like something
>>         you reach for in more advanced use cases, honestly. My
>>         feeling is that having a "more friendly" overload for what is
>>         a pretty advanced method anyway feels some sort of middle
>>         ground that would double the size of the API w/o really
>>         making it more usable (I don't thing users like you would be
>>         bothered by an extra `nativeOrder()` suffix). At least that's
>>         what I was assuming.
>>
>>         As for (1) - some comments:
>>
>>         * you seem to assume that "native implementation" == better
>>         performance. I don't think that's the case. And I don't see
>>         why you think it'd be more elegant that way. The main thing
>>         that needs to be done, if we go down this path, is to keep
>>         profiling information sharp - that is, if you call
>>         copyToArray with a byte[] and then with a short[] you want C2
>>         to work with different type profile information. For this
>>         there is a technique in hotspot, called argument type
>>         profiling, which is used for pseudo static methods in Unsafe
>>         (and in MemoryAccess) which I think we can use here.
>>
>>         * I don't see why it has to be a "static" method, or why a
>>         static method is perceived to be better. The issue with
>>         static methods is that they don't mix with slicing:
>>
>>         MemorySegment.copyToArray(sourceSegment.asSlice(), ....)
>>
>>         this is worse than:
>>
>>         sourceSegment.asSlice().copyToArray(...)
>>
>>         (btw, this avoids, at least in part, some of the type
>>         profiling pot-holes highlighted above, as receiver type
>>         information is always speculated upon, w/o doing anything
>>         special). So, two instances methods copyToArray and
>>         copyFromArray don't sound less elegant to me (actually more so)
>>
>>         * Accepting an Object parameter is inconsistent with e.g. the
>>         MemorySegment.of(...) factories, which take sharp array
>>         types. In principle those factories could be compressed down
>>         into one (in fact the API started off that way, but we went
>>         back for performance issues).
>>
>>
>>         in both approaches, I think if we use an instance method we
>>         should just drop the length parameter from the method and
>>         infer the length from the receiver segment. Here's a table I
>>         prepared yesterday night thinking about all possible
>>         combinations (this assumed approach (1)):
>>
>>         * from array to segment
>>             - segment does not exist: MemorySegment.of(array)
>>         [.asSlice(...)]
>>             - segment does exist: segment [.asSlice(...)]
>>         .copyFrom(array, index, order)
>>         * from segment to array
>>             - array does not exist: segment [.asSlice(...)] .toXYZArray()
>>             - arrays does exist: segment [.asSlice(...)]
>>         .toXYZArray(array, index, order)
>>
>>         (this still doesn't seem to be that bad - if toXYZArray name
>>         bothers you, just replace that with copyToArray - the rest
>>         stays the same).
>>
>>         Maurizio
>>
>>         On 28/04/2021 01:43, leerho wrote:
>>>         Maurizio,
>>>
>>>         This is great.  You have opened up several new possibilities
>>>         and I'd like to explore some of them.
>>>
>>>         1. The use of /Object/, a la /System.arrayCopy(Object src,
>>>         int srcPos, Object dest, int destPos, int length)./
>>>         There are several interesting ideas (and concerns) that spin
>>>         off of this idea.
>>>
>>>          1. First, is that System.arrayCopy() is a native method so
>>>             all the work is under the covers, which I cannot see (I
>>>             don't know how to find the C++ code that implements this
>>>             method.)
>>>               * Advantage: This would greatly reduce the number of
>>>                 methods needed (as you pointed out).
>>>               * Concern: Performance.  I would think we would be
>>>                 moving work that could be done at compile time
>>>                 (e.g., what type of array is it?) to run time.   Or
>>>                 would Hotspot figure this out ?
>>>               * Advantage: Potentially this could allow boxed array
>>>                 types as the native code could just figure this out
>>>                 and do the right thing.
>>>               * Concern: because of its similarity to
>>>                 /System.arrayCopy()/, the user might be tempted to
>>>                 use it for array types that are completely
>>>                 inappropriate.  (Just throw an exception.)
>>>          2. This /Object/ idea also led me to: What if we extended
>>>             the C++ code underneath arrayCopy() to allow one or both
>>>             of the Objects to be a MemorySegment?
>>>               * Make it clear that both Pos arguments and the length
>>>                 argument would have to be in bytes and these
>>>                 arguments would have to become longs.  This fact
>>>                 alone will kill this idea as it raises all kinds of
>>>                 issues. Terrible idea, forget this one!
>>>          3. What about this idea. Two /static void arrayCopy-type/
>>>             methods in MemorySegment: (or 4 if we overload with
>>>             ByteOrder):
>>>               * /copyToArray(MemorySegment srcSegment, long
>>>                 srcOffsetBytes, Object dstArray, int dstIndex, int
>>>                 dstElements)/;
>>>               * /copyFromArray(Object srcArray, int srcIndex, int
>>>                 srcElements, MemorySegment dstSegment, long
>>>                 dstOffsetBytes)/;
>>>               * If these are implemented as native methods AND if
>>>                 there is not a runtime performance burden, THIS
>>>                 would be elegant!
>>>
>>>         2. "split" the overloads between copyFrom and toXYZ arrays.
>>>
>>>           * Is this what you had in mind? (7 or 14 of these pairs)
>>>               o /int[] dstArray = srcSegment.asSlice(long
>>>                 offsetBytes, long lengthBytes).toIntArray(int[]
>>>                 dstArray, int dstIndex, int dstElements)/;
>>>                 Because toIntArray returns an array, this changes
>>>                 the behavior whereby the array is provided and
>>>                 returned. Clunky.
>>>                 This requires specifying the length twice, once in
>>>                 bytes, the other in elements. :-(
>>>               o /dstSegment.asSlice(long offsetBytes, long
>>>                 lengthBytes).copyFrom(int[] srcArray, int srcIndex,
>>>                 int srcElements)/;
>>>                 This also requires specifying the length twice, once
>>>                 in bytes, the other in elements.  :-(
>>>           * Perhaps you meant this? (7 or 14 of these pairs)
>>>               o /int[] dstArray = srcSegment.toIntArray(long
>>>                 srcOffsetBytes, int[] dstArray, int dstIndex, int
>>>                 dstElements)/;
>>>                 This still has a clunky return of the dstArray that
>>>                 isn't really needed. Perhaps this variant would be a
>>>                 void return?
>>>               o /dstSegment.copyFrom(int[] srcArray, int srcIndex,
>>>                 int srcElements, long dstOffsetBytes)/;
>>>
>>>         I think I like 1.3 the best.
>>>
>>>           * Only 2 or 4 new methods (let's assume 4)
>>>           * But it may require the most under-the-covers work, which
>>>             would be in C++.  I'm hoping some of the code from
>>>             System.arrayCopy() could be leveraged.
>>>
>>>         *Note*: I think there is a good reason why MemoryAccess
>>>         overloads all the methods with ByteOffset set to the
>>>         default: and that is because 90% of the use-cases and users
>>>         want the default and don't really want to always be
>>>         specifying ByteOrder, even if it is only a static method and
>>>         quite fast; it would still be viewed as unnecessary!  If we
>>>         are choosing between only 2 or 4 added methods, overloading
>>>         with the default ByteOrder should not be a big deal.
>>>
>>>         Cheers,
>>>
>>>         Lee.
>>>
>>>         /
>>>         /
>>>
>>>
>>>         On Tue, Apr 27, 2021 at 2:13 PM Maurizio Cimadamore
>>>         <maurizio.cimadamore at oracle.com
>>>         <mailto:maurizio.cimadamore at oracle.com>> wrote:
>>>
>>>
>>>             On 27/04/2021 21:50, Maurizio Cimadamore wrote:
>>>             > is just one native call away, after all.
>>>
>>>             Whoops - I meant "static" not, "native", of course :-)
>>>
>>>             Maurizio
>>>