Part 1 Proposal for JDK-8264594

Thu Apr 29 10:23:14 UTC 2021

On 29/04/2021 05:49, leerho wrote:
> Maurizio,
>
> First, I would like to clear up some misunderstandings.
>
> 1. Re overloading with/without ByteOrder:
>
>     My feeling is that having a "more friendly" overload for what is a
>     pretty advanced method anyway feels some sort of middle ground
>     that would double the size of the API w/o really making it more
>     usable (I don't thing users like you would be bothered by an extra
>     `nativeOrder()` suffix). At least that's what I was assuming.
>
>
> I was trying to make the distinction between 2 vs 4 methods, assuming 
> Object array), and 14 vs 28 methods assuming individual XYZArrays.  
> The second case is clearly a more egregious impact on the API so I 
> would definitely agree that not overloading the ByteOrder makes sense. 
> *And yes, I can live with that :)*
>
> The 2 vs 4 methods refers to the next item.
Yeah - I realized later that you were considering different cost vs. 
benefit ratios in the two proposals - in one case it seems to cost more 
than the other, I agree. (I also agree, in general, that if we could 
provide a ByteOrder-less version, it would be nice to have).
>
> 2. Use of Object array:
>
>     Accepting an Object parameter is inconsistent with e.g. the
>     MemorySegment.of(...) factories, which take sharp array types. In
>     principle those factories could be compressed down into one (in
>     fact the API started off that way, but we went back for
>     performance issues).
>
> That is why I posed the question wrt use of Object array:
>
>     Concern: Performance.  I would think we would be moving work that
>     could be done at compile time (e.g., what type of array is it? )
>     to run time.   Or would Hotspot figure this out ?
>
>
> I wasn't sure, which is why I asked. And you indicated that 
> performance with an Object array is an issue *and I believe you* :)

There is a difference between MemorySegment.of(Object) and 
MemorySegment.copyInto(Object). In the first operation we are "creating" 
a new segment; I have observed that, in general, it is better to have 
creation of new instances as a transparent process as possible: if 
different methods create different concrete Java instances (for instance 
MemorySegment::ofArray(int[]) creates a different object than 
MemorySegment::ofArray(double[])) the code tends to be more 
"transparent" to the optimizing compiler, and typically that works 
better than having a single factory which does a bunch of stuff/control 
flow and might return different objects (which will necessarily create 
some pollution). I believe the case of memory copy is not dissimilar 
from what MemoryAccess.getInt does: note that you just pass a `memory 
segment` - we don't have overloads for "heap segment", "mapped segment" 
and "native segment" and yet C2 is able to figure that out (because we 
turned enhanced profiling on for these methods). A similar trick, I'm 
sure could be done here, if we were willing to. Of course we need 
benchmarks to test and validate the approach.

>
> 3. Native implementations:
>
>     You seem to assume that "native implementation" == better performance.
>
>
> No, I don't really.  My only comment about native methods is that I 
> don't know where to find the C++ code so I can't figure out what is 
> going on under the covers.  Additionally, as you pointed out, the C2 
> hotspot compiler has additional profiling it can take advantage of ... 
> and I even know less about that! *So I'll leave it to you what methods 
> should be made native*.

So, there are two routines for copying memory: System::arrayCopy and 
Unsafe::copyMemory. Performance-wise they behave the same - I believe 
one minor difference is that System.arrayCopy inserts some GC safe point 
so that, in the middle of a big copy, it can still pause for GC - 
Unsafe::copyMemory (the one we use) doesn't do that. They are both 
native routines, and are both "intrinsified" by the VM - which means 
that when they are called enough time, the VM replaces them with bits of 
machine code generated by the optimizing compiler. This is where 
information about types becomes crucial: if, when the intrinsification 
occurs, C2 can prove that e.g. you are copying from off-heap into a 
int[], then a sharper copy can be performed. If the types are opaque (so 
C2 doesn't know whether they are on/off heap), then performance suffers. 
So, as long as the types are kept "sharp", this is typically not an issue.

There's also another important consideration: copyMemory (or 
System::arrayCopy) is, relatively speaking, a "slow" operation (since it 
can touch many addresses). In my experience it is way less likely to 
observe performance issues with this routines than it is with a 
single-shot memory access (which boils down to a single MOV, so every 
extra overhead counts big time!).

>
> *Comment*: In our Apache DataSketches library we have implemented all 
> of our sketching algorithms in both C++ and in Java.  And we have 
> experienced a number of cases where our highly-tuned java 
> implementation handily beats a casually constructed C++ 
> implementation!   We have found that there are similar issues in both 
> languages.  If the developer tends to over-rely on the general-purpose 
> STL librarys in C++ it will impact performance in the same way that 
> over-reliance on the Java Collections library will.  But a well-tuned 
> C++ implementation will generally beat a well-tuned Java 
> implementation, but not by very much!  And with this Panama module, I 
> am hoping we can make the difference even less.
>
> I think we can agree that the above 3 issues can be put aside and 
> don't need further discussion..
ok
>
> ****
> Back to the method design:
>
>     I think if we use an instance method we should just drop the
>     length parameter from the method and infer the length from the
>     receiver segment.
>
>
> This does present some usability concerns as I will highlight next.
>
> Thoughts on your 2 X 2 table:
> I think we don't need to focus on the two cases where either the 
> segment or the array does not exist.

Sure - I was showing them to demonstrate the emerging symmetry/pattern

> These can already be handled with the current API. Now for the other 
> two cases:
>
> 1. From array to existing segment, you wrote:
>
>     segment [.asSlice(...)] .copyFrom(array, index, order)
>
>
> My translation:
>
>     dstSegment.asSlice(long dstOffsetBytes, long
>     dstLengthBytes).copyFromArray(XYZ[] srcArray, int srcIndex,
>     ByteOrder order);
>
>   * The user generally has better knowledge of the array coordinates
>     (start, length). Here the user would have to translate from array
>     srcElements to dstLengthBytes.  I don't consider this friendly.
>   * I don't think you intended to overload the existing
>     copyFrom(segment), thus, copyFromArray(...).
>   * The new methods /copyFromArray(XYZ[] srcArray, srcIndex, order)
>     are /not as general purpose as they would it be if it included the
>     array length argument, but leads to the problem of dual
>     specification of the length both in bytes and array elements.
>
I think of these three I can see the point of the first bullet - but it 
feels like having offset/length on the copyFrom feels slightly redundant 
(since asSlice does kind of the same thing, and that's what we tell 
people to do for segment->segment copyFrom). Anyway, point taken.
> 2. From segment to existing array, you wrote:
>
>     segment [.asSlice(...)] .toXYZArray(array, index, order)
>
>
> My translation:
>
>     srcSegment.asSlice(long dstOffsetBytes, long
>     dstLengthBytes).copyToArray(XYZ[] dstArray, int dstIndex,
>     ByteOrder order)
>
>   * Again, the user would have to translate from his dstElements to
>     dstLengthBytes.
>   * Again, the new methods /copyToArray(XYZ[] dstArray, dstIndex,
>     order)/ are not as general purpose as mentioned above. So the
>     above comment about the copyFromArray(...) apply here as well.
>
> My conclusion, so far, is that splitting our task into 7 new 
> copyFromArray(...) methods and 7 new copyToArray(...) methods that 
> must operate in a chain, provides a clumsy solution.

I think "clumsy" here might be a bit of a glass-half-empty comment. And 
some points as to why the glass is perceived to be half empty are 
probably well taken (e.g. translation of coordinates) but I think there 
is a general "I don't want to slice" feel to all this, which sounds like 
jumping into premature optimization. I'd like to make sure we resist 
that temptation, no matter what we end up doing in the end.

>
> *So what is our task?*
>
> What we want our method(s) to accomplish, *at the lowest level,* is 
> actually very simple.  We want to copy /lengthBytes/ from one blob of 
> bytes to another blob of bytes, each with a specific offset.  As a 
> diagram, it looks like this:
> ArrayCopyForMemorySegment.png
> Let's forget about arrays of XYZ for a moment.  We could accomplish 
> this with a simple method like :
>
>     void copy(source, srcOffsetBytes, destination, dstOffsetBytes,
>     lengthBytes);
>
>
> The beauty of the unsafe methods is that under the covers, everything 
> is bytes.  And if you /unsafe.putInt//(offset, int)/ into memory, 
> unsafe views it as just 4 bytes and it forgets that it ever was an 
> integer.
>
> A MemorySegment is just a blob of bytes somewhere in memory.  So if we 
> just had
>
>     void copy(srcSegment, srcOffsetBytes, dstSegment, dstOffsetBytes,
>     lengthBytes);
>
>
> *This would already be a powerful method on its own!  (And we don't 
> have this!)*
**

Ok, now the thing I was afraid of (premature optimization) is sneaking 
in full - it's not true we don't have support for the above - we do, but 
not in the way you described:

sourceSegment.slice(....).copyFrom(destSegment.slice(...))

Combining copyFrom and asSlice you can obtain whatever combination of 
copying you can obtain with low level Unsafe::copyMemory.

We're not talking about missing primitives here (it's all in there). 
And, simplicity is often in the eye of the beholder: you seem to imply 
that a method taking 5 (!!) arguments is better than having a 1-arg 
method and moving the slicing closer to where it belongs. I think I 
object to that. In the slicing version there's no ambiguity as to what 
parameters refer to.

The main reason as to why I would see someone prefer the 5 argument 
version is, again, some notion that slicing costs too much.

Anyway - copying segment to segment is a completely different problems 
from copying segments into an existing array. I believe the former is 
addressed by the API (and I don't see a significant margin of 
improvements there). In the latter case, I agree there is too much 
convoluted-ness going on, with creating a temp segment for holding heap 
memory and what's not.

> In addition, we would like to enhance this method so that either the 
> source or the destination could be one of the 7 primitive Java arrays.
>
> The task for the method that needs to do this is trivial: To convert 
> from array units to bytes is a simple left shift. To convert from 
> bytes to array units is a simple right-shift instruction.  Both of 
> these are single-cycle CPU instructions. The /copy/ should be a 
> low-level bulk copy operation. This method should be as fast as a 
> comparable operation in C/C++.  I don't see any reason why this 
> couldn't be implemented as a native method.  In fact, the lowest level 
> of this operation could probably be accomplished with a few machine 
> instructions.  (However, a really powerful bulk array copy that takes 
> advantage of cache-line sizes and non-aligned offsets is more complex; 
> but this kind of copy method should already exist in the JVM.)
>
> For arrays, the simplest solution, IMHO, would be 7 copyToArray() 
> methods and 7 copyFromArray() methods structured as follows:
>
>     void copyFromArray(XYZ[] srcArray, int srcStartIndex, int srcElements,
>         MemorySegment dstSegment, long dstOffsetBytes, ByteOrder order);
>
> and
>
>     void copyToArray(MemorySegment srcSegment, long srcOffsetBytes,
>         XYZ[] dstArray, int dstStartIndex, int dstElements, ByteOrder
>     order) 
>
>   * The simple translation of array units to bytes and the reverse is
>     handled by the method.
>     Thus, the user does not need to bother doing this, which helps
>     eliminate human error.
>   * There is no duplication of the length specification, it is only
>     specified once.
>   * The method names are self explanatory including the relevant
>     units, where relevant
>   * As these methods are self contained and with no escapes. It could
>     easily be a static method.
>   * These methods solve the complete problem.  No need to chain
>     multiple methods together.
>   * These methods could be part of MemorySegment, MemoryAccess, or its
>     own class.
>
If this path is preferred, and we also join in with Radoslaw comment, I 
think I'd like to add all the required overloads (including 
ByteOrder-less ones) into a new class: MemoryCopy.

E.g. let's keep the MemorySegment API thin, by adding only one method:

copyFrom(MemorySegment, MemoryLayout elemLayout, ByteOrder order)

Which generalizes the existing copyFrom, allowing for swap copy.

And then let's build on this, and provide several static methods on the 
side, like we do with MemoryAccess.

Maurizio

> Cheers, and thank you for your patience!
>
> Lee.
>
>
>
>
>
>
>
>
>
>
>
>
> On Wed, Apr 28, 2021 at 2:49 AM Maurizio Cimadamore 
> <maurizio.cimadamore at oracle.com 
> <mailto:maurizio.cimadamore at oracle.com>> wrote:
>
>     Hi Lee
>     IMHO some of the disadvantages you mention about 2 are a bit
>     accidental - and not dissimilar to what happens here:
>
>     https://docs.oracle.com/en/java/javase/16/docs/api/java.base/java/util/Collection.html#toArray(T%5B%5D)
>     <https://docs.oracle.com/en/java/javase/16/docs/api/java.base/java/util/Collection.html#toArray(T%5B%5D)>
>
>     But I can understand that this argument might not be persuading you.
>
>     For endianness, I think that there is a difference between:
>
>     * I have a segment and I want to read an int
>     * I have a segment and I want to copy parts of it in an _existing_
>     array
>
>     The former is super common, the latter feels like something you
>     reach for in more advanced use cases, honestly. My feeling is that
>     having a "more friendly" overload for what is a pretty advanced
>     method anyway feels some sort of middle ground that would double
>     the size of the API w/o really making it more usable (I don't
>     thing users like you would be bothered by an extra `nativeOrder()`
>     suffix). At least that's what I was assuming.
>
>     As for (1) - some comments:
>
>     * you seem to assume that "native implementation" == better
>     performance. I don't think that's the case. And I don't see why
>     you think it'd be more elegant that way. The main thing that needs
>     to be done, if we go down this path, is to keep profiling
>     information sharp - that is, if you call copyToArray with a byte[]
>     and then with a short[] you want C2 to work with different type
>     profile information. For this there is a technique in hotspot,
>     called argument type profiling, which is used for pseudo static
>     methods in Unsafe (and in MemoryAccess) which I think we can use here.
>
>     * I don't see why it has to be a "static" method, or why a static
>     method is perceived to be better. The issue with static methods is
>     that they don't mix with slicing:
>
>     MemorySegment.copyToArray(sourceSegment.asSlice(), ....)
>
>     this is worse than:
>
>     sourceSegment.asSlice().copyToArray(...)
>
>     (btw, this avoids, at least in part, some of the type profiling
>     pot-holes highlighted above, as receiver type information is
>     always speculated upon, w/o doing anything special). So, two
>     instances methods copyToArray and copyFromArray don't sound less
>     elegant to me (actually more so)
>
>     * Accepting an Object parameter is inconsistent with e.g. the
>     MemorySegment.of(...) factories, which take sharp array types. In
>     principle those factories could be compressed down into one (in
>     fact the API started off that way, but we went back for
>     performance issues).
>
>
>     in both approaches, I think if we use an instance method we should
>     just drop the length parameter from the method and infer the
>     length from the receiver segment. Here's a table I prepared
>     yesterday night thinking about all possible combinations (this
>     assumed approach (1)):
>
>     * from array to segment
>         - segment does not exist: MemorySegment.of(array) [.asSlice(...)]
>         - segment does exist: segment [.asSlice(...)] .copyFrom(array,
>     index, order)
>     * from segment to array
>         - array does not exist: segment [.asSlice(...)] .toXYZArray()
>         - arrays does exist: segment [.asSlice(...)]
>     .toXYZArray(array, index, order)
>
>     (this still doesn't seem to be that bad - if toXYZArray name
>     bothers you, just replace that with copyToArray - the rest stays
>     the same).
>
>     Maurizio
>
>     On 28/04/2021 01:43, leerho wrote:
>>     Maurizio,
>>
>>     This is great.  You have opened up several new possibilities and
>>     I'd like to explore some of them.
>>
>>     1. The use of /Object/, a la /System.arrayCopy(Object src, int
>>     srcPos, Object dest, int destPos, int length)./
>>     There are several interesting ideas (and concerns) that spin off
>>     of this idea.
>>
>>      1. First, is that System.arrayCopy() is a native method so all
>>         the work is under the covers, which I cannot see (I don't
>>         know how to find the C++ code that implements this method.)
>>           * Advantage: This would greatly reduce the number of
>>             methods needed (as you pointed out).
>>           * Concern: Performance.  I would think we would be moving
>>             work that could be done at compile time (e.g., what type
>>             of array is it?) to run time.  Or would Hotspot figure
>>             this out ?
>>           * Advantage: Potentially this could allow boxed array types
>>             as the native code could just figure this out and do the
>>             right thing.
>>           * Concern: because of its similarity to
>>             /System.arrayCopy()/, the user might be tempted to use it
>>             for array types that are completely inappropriate.  (Just
>>             throw an exception.)
>>      2. This /Object/ idea also led me to: What if we extended the
>>         C++ code underneath arrayCopy() to allow one or both of the
>>         Objects to be a MemorySegment?
>>           * Make it clear that both Pos arguments and the length
>>             argument would have to be in bytes and these arguments
>>             would have to become longs.  This fact alone will kill
>>             this idea as it raises all kinds of issues. Terrible
>>             idea, forget this one!
>>      3. What about this idea. Two /static void arrayCopy-type/
>>         methods in MemorySegment: (or 4 if we overload with ByteOrder):
>>           * /copyToArray(MemorySegment srcSegment, long
>>             srcOffsetBytes, Object dstArray, int dstIndex, int
>>             dstElements)/;
>>           * /copyFromArray(Object srcArray, int srcIndex, int
>>             srcElements, MemorySegment dstSegment, long dstOffsetBytes)/;
>>           * If these are implemented as native methods AND if there
>>             is not a runtime performance burden, THIS would be elegant!
>>
>>     2. "split" the overloads between copyFrom and toXYZ arrays.
>>
>>       * Is this what you had in mind? (7 or 14 of these pairs)
>>           o /int[] dstArray = srcSegment.asSlice(long offsetBytes,
>>             long lengthBytes).toIntArray(int[] dstArray, int
>>             dstIndex, int dstElements)/;
>>             Because toIntArray returns an array, this changes the
>>             behavior whereby the array is provided and returned. Clunky.
>>             This requires specifying the length twice, once in bytes,
>>             the other in elements. :-(
>>           o /dstSegment.asSlice(long offsetBytes, long
>>             lengthBytes).copyFrom(int[] srcArray, int srcIndex, int
>>             srcElements)/;
>>             This also requires specifying the length twice, once in
>>             bytes, the other in elements.  :-(
>>       * Perhaps you meant this? (7 or 14 of these pairs)
>>           o /int[] dstArray = srcSegment.toIntArray(long
>>             srcOffsetBytes, int[] dstArray, int dstIndex, int
>>             dstElements)/;
>>             This still has a clunky return of the dstArray that isn't
>>             really needed. Perhaps this variant would be a void return?
>>           o /dstSegment.copyFrom(int[] srcArray, int srcIndex, int
>>             srcElements, long dstOffsetBytes)/;
>>
>>     I think I like 1.3 the best.
>>
>>       * Only 2 or 4 new methods (let's assume 4)
>>       * But it may require the most under-the-covers work, which
>>         would be in C++.  I'm hoping some of the code from
>>         System.arrayCopy() could be leveraged.
>>
>>     *Note*: I think there is a good reason why MemoryAccess overloads
>>     all the methods with ByteOffset set to the default: and that is
>>     because 90% of the use-cases and users want the default and don't
>>     really want to always be specifying ByteOrder, even if it is only
>>     a static method and quite fast; it would still be viewed as
>>     unnecessary!  If we are choosing between only 2 or 4 added
>>     methods, overloading with the default ByteOrder should not be a
>>     big deal.
>>
>>     Cheers,
>>
>>     Lee.
>>
>>     /
>>     /
>>
>>
>>     On Tue, Apr 27, 2021 at 2:13 PM Maurizio Cimadamore
>>     <maurizio.cimadamore at oracle.com
>>     <mailto:maurizio.cimadamore at oracle.com>> wrote:
>>
>>
>>         On 27/04/2021 21:50, Maurizio Cimadamore wrote:
>>         > is just one native call away, after all.
>>
>>         Whoops - I meant "static" not, "native", of course :-)
>>
>>         Maurizio
>>