compose MemorySegments

Fri Jun 11 10:47:11 UTC 2021

On 11/06/2021 00:24, Douglas Surber wrote:
>
> Maurizio,
>
> I can certainly respect a decision that composing multiple 
> MemorySegments might be out of scope. Without composition I would 
> write something like this.
>
>     MemorySegment.ofArray(dest)
>         .asSlice(destOffset, firstPartLength)
>         .copyFrom(MemorySegment.ofArray(src0)
>             .asSlice(srcOffset, firstPartLength));
>     MemorySegment.ofArray(dest)
>         .asSlice(destOffset + firstPartLength, secondPartLength)
>         .copyFrom(MemorySegment.ofArray(src1).asSlice(0L, secondPartLength));
>
>
> This would copy the bits from the end of src0 into the first part of 
> dest and the bits from the beginning of src1 into the second part of 
> dest. Would all this result in just two SIMD instructions modulo 
> bounds checking? And no allocations?

Short answer is, yes, possibly.

Slightly longer answer - as I mentioned, we're working to offer different 
helpers to copy data from segments to arrays and vice versa, so that the 
code above can be made significantly more readable.
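
To give a rough idea of what a helper-based version could look like, here 
is a minimal sketch. It assumes a later JDK (22 or newer), where the API 
lives in java.lang.foreign and offers a static MemorySegment.copy 
overload taking source segment, source offset, destination segment, 
destination offset and byte count. The class name CopyHelperSketch, the 
copyBytes method and the sample arrays are made up for illustration.

```java
import java.lang.foreign.MemorySegment;
import java.util.Arrays;

public class CopyHelperSketch {

    // Hypothetical convenience method (not part of the JDK): copies `length`
    // bytes starting at src[srcOffset] into dest starting at dest[destOffset].
    public static void copyBytes(byte[] src, int srcOffset,
                                 byte[] dest, int destOffset, int length) {
        // Static bulk copy between two heap segments; no explicit slices needed.
        MemorySegment.copy(
                MemorySegment.ofArray(src), srcOffset,
                MemorySegment.ofArray(dest), destOffset,
                length);
    }

    public static void main(String[] args) {
        byte[] src0 = {1, 2, 3, 4, 5};
        byte[] src1 = {6, 7, 8, 9};
        byte[] dest = new byte[6];

        copyBytes(src0, 2, dest, 0, 3); // tail of src0 -> first part of dest
        copyBytes(src1, 0, dest, 3, 2); // head of src1 -> second part of dest

        System.out.println(Arrays.toString(dest)); // prints [3, 4, 5, 6, 7, 0]
    }
}
```

The offsets and the length are passed directly to the copy call, so no 
intermediate slice objects appear in the source code at all.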

Looking a bit further down the road, memory segments are effectively 
immutable (the bit of mutable state which tells whether a resource is 
released or not is in the so-called "resource scope", not in a 
MemorySegment). So, right now, we can sometimes take advantage of escape 
analysis to eliminate allocation of segment slices (e.g. if a segment 
doesn't "escape", it can often be scalarized into registers). Of course 
escape analysis doesn't always work, especially when a method contains 
control flow. But Valhalla will give us the ability to implement the 
MemorySegment interface with a true "primitive class" - for which 
allocation behavior could be much more predictable.

So, if the Java code surrounding the bulk copy is optimized enough, you 
do get a pretty optimized bulk copy. In your specific case, there's no 
control flow, and the temporary slices you create are only used inside 
the copy method - which makes me think that stuff like this should 
already perform decently, assuming the code above gets inlined by C2.
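
For reference, the two-part copy being discussed can be written as a 
small straight-line method, sketched below. This is only an illustration 
under a few assumptions: byte[] arrays, a JDK 22 or newer where the API 
is in java.lang.foreign, and asSlice taking an offset and a length. The 
names TwoPartCopySketch and copyTwoParts are hypothetical. The sketch 
shows the shape of the code only; it does not by itself demonstrate that 
C2 eliminates the slice allocations.

```java
import java.lang.foreign.MemorySegment;
import java.util.Arrays;

public class TwoPartCopySketch {

    // Copies firstPartLength bytes from src0 (starting at srcOffset) and then
    // secondPartLength bytes from the start of src1 into dest at destOffset.
    public static void copyTwoParts(byte[] src0, long srcOffset, long firstPartLength,
                                    byte[] src1, long secondPartLength,
                                    byte[] dest, long destOffset) {
        MemorySegment destSegment = MemorySegment.ofArray(dest);

        // The temporary slices are used only as arguments to copyFrom and
        // never stored, and there is no control flow in this method.
        destSegment.asSlice(destOffset, firstPartLength)
                .copyFrom(MemorySegment.ofArray(src0).asSlice(srcOffset, firstPartLength));

        destSegment.asSlice(destOffset + firstPartLength, secondPartLength)
                .copyFrom(MemorySegment.ofArray(src1).asSlice(0L, secondPartLength));
    }

    public static void main(String[] args) {
        byte[] src0 = {10, 11, 12, 13};
        byte[] src1 = {20, 21, 22};
        byte[] dest = new byte[7];

        copyTwoParts(src0, 1, 3, src1, 2, dest, 1);

        System.out.println(Arrays.toString(dest)); // prints [0, 11, 12, 13, 20, 21, 0]
    }
}
```

Note that copyFrom requires the source and destination slices to be 
compatible in size, which is why both slices in each pair are created 
with the same length.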

Here's a benchmark I've tweaked in the past to show the (non) cost of 
slicing prior to a bulk op:

https://mail.openjdk.java.net/pipermail/panama-dev/2021-April/012889.html

and

https://mail.openjdk.java.net/pipermail/panama-dev/2021-April/012897.html

The first shows throughput, the latter allocation rate, which is zero 
for both unsafe and the memory segment APIs.

This doesn't mean that in _all_ cases allocations will be eliminated 
(see above), but we're in relatively good shape, and we have plans to 
make allocations even less expensive as new VM features are rolled out.

Maurizio

>
> Douglas