[foreign-memaccess+abi] RFR: 8268743: Require a better way for copying data between MemorySegments and on-heap arrays

Wed Jun 16 09:47:49 UTC 2021

On Wed, 16 Jun 2021 09:42:14 GMT, Uwe Schindler <uschindler at openjdk.org> wrote:

>> After some investigation, it seems that escape analysis is defeated in cases where a new heap segment is created fresh just before performing a bulk copy.
>> 
>> This is caused by the fact that, on segment creation, we perform this test:
>> 
>> 
>> static int defaultAccessModes(long size) {
>>         return (enableSmallSegments && size < Integer.MAX_VALUE) ?
>>                 SMALL : 0;
>>     }
>> 
>> 
>> To make sure that segments whose size fits in an `int` do not incur in penalties associated with lack of optimizations over long loop bound check optimizations.
>> 
>> Unfortunately, this logic is control flow logic, and control flow disables escape analysis optimizations.
>> 
>> For segment wrappers around byte arrays we can workaround by removing the check (all byte segments are small by definition, since there's a 1-1 mapping between logical elements and physical bytes). For other segment kinds we cannot do much.
>> 
>> While it would be possible, in principle, to resort to more complex bound checks for heap segments, we believe the way forward is to eliminate the need for "small" segments, which will be possible once the PR below is completed:
>> 
>> https://github.com/openjdk/jdk/pull/2045
>
> Hi Maurizio,
> it's not possible today, because I have a talk at a conference later, but tomorrrow, I can for sure test the repo provided by you.
> 
> To call the new static MemorySegment.copy() method, I still need to wrap the byte[] array, so I should merge in THIS pull request? From looking at your repository it does not seem to have the HeapMemorySegment SMALL change.
> 
> Maybe you can also add this small change from this PR also to your benchmark repo.
> 
> About the liveness check: In my unsafe copy implementation, I called `segment.address().toRawLongAddress()`, so it was also doing the liveness checks. This had no effect on the performance. The only difference is the additional fencing after the whole copyMemory operation in ScopedMemoryAccess.
> 
>> The issue with memory copy is that memory copy _is_ typically used in a
>> standalone fashion, e.g. outside of a loop. I'm pretty sure that if I
>> repeat the benchmark by wrapping it in a loop and do like 100
>> iterations, the difference with unsafe will disappear. But that's _not_
>> how users will use `copyFrom`.
> 
> In most cases this is true. Actually we have sometimes a loop, but very seldom: We only use mmap when reading index files, but not for writing new index files, which are just streamed to a classical OutputStream. Sometimes we have to copy data to an OutputStream that was previously read from a MMapped InputStream (happens when merging Lucene index segments). In that case the code will call IndexInput.readBytes(...) and pass the byte array to OutputStream in the usual and well-known copy-loop. But that's of course not the usual case (especially not in our benchmark, because it does not build new indexes); it just executes queries.

> This might well explain what @uschindler is seeing. If the code is not hot, EA won't even run, which would explain the allocation. So, if this is the case, we're far less concerned with the cost of single liveness checks, but we'd be more worried about the cost of introducing that many allocations.

So would the new MemoryCopy class from the other pull request help, because it has `@ForceInline`?

-------------

PR: https://git.openjdk.java.net/panama-foreign/pull/560