[foreign-memaccess+abi] RFR: 8268743: Require a better way for copying data between MemorySegments and on-heap arrays

Wed Jun 16 09:44:59 UTC 2021

On Tue, 15 Jun 2021 11:56:26 GMT, Maurizio Cimadamore <mcimadamore at openjdk.org> wrote:

> After some investigation, it seems that escape analysis is defeated in cases where a new heap segment is created fresh just before performing a bulk copy.
> 
> This is caused by the fact that, on segment creation, we perform this test:
> 
> 
> static int defaultAccessModes(long size) {
>     return (enableSmallSegments && size < Integer.MAX_VALUE) ?
>             SMALL : 0;
> }
> 
> 
> This ensures that segments whose size fits in an `int` do not incur the penalties associated with missing bound check optimizations for loops with `long` bounds.
> 
> Unfortunately, this logic is control flow logic, and control flow disables escape analysis optimizations.
> 
> For segment wrappers around byte arrays we can work around this by removing the check (all byte segments are small by definition, since there's a 1-1 mapping between logical elements and physical bytes). For other segment kinds we cannot do much.
> 
> While it would be possible, in principle, to resort to more complex bound checks for heap segments, we believe the way forward is to eliminate the need for "small" segments, which will be possible once the PR below is completed:
> 
> https://github.com/openjdk/jdk/pull/2045

Minor breakthrough. I explored further the assumption that the copy code in Lucene is not hot enough for C2 to kick in, and started to play with JMH's `@CompilerControl(EXCLUDE)` directive. Performance dips significantly (~10x), but what's striking is that I finally see lots of allocations when using slicing and no allocations when using the static copy method:

Benchmark                                                                Mode  Cnt    Score    Error   Units
TestSmallCopy.segment_small_copy_slice                                   avgt   30  157.831 ±  4.524   ns/op
TestSmallCopy.segment_small_copy_slice:·gc.alloc.rate                    avgt   30  241.978 ±  7.438  MB/sec
TestSmallCopy.segment_small_copy_slice:·gc.alloc.rate.norm               avgt   30   80.007 ±  0.002    B/op
TestSmallCopy.segment_small_copy_slice:·gc.churn.G1_Eden_Space           avgt   30  246.372 ± 74.872  MB/sec
TestSmallCopy.segment_small_copy_slice:·gc.churn.G1_Eden_Space.norm      avgt   30   81.763 ± 24.998    B/op
TestSmallCopy.segment_small_copy_slice:·gc.churn.G1_Survivor_Space       avgt   30    0.001 ±  0.002  MB/sec
TestSmallCopy.segment_small_copy_slice:·gc.churn.G1_Survivor_Space.norm  avgt   30   ≈ 10⁻?             B/op
TestSmallCopy.segment_small_copy_slice:·gc.count                         avgt   30   25.000           counts
TestSmallCopy.segment_small_copy_slice:·gc.time                          avgt   30   23.000               ms

vs:

Benchmark                                                    Mode  Cnt    Score    Error   Units
TestSmallCopy.segment_small_copy_static                      avgt   30  100.394 ±  7.448   ns/op
TestSmallCopy.segment_small_copy_static:·gc.alloc.rate       avgt   30   ≈ 10⁻?           MB/sec
TestSmallCopy.segment_small_copy_static:·gc.alloc.rate.norm  avgt   30   ≈ 10⁻?             B/op
TestSmallCopy.segment_small_copy_static:·gc.count            avgt   30      ≈ 0           counts

This might well explain what @uschindler is seeing. If the code is not hot, EA won't even run, which would explain the allocations. If that is the case, we should be far less concerned with the cost of a single liveness check and more worried about the cost of introducing that many allocations.
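
To make the difference between the two copy idioms concrete, here is a minimal sketch in plain Java. It does not use the real `MemorySegment` API: `Segment`, `copyViaSlices` and `copyViaStatic` are hypothetical stand-ins invented for illustration, and the benchmark bodies in `TestSmallCopy` may well look different. The point is only that the slicing idiom creates short-lived wrapper objects that must be removed by escape analysis, while the static idiom creates none in the first place:

```java
import java.util.Arrays;

public class SmallCopySketch {

    /** Hypothetical stand-in for a heap segment: a bounds-checked view over a byte[]. */
    static final class Segment {
        final byte[] base;
        final int offset;
        final int length;

        Segment(byte[] base, int offset, int length) {
            if (offset < 0 || length < 0 || offset > base.length - length) {
                throw new IndexOutOfBoundsException();
            }
            this.base = base;
            this.offset = offset;
            this.length = length;
        }

        static Segment ofArray(byte[] arr) {
            return new Segment(arr, 0, arr.length);
        }

        /** Every slice is a fresh wrapper object; only EA can make it disappear. */
        Segment asSlice(int off, int len) {
            if (off < 0 || len < 0 || off > length - len) {
                throw new IndexOutOfBoundsException();
            }
            return new Segment(base, offset + off, len);
        }

        void copyFrom(Segment src) {
            if (src.length > length) {
                throw new IndexOutOfBoundsException();
            }
            System.arraycopy(src.base, src.offset, base, offset, src.length);
        }

        /** Static copy: same bounds checks, but no wrapper or slice is created. */
        static void copy(Segment src, int srcOff, byte[] dst, int dstOff, int len) {
            if (srcOff < 0 || dstOff < 0 || len < 0
                    || srcOff > src.length - len || dstOff > dst.length - len) {
                throw new IndexOutOfBoundsException();
            }
            System.arraycopy(src.base, src.offset + srcOff, dst, dstOff, len);
        }
    }

    /** Slicing idiom: three temporary Segment objects per call in this sketch. */
    static void copyViaSlices(Segment src, int srcOff, byte[] dst, int dstOff, int len) {
        Segment.ofArray(dst).asSlice(dstOff, len).copyFrom(src.asSlice(srcOff, len));
    }

    /** Static idiom: no temporary objects. */
    static void copyViaStatic(Segment src, int srcOff, byte[] dst, int dstOff, int len) {
        Segment.copy(src, srcOff, dst, dstOff, len);
    }

    public static void main(String[] args) {
        Segment src = Segment.ofArray(new byte[] {1, 2, 3, 4, 5, 6, 7, 8});
        byte[] a = new byte[4];
        byte[] b = new byte[4];
        copyViaSlices(src, 2, a, 1, 3);
        copyViaStatic(src, 2, b, 1, 3);
        System.out.println(Arrays.toString(a) + " " + Arrays.toString(b));
    }
}
```

Both methods produce the same result; they differ only in how much work is left to the JIT. When the method is compiled by C2 and everything inlines, the temporaries in `copyViaSlices` can be scalar-replaced. When it is not compiled (as with `@CompilerControl(EXCLUDE)`, or code that never gets hot), each temporary is a real heap allocation.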

-------------

PR: https://git.openjdk.java.net/panama-foreign/pull/560