[foreign-memaccess+abi] RFR: 8268743: Require a better way for copying data between MemorySegments and on-heap arrays
Maurizio Cimadamore
mcimadamore at openjdk.java.net
Wed Jun 16 09:44:59 UTC 2021
On Tue, 15 Jun 2021 11:56:26 GMT, Maurizio Cimadamore <mcimadamore at openjdk.org> wrote:
> After some investigation, it seems that escape analysis is defeated in cases where a new heap segment is created fresh just before performing a bulk copy.
>
> This is caused by the fact that, on segment creation, we perform this test:
>
>
>     static int defaultAccessModes(long size) {
>         return (enableSmallSegments && size < Integer.MAX_VALUE) ?
>                 SMALL : 0;
>     }
>
>
> This ensures that segments whose size fits in an `int` do not incur the penalties associated with missing bound check optimizations in loops with `long` bounds.
>
> Unfortunately, this check introduces control flow, and control flow disables escape analysis optimizations.
>
> For segment wrappers around byte arrays we can work around this by removing the check (all byte segments are small by definition, since there's a 1-1 mapping between logical elements and physical bytes). For other segment kinds we cannot do much.
>
> While it would be possible, in principle, to resort to more complex bound checks for heap segments, we believe the way forward is to eliminate the need for "small" segments, which will be possible once the PR below is completed:
>
> https://github.com/openjdk/jdk/pull/2045
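
To make the byte-array workaround described above concrete, here is a minimal sketch. It is not the actual implementation: the class name `AccessModesSketch`, the value of `SMALL`, the default of `enableSmallSegments`, and the helper `byteArrayAccessModes` are all assumptions made for illustration. The only point is that the size comparison can be dropped when the backing array is a `byte[]`, because its length is already an `int`.

```java
// Hypothetical sketch; names and constant values are assumptions, not the real code.
final class AccessModesSketch {
    static final int SMALL = 1 << 8;                  // placeholder flag value
    static final boolean enableSmallSegments = true;  // assumed default

    // General case: the result depends on a runtime size comparison (a branch).
    static int defaultAccessModes(long size) {
        return (enableSmallSegments && size < Integer.MAX_VALUE) ? SMALL : 0;
    }

    // Byte-array case: one element is one byte and the length is an int,
    // so the size test on the segment length is not needed.
    static int byteArrayAccessModes(byte[] array) {
        return enableSmallSegments ? SMALL : 0;
    }

    public static void main(String[] args) {
        System.out.println(defaultAccessModes(16) == SMALL);
        System.out.println(byteArrayAccessModes(new byte[16]) == SMALL);
    }
}
```
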
Minor breakthrough. I further explored the assumption that the copy code in Lucene is not hot enough for C2 to kick in, and started to play with JMH's `@CompilerControl(EXCLUDE)` directive. Performance drops significantly (~10x), but what's striking is that I finally see lots of allocations when using slicing and no allocations when using the static copy method:
Benchmark                                                                Mode  Cnt     Score    Error  Units
TestSmallCopy.segment_small_copy_slice                                   avgt   30   157.831 ±  4.524  ns/op
TestSmallCopy.segment_small_copy_slice:·gc.alloc.rate                    avgt   30   241.978 ±  7.438  MB/sec
TestSmallCopy.segment_small_copy_slice:·gc.alloc.rate.norm               avgt   30    80.007 ±  0.002  B/op
TestSmallCopy.segment_small_copy_slice:·gc.churn.G1_Eden_Space           avgt   30   246.372 ± 74.872  MB/sec
TestSmallCopy.segment_small_copy_slice:·gc.churn.G1_Eden_Space.norm      avgt   30    81.763 ± 24.998  B/op
TestSmallCopy.segment_small_copy_slice:·gc.churn.G1_Survivor_Space       avgt   30     0.001 ±  0.002  MB/sec
TestSmallCopy.segment_small_copy_slice:·gc.churn.G1_Survivor_Space.norm  avgt   30    ≈ 10⁻?           B/op
TestSmallCopy.segment_small_copy_slice:·gc.count                         avgt   30    25.000           counts
TestSmallCopy.segment_small_copy_slice:·gc.time                          avgt   30    23.000           ms
vs:
Benchmark                                                    Mode  Cnt     Score    Error  Units
TestSmallCopy.segment_small_copy_static                      avgt   30   100.394 ±  7.448  ns/op
TestSmallCopy.segment_small_copy_static:·gc.alloc.rate       avgt   30    ≈ 10⁻?           MB/sec
TestSmallCopy.segment_small_copy_static:·gc.alloc.rate.norm  avgt   30    ≈ 10⁻?           B/op
TestSmallCopy.segment_small_copy_static:·gc.count            avgt   30       ≈ 0           counts
This might well explain what @uschindler is seeing. If the code is not hot, EA won't even run, which would explain the allocation. So, if this is the case, we should be far less concerned with the cost of single liveness checks, and more worried about the cost of introducing that many allocations.
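
For readers unfamiliar with the two copy styles being compared, the following is a rough analogy rather than the benchmark code. It uses `java.nio.ByteBuffer` as a stand-in for the segment API, and the class and method names (`CopyStylesSketch`, `copyViaSlice`, `copyStatic`) are made up for illustration. The slice style creates short-lived view objects on every call and relies on escape analysis to remove them; the static style passes arrays and offsets directly, so there is nothing to allocate even when the code runs unoptimized.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Analogy only: ByteBuffer views stand in for heap segments and their slices.
final class CopyStylesSketch {

    // Slice style: wrap() and slice() each create a view object per call.
    // Whether those objects are really allocated depends on the JIT.
    static void copyViaSlice(byte[] src, int srcOffset, byte[] dst, int dstOffset, int length) {
        ByteBuffer view = ByteBuffer.wrap(src, srcOffset, length).slice();
        view.get(dst, dstOffset, length);
    }

    // Static style: no intermediate view objects are created.
    static void copyStatic(byte[] src, int srcOffset, byte[] dst, int dstOffset, int length) {
        System.arraycopy(src, srcOffset, dst, dstOffset, length);
    }

    public static void main(String[] args) {
        byte[] src = {1, 2, 3, 4, 5, 6, 7, 8};
        byte[] a = new byte[4];
        byte[] b = new byte[4];
        copyViaSlice(src, 2, a, 0, 4);
        copyStatic(src, 2, b, 0, 4);
        System.out.println(Arrays.equals(a, b)); // both styles copy the same bytes
    }
}
```
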
-------------
PR: https://git.openjdk.java.net/panama-foreign/pull/560