[foreign-memaccess+abi] RFR: 8268743: Require a better way for copying data between MemorySegments and on-heap arrays
Maurizio Cimadamore
mcimadamore at openjdk.java.net
Tue Jun 15 14:47:57 UTC 2021
On Tue, 15 Jun 2021 11:56:26 GMT, Maurizio Cimadamore <mcimadamore at openjdk.org> wrote:
> After some investigation, it seems that escape analysis is defeated in cases where a new heap segment is created fresh just before performing a bulk copy.
>
> This is caused by the fact that, on segment creation, we perform this test:
>
>
> static int defaultAccessModes(long size) {
>     return (enableSmallSegments && size < Integer.MAX_VALUE) ?
>             SMALL : 0;
> }
>
>
> This is done to make sure that segments whose size fits in an `int` do not incur the penalties associated with the lack of bound check optimizations in loops over `long` indices.
>
> Unfortunately, this logic is control flow logic, and control flow disables escape analysis optimizations.
>
> For segment wrappers around byte arrays we can work around this by removing the check (all byte segments are small by definition, since there's a 1-1 mapping between logical elements and physical bytes). For other segment kinds we cannot do much.
>
> While it would be possible, in principle, to resort to more complex bound checks for heap segments, we believe the way forward is to eliminate the need for "small" segments, which will be possible once the PR below is completed:
>
> https://github.com/openjdk/jdk/pull/2045
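To make the quoted check concrete, here is a minimal standalone sketch of its shape. The class name `AccessModesSketch`, the value of `SMALL`, the value of the `enableSmallSegments` flag and the `byteArrayAccessModes` helper are all assumptions for illustration; the real definitions live inside the JDK implementation and may differ.

```java
public class AccessModesSketch {
    // Hypothetical values: the real flag and constant are internal to the JDK.
    static final boolean enableSmallSegments = true;
    static final int SMALL = 1 << 8;

    // Same shape as the quoted check: the result depends on a branch over the size.
    static int defaultAccessModes(long size) {
        return (enableSmallSegments && size < Integer.MAX_VALUE) ? SMALL : 0;
    }

    // Hypothetical byte[] special case: one element is one byte, so the byte size
    // always fits in an int and no size test is needed. This sketch assumes that
    // small segments are enabled.
    static int byteArrayAccessModes(byte[] array) {
        return SMALL;
    }

    public static void main(String[] args) {
        // A 16-byte segment is "small".
        System.out.println(defaultAccessModes(16) == SMALL);
        // An int[] with 600_000_000 elements would span 2_400_000_000 bytes,
        // which does not fit in an int, so the segment is not "small".
        long bigSizeInBytes = 600_000_000L * Integer.BYTES;
        System.out.println(defaultAccessModes(bigSizeInBytes) == 0);
        System.out.println(byteArrayAccessModes(new byte[16]) == SMALL);
    }
}
```

The second case shows why the check cannot simply be dropped for segments over arrays of wider element types: their size in bytes can exceed `Integer.MAX_VALUE` even though the array length cannot.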
I'm no longer too sure this will make an actual difference. While JMH shows _something_, digging deeper reveals that, in most iterations, there's absolutely no allocation:
# Run progress: 0.00% complete, ETA 00:00:22
# Fork: 1 of 3
WARNING: Using incubator modules: jdk.incubator.foreign
# Warmup Iteration 1: 8.009 ns/op
# Warmup Iteration 2: 6.775 ns/op
# Warmup Iteration 3: 3.475 ns/op
# Warmup Iteration 4: 6.647 ns/op
# Warmup Iteration 5: 6.924 ns/op
Iteration 1: 7.389 ns/op
·gc.alloc.rate: ≈ 10?? MB/sec
·gc.alloc.rate.norm: ≈ 10?? B/op
·gc.count: ≈ 0 counts
Iteration 2: 6.593 ns/op
·gc.alloc.rate: ≈ 10?? MB/sec
·gc.alloc.rate.norm: ≈ 10?? B/op
·gc.count: ≈ 0 counts
Iteration 3: 6.472 ns/op
·gc.alloc.rate: ≈ 10?? MB/sec
·gc.alloc.rate.norm: ≈ 10?? B/op
·gc.count: ≈ 0 counts
Iteration 4: 7.103 ns/op
·gc.alloc.rate: ≈ 10?? MB/sec
·gc.alloc.rate.norm: ≈ 10?? B/op
·gc.count: ≈ 0 counts
Iteration 5: 6.287 ns/op
·gc.alloc.rate: ≈ 10?? MB/sec
·gc.alloc.rate.norm: ≈ 10?? B/op
·gc.count: ≈ 0 counts
Iteration 6: 6.976 ns/op
·gc.alloc.rate: ≈ 10?? MB/sec
·gc.alloc.rate.norm: ≈ 10?? B/op
·gc.count: ≈ 0 counts
Iteration 7: 6.503 ns/op
·gc.alloc.rate: ≈ 10?? MB/sec
·gc.alloc.rate.norm: ≈ 10?? B/op
·gc.count: ≈ 0 counts
Iteration 8: 7.057 ns/op
·gc.alloc.rate: ≈ 10?? MB/sec
·gc.alloc.rate.norm: ≈ 10?? B/op
·gc.count: ≈ 0 counts
Iteration 9: 6.554 ns/op
·gc.alloc.rate: ≈ 10?? MB/sec
·gc.alloc.rate.norm: ≈ 10?? B/op
·gc.count: ≈ 0 counts
Iteration 10: 6.654 ns/op
·gc.alloc.rate: ≈ 10?? MB/sec
·gc.alloc.rate.norm: ≈ 10?? B/op
·gc.count: ≈ 0 counts
# Run progress: 33.33% complete, ETA 00:00:30
# Fork: 2 of 3
WARNING: Using incubator modules: jdk.incubator.foreign
# Warmup Iteration 1: 7.538 ns/op
# Warmup Iteration 2: 7.192 ns/op
# Warmup Iteration 3: 7.222 ns/op
# Warmup Iteration 4: 7.712 ns/op
# Warmup Iteration 5: 6.882 ns/op
Iteration 1: 7.016 ns/op
·gc.alloc.rate: ≈ 10?? MB/sec
·gc.alloc.rate.norm: ≈ 10?? B/op
·gc.count: ≈ 0 counts
Iteration 2: 7.009 ns/op
·gc.alloc.rate: ≈ 10?? MB/sec
·gc.alloc.rate.norm: ≈ 10?? B/op
·gc.count: ≈ 0 counts
Iteration 3: 6.958 ns/op
·gc.alloc.rate: ≈ 10?? MB/sec
·gc.alloc.rate.norm: ≈ 10?? B/op
·gc.count: ≈ 0 counts
Iteration 4: 7.210 ns/op
·gc.alloc.rate: ≈ 10?? MB/sec
·gc.alloc.rate.norm: ≈ 10?? B/op
·gc.count: ≈ 0 counts
Iteration 5: 7.860 ns/op
·gc.alloc.rate: ≈ 10?? MB/sec
·gc.alloc.rate.norm: ≈ 10?? B/op
·gc.count: ≈ 0 counts
Iteration 6: 7.700 ns/op
·gc.alloc.rate: ≈ 10?? MB/sec
·gc.alloc.rate.norm: ≈ 10?? B/op
·gc.count: ≈ 0 counts
Iteration 7: 3.655 ns/op // <------------------------------------------------------------------------------------
·gc.alloc.rate: 21.009 MB/sec
·gc.alloc.rate.norm: 0.161 B/op
·gc.churn.G1_Eden_Space: 23.975 MB/sec
·gc.churn.G1_Eden_Space.norm: 0.184 B/op
·gc.churn.G1_Survivor_Space: 2.673 MB/sec
·gc.churn.G1_Survivor_Space.norm: 0.021 B/op
·gc.count: 1.000 counts
·gc.time: 2.000 ms
Iteration 8: 7.049 ns/op
·gc.alloc.rate: ≈ 10?? MB/sec
·gc.alloc.rate.norm: ≈ 10?? B/op
·gc.count: ≈ 0 counts
Iteration 9: 6.910 ns/op
·gc.alloc.rate: ≈ 10?? MB/sec
·gc.alloc.rate.norm: ≈ 10?? B/op
·gc.count: ≈ 0 counts
Iteration 10: 7.325 ns/op
·gc.alloc.rate: ≈ 10?? MB/sec
·gc.alloc.rate.norm: ≈ 10?? B/op
·gc.count: ≈ 0 counts
# Run progress: 66.67% complete, ETA 00:00:15
# Fork: 3 of 3
WARNING: Using incubator modules: jdk.incubator.foreign
# Warmup Iteration 1: 7.798 ns/op
# Warmup Iteration 2: 7.576 ns/op
# Warmup Iteration 3: 7.088 ns/op
# Warmup Iteration 4: 7.099 ns/op
# Warmup Iteration 5: 7.232 ns/op
Iteration 1: 7.368 ns/op
·gc.alloc.rate: ≈ 10?? MB/sec
·gc.alloc.rate.norm: ≈ 10?? B/op
·gc.count: ≈ 0 counts
Iteration 2: 7.193 ns/op
·gc.alloc.rate: ≈ 10?? MB/sec
·gc.alloc.rate.norm: ≈ 10?? B/op
·gc.count: ≈ 0 counts
Iteration 3: 6.731 ns/op
·gc.alloc.rate: ≈ 10?? MB/sec
·gc.alloc.rate.norm: ≈ 10?? B/op
·gc.count: ≈ 0 counts
Iteration 4: 7.656 ns/op
·gc.alloc.rate: ≈ 10?? MB/sec
·gc.alloc.rate.norm: ≈ 10?? B/op
·gc.count: ≈ 0 counts
Iteration 5: 7.626 ns/op
·gc.alloc.rate: ≈ 10?? MB/sec
·gc.alloc.rate.norm: ≈ 10?? B/op
·gc.count: ≈ 0 counts
Iteration 6: 7.232 ns/op
·gc.alloc.rate: ≈ 10?? MB/sec
·gc.alloc.rate.norm: ≈ 10?? B/op
·gc.count: ≈ 0 counts
Iteration 7: 7.276 ns/op
·gc.alloc.rate: ≈ 10?? MB/sec
·gc.alloc.rate.norm: ≈ 10?? B/op
·gc.count: ≈ 0 counts
Iteration 8: 7.521 ns/op
·gc.alloc.rate: ≈ 10?? MB/sec
·gc.alloc.rate.norm: ≈ 10?? B/op
·gc.count: ≈ 0 counts
Iteration 9: 6.777 ns/op
·gc.alloc.rate: ≈ 10?? MB/sec
·gc.alloc.rate.norm: ≈ 10?? B/op
·gc.count: ≈ 0 counts
Iteration 10: 7.298 ns/op
·gc.alloc.rate: ≈ 10?? MB/sec
·gc.alloc.rate.norm: ≈ 10?? B/op
·gc.count: ≈ 0 counts
(This benchmark tests the copy in the same order as Lucene does.) Only _one_ iteration has non-zero allocation, presumably because of some kind of race between C2 and GC. But for the most part there's already no allocation here... at least according to JMH. I think we need a benchmark which reproduces the issue more precisely before attempting any fix.
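One rough way to cross-check the JMH allocation numbers outside of JMH is to read the per-thread allocation counter around a hot loop. The sketch below is only an illustration under several assumptions: it uses `ByteBuffer.wrap` as a stand-in for creating a fresh heap segment (it is not the `MemorySegment` API), the class and method names are hypothetical, and it relies on the HotSpot-specific `com.sun.management.ThreadMXBean`, which may be unavailable or disabled on other JVMs.

```java
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;

public class AllocationProbe {
    // Stand-in for "wrap a heap array in a fresh view, then bulk copy".
    static void copyViaFreshWrapper(byte[] src, byte[] dst) {
        ByteBuffer.wrap(src).get(dst, 0, dst.length);
    }

    // Returns the approximate number of bytes allocated per copy by this thread.
    static double allocatedBytesPerOp(int iterations) {
        // HotSpot-specific extension of the standard ThreadMXBean (an assumption
        // about the JVM in use; the cast fails on JVMs that do not provide it).
        com.sun.management.ThreadMXBean bean =
                (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();
        long tid = Thread.currentThread().getId();
        byte[] src = new byte[64];
        byte[] dst = new byte[64];

        // Warm up so that the JIT compiler has a chance to compile the copy path.
        for (int i = 0; i < 200_000; i++) {
            copyViaFreshWrapper(src, dst);
        }

        long before = bean.getThreadAllocatedBytes(tid);
        for (int i = 0; i < iterations; i++) {
            copyViaFreshWrapper(src, dst);
        }
        long after = bean.getThreadAllocatedBytes(tid);
        return (after - before) / (double) iterations;
    }

    public static void main(String[] args) {
        System.out.println("approx. bytes allocated per op: "
                + allocatedBytesPerOp(1_000_000));
    }
}
```

A value close to zero would suggest that the wrapper is being scalar-replaced in this particular loop; a value near the wrapper's object size would suggest that it is not. Either way this is only a coarse probe, and it does not replace a JMH benchmark that mirrors the real call site.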
-------------
PR: https://git.openjdk.java.net/panama-foreign/pull/560