[foreign-memaccess+abi] RFR: 8268743: Require a better way for copying data between MemorySegments and on-heap arrays

Tue Jun 15 14:47:57 UTC 2021

On Tue, 15 Jun 2021 11:56:26 GMT, Maurizio Cimadamore <mcimadamore at openjdk.org> wrote:

> After some investigation, it seems that escape analysis is defeated in cases where a new heap segment is created fresh just before performing a bulk copy.
> 
> This is caused by the fact that, on segment creation, we perform this test:
> 
> 
> static int defaultAccessModes(long size) {
>     return (enableSmallSegments && size < Integer.MAX_VALUE) ?
>             SMALL : 0;
> }
> 
> 
> This is done to make sure that segments whose size fits in an `int` do not incur the penalties associated with the lack of bound check optimizations for `long` loops.
> 
> Unfortunately, this logic is control flow logic, and control flow disables escape analysis optimizations.
> 
> For segment wrappers around byte arrays we can work around this by removing the check (all byte segments are small by definition, since there's a 1-1 mapping between logical elements and physical bytes). For other segment kinds we cannot do much.
> 
> While it would be possible, in principle, to resort to more complex bound checks for heap segments, we believe the way forward is to eliminate the need for "small" segments, which will be possible once the PR below is completed:
> 
> https://github.com/openjdk/jdk/pull/2045
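
To illustrate the size check quoted above, here is a minimal, self-contained sketch. It is not the JDK source: the class name, the value of `SMALL`, the constant `ENABLE_SMALL_SEGMENTS`, and the two array helper methods are assumptions made for illustration only. It shows why the byte array case can skip the branch while, for example, an `int[]` cannot.

```java
// Hypothetical stand-in for the access-mode logic quoted above; not the JDK source.
final class SmallSegmentCheck {
    // Assumed flag value, chosen only for illustration.
    static final int SMALL = 1 << 8;
    // Assumed to be a constant switch; the quoted code calls it enableSmallSegments.
    static final boolean ENABLE_SMALL_SEGMENTS = true;

    // General case: a branch on the byte size, as in the quoted snippet.
    static int defaultAccessModes(long size) {
        return (ENABLE_SMALL_SEGMENTS && size < Integer.MAX_VALUE) ? SMALL : 0;
    }

    // byte[]: one logical element is one physical byte, so the byte size is the
    // (int) array length and no size comparison is needed.
    static int byteArrayAccessModes(byte[] array) {
        return ENABLE_SMALL_SEGMENTS ? SMALL : 0;
    }

    // int[]: the byte size is length * 4, which can exceed Integer.MAX_VALUE,
    // so the size comparison has to stay.
    static int intArrayAccessModes(int[] array) {
        return defaultAccessModes((long) array.length * Integer.BYTES);
    }
}
```

With these assumptions, `defaultAccessModes(1024L)` yields `SMALL`, while a byte size of `(1L << 30) * Integer.BYTES` (a very large `int[]`) yields `0`.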

I'm no longer sure this will make an actual difference. While JMH shows _something_, digging deeper reveals that in most iterations there is no allocation at all:

# Run progress: 0.00% complete, ETA 00:00:22
# Fork: 1 of 3
WARNING: Using incubator modules: jdk.incubator.foreign
# Warmup Iteration   1: 8.009 ns/op
# Warmup Iteration   2: 6.775 ns/op
# Warmup Iteration   3: 3.475 ns/op
# Warmup Iteration   4: 6.647 ns/op
# Warmup Iteration   5: 6.924 ns/op
Iteration   1: 7.389 ns/op
                 ·gc.alloc.rate:      ≈ 10?? MB/sec
                 ·gc.alloc.rate.norm: ≈ 10?? B/op
                 ·gc.count:           ≈ 0 counts

Iteration   2: 6.593 ns/op
                 ·gc.alloc.rate:      ≈ 10?? MB/sec
                 ·gc.alloc.rate.norm: ≈ 10?? B/op
                 ·gc.count:           ≈ 0 counts

Iteration   3: 6.472 ns/op
                 ·gc.alloc.rate:      ≈ 10?? MB/sec
                 ·gc.alloc.rate.norm: ≈ 10?? B/op
                 ·gc.count:           ≈ 0 counts

Iteration   4: 7.103 ns/op
                 ·gc.alloc.rate:      ≈ 10?? MB/sec
                 ·gc.alloc.rate.norm: ≈ 10?? B/op
                 ·gc.count:           ≈ 0 counts

Iteration   5: 6.287 ns/op
                 ·gc.alloc.rate:      ≈ 10?? MB/sec
                 ·gc.alloc.rate.norm: ≈ 10?? B/op
                 ·gc.count:           ≈ 0 counts

Iteration   6: 6.976 ns/op
                 ·gc.alloc.rate:      ≈ 10?? MB/sec
                 ·gc.alloc.rate.norm: ≈ 10?? B/op
                 ·gc.count:           ≈ 0 counts

Iteration   7: 6.503 ns/op
                 ·gc.alloc.rate:      ≈ 10?? MB/sec
                 ·gc.alloc.rate.norm: ≈ 10?? B/op
                 ·gc.count:           ≈ 0 counts

Iteration   8: 7.057 ns/op
                 ·gc.alloc.rate:      ≈ 10?? MB/sec
                 ·gc.alloc.rate.norm: ≈ 10?? B/op
                 ·gc.count:           ≈ 0 counts

Iteration   9: 6.554 ns/op
                 ·gc.alloc.rate:      ≈ 10?? MB/sec
                 ·gc.alloc.rate.norm: ≈ 10?? B/op
                 ·gc.count:           ≈ 0 counts

Iteration  10: 6.654 ns/op
                 ·gc.alloc.rate:      ≈ 10?? MB/sec
                 ·gc.alloc.rate.norm: ≈ 10?? B/op
                 ·gc.count:           ≈ 0 counts

# Run progress: 33.33% complete, ETA 00:00:30
# Fork: 2 of 3
WARNING: Using incubator modules: jdk.incubator.foreign
# Warmup Iteration   1: 7.538 ns/op
# Warmup Iteration   2: 7.192 ns/op
# Warmup Iteration   3: 7.222 ns/op
# Warmup Iteration   4: 7.712 ns/op
# Warmup Iteration   5: 6.882 ns/op
Iteration   1: 7.016 ns/op
                 ·gc.alloc.rate:      ≈ 10?? MB/sec
                 ·gc.alloc.rate.norm: ≈ 10?? B/op
                 ·gc.count:           ≈ 0 counts

Iteration   2: 7.009 ns/op
                 ·gc.alloc.rate:      ≈ 10?? MB/sec
                 ·gc.alloc.rate.norm: ≈ 10?? B/op
                 ·gc.count:           ≈ 0 counts

Iteration   3: 6.958 ns/op
                 ·gc.alloc.rate:      ≈ 10?? MB/sec
                 ·gc.alloc.rate.norm: ≈ 10?? B/op
                 ·gc.count:           ≈ 0 counts

Iteration   4: 7.210 ns/op
                 ·gc.alloc.rate:      ≈ 10?? MB/sec
                 ·gc.alloc.rate.norm: ≈ 10?? B/op
                 ·gc.count:           ≈ 0 counts

Iteration   5: 7.860 ns/op
                 ·gc.alloc.rate:      ≈ 10?? MB/sec
                 ·gc.alloc.rate.norm: ≈ 10?? B/op
                 ·gc.count:           ≈ 0 counts

Iteration   6: 7.700 ns/op
                 ·gc.alloc.rate:      ≈ 10?? MB/sec
                 ·gc.alloc.rate.norm: ≈ 10?? B/op
                 ·gc.count:           ≈ 0 counts

Iteration   7: 3.655 ns/op   // <------------------------------------------------------------------------------------
                 ·gc.alloc.rate:                   21.009 MB/sec
                 ·gc.alloc.rate.norm:              0.161 B/op
                 ·gc.churn.G1_Eden_Space:          23.975 MB/sec
                 ·gc.churn.G1_Eden_Space.norm:     0.184 B/op
                 ·gc.churn.G1_Survivor_Space:      2.673 MB/sec
                 ·gc.churn.G1_Survivor_Space.norm: 0.021 B/op
                 ·gc.count:                        1.000 counts
                 ·gc.time:                         2.000 ms

Iteration   8: 7.049 ns/op
                 ·gc.alloc.rate:      ≈ 10?? MB/sec
                 ·gc.alloc.rate.norm: ≈ 10?? B/op
                 ·gc.count:           ≈ 0 counts

Iteration   9: 6.910 ns/op
                 ·gc.alloc.rate:      ≈ 10?? MB/sec
                 ·gc.alloc.rate.norm: ≈ 10?? B/op
                 ·gc.count:           ≈ 0 counts

Iteration  10: 7.325 ns/op
                 ·gc.alloc.rate:      ≈ 10?? MB/sec
                 ·gc.alloc.rate.norm: ≈ 10?? B/op
                 ·gc.count:           ≈ 0 counts

# Run progress: 66.67% complete, ETA 00:00:15
# Fork: 3 of 3
WARNING: Using incubator modules: jdk.incubator.foreign
# Warmup Iteration   1: 7.798 ns/op
# Warmup Iteration   2: 7.576 ns/op
# Warmup Iteration   3: 7.088 ns/op
# Warmup Iteration   4: 7.099 ns/op
# Warmup Iteration   5: 7.232 ns/op
Iteration   1: 7.368 ns/op
                 ·gc.alloc.rate:      ≈ 10?? MB/sec
                 ·gc.alloc.rate.norm: ≈ 10?? B/op
                 ·gc.count:           ≈ 0 counts

Iteration   2: 7.193 ns/op
                 ·gc.alloc.rate:      ≈ 10?? MB/sec
                 ·gc.alloc.rate.norm: ≈ 10?? B/op
                 ·gc.count:           ≈ 0 counts

Iteration   3: 6.731 ns/op
                 ·gc.alloc.rate:      ≈ 10?? MB/sec
                 ·gc.alloc.rate.norm: ≈ 10?? B/op
                 ·gc.count:           ≈ 0 counts

Iteration   4: 7.656 ns/op
                 ·gc.alloc.rate:      ≈ 10?? MB/sec
                 ·gc.alloc.rate.norm: ≈ 10?? B/op
                 ·gc.count:           ≈ 0 counts

Iteration   5: 7.626 ns/op
                 ·gc.alloc.rate:      ≈ 10?? MB/sec
                 ·gc.alloc.rate.norm: ≈ 10?? B/op
                 ·gc.count:           ≈ 0 counts

Iteration   6: 7.232 ns/op
                 ·gc.alloc.rate:      ≈ 10?? MB/sec
                 ·gc.alloc.rate.norm: ≈ 10?? B/op
                 ·gc.count:           ≈ 0 counts

Iteration   7: 7.276 ns/op
                 ·gc.alloc.rate:      ≈ 10?? MB/sec
                 ·gc.alloc.rate.norm: ≈ 10?? B/op
                 ·gc.count:           ≈ 0 counts

Iteration   8: 7.521 ns/op
                 ·gc.alloc.rate:      ≈ 10?? MB/sec
                 ·gc.alloc.rate.norm: ≈ 10?? B/op
                 ·gc.count:           ≈ 0 counts

Iteration   9: 6.777 ns/op
                 ·gc.alloc.rate:      ≈ 10?? MB/sec
                 ·gc.alloc.rate.norm: ≈ 10?? B/op
                 ·gc.count:           ≈ 0 counts

Iteration  10: 7.298 ns/op
                 ·gc.alloc.rate:      ≈ 10?? MB/sec
                 ·gc.alloc.rate.norm: ≈ 10?? B/op
                 ·gc.count:           ≈ 0 counts

(This benchmark tests the copy in the same order as Lucene does it.) Only _one_ iteration has non-zero allocation, presumably because of some kind of race between C2 and GC. For the most part there is already no allocation here, at least according to JMH. I think we need a benchmark which reproduces the issue more precisely before attempting any fix.
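
One way to cross-check JMH's per-operation allocation numbers outside the harness is to read the current thread's allocated-bytes counter around a loop. The sketch below is only a rough illustration, not the benchmark used above: it uses `ByteBuffer.wrap` as a stand-in for creating a fresh heap segment before a bulk copy (an assumption, since it does not touch the `MemorySegment` API), and the class and method names are hypothetical. It relies on `com.sun.management.ThreadMXBean.getThreadAllocatedBytes`, which is HotSpot-specific and may return `-1` if the feature is unsupported or disabled.

```java
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;

// Hypothetical helper; ByteBuffer stands in for a freshly created heap segment.
final class AllocationProbe {
    // Wraps both arrays in fresh objects, then bulk-copies src into dst.
    // dst must be at least as long as src.
    static void copyViaFreshWrapper(byte[] src, byte[] dst) {
        ByteBuffer.wrap(dst).put(ByteBuffer.wrap(src));
    }

    // Average number of bytes allocated by the current thread per copy call.
    // Early iterations run in the interpreter, so a warm-up call is advisable.
    static double bytesPerOp(byte[] src, byte[] dst, int iterations) {
        com.sun.management.ThreadMXBean bean =
                (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();
        long tid = Thread.currentThread().getId();
        long before = bean.getThreadAllocatedBytes(tid);
        for (int i = 0; i < iterations; i++) {
            copyViaFreshWrapper(src, dst);
        }
        long after = bean.getThreadAllocatedBytes(tid);
        return (after - before) / (double) iterations;
    }
}
```

If the wrappers are scalar-replaced after warm-up, a second call to `bytesPerOp` with a large iteration count should report a value close to zero; a value near the wrapper object size would suggest that escape analysis did not kick in.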

-------------

PR: https://git.openjdk.java.net/panama-foreign/pull/560