Memory Segment efficient array handling
Uwe Schindler
uschindler at apache.org
Tue Jun 15 12:06:40 UTC 2021
Hi Maurizio,
> Hi Uwe,
> I'm doing more benchmarks and, while I see that for _small_ segments
> (~50 elements), the bulk copy is slower than unsafe, I don't see the
> allocation, even if I disable inlining in JMH.
>
> The reason why copying is slower for small segments is that, I think, my
> benchmark iterations run in 4-5ns and, at these levels, you are
> sensitive to the cost of the liveness checks. There's not a lot we can
> do to eliminate that - it's the price of safety.
OK, that's true. But wasn't my own Unsafe.copyMemory code, which calls MemorySegment.address().toRawLongValue(), also doing the liveness check?
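For context, the pattern I have in mind looks roughly like this (a minimal sketch, not the actual Lucene code; the class and method names and the way Unsafe is obtained are my own):

```
import java.lang.reflect.Field;
import jdk.incubator.foreign.MemorySegment;
import sun.misc.Unsafe;

final class UnsafeCopySketch {
    private static final Unsafe UNSAFE;
    static {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            UNSAFE = (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Copy nbytes from a heap byte[] into a native segment via Unsafe.
    // The destination address is still obtained through the segment, which is
    // why I assumed the liveness check happens on this path as well.
    static void copyToSegment(MemorySegment dst, long dstOffset,
                              byte[] src, int srcOffset, int nbytes) {
        long dstAddr = dst.address().toRawLongValue() + dstOffset;
        UNSAFE.copyMemory(src, Unsafe.ARRAY_BYTE_BASE_OFFSET + srcOffset,
                          null, dstAddr, nbytes);
    }
}
```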
> Now, while benchmarking, I realized that all the benchmarks I have tried
> so far were of this kind:
>
> ```
> segment.asSlice(srcOffset, nbytes)
>     .copyFrom(bytesSegment.asSlice(targetOffset, nbytes));
> ```
>
> In your case though, you create the heap segment on the fly, something
> like this:
>
> ```
> segment.asSlice(srcOffset, nbytes)
>     .copyFrom(MemorySegment.ofArray(bytes).asSlice(targetOffset, nbytes));
> ```
>
> And I can indeed see that the allocation profile for the two cases is
> different - in the former (segment_small_copy) there's no allocation,
> while in the latter (segment_small_copy_fresh) I can see some.
>
> ```
> TestSmallCopy.segment_small_copy                                     avgt   30  7.039 ± 0.268  ns/op
> TestSmallCopy.segment_small_copy:·gc.alloc.rate                      avgt   30  ≈ 10⁻⁴         MB/sec
> TestSmallCopy.segment_small_copy:·gc.alloc.rate.norm                 avgt   30  ≈ 10⁻⁴         B/op
> TestSmallCopy.segment_small_copy:·gc.count                           avgt   30  ≈ 0            counts
> TestSmallCopy.segment_small_copy_fresh                               avgt   30  6.870 ± 0.469  ns/op
> TestSmallCopy.segment_small_copy_fresh:·gc.alloc.rate                avgt   30  0.366 ± 1.337  MB/sec
> TestSmallCopy.segment_small_copy_fresh:·gc.alloc.rate.norm           avgt   30  0.003 ± 0.010  B/op
> TestSmallCopy.segment_small_copy_fresh:·gc.churn.G1_Eden_Space       avgt   30  0.799 ± 2.925  MB/sec
> TestSmallCopy.segment_small_copy_fresh:·gc.churn.G1_Eden_Space.norm  avgt   30  0.006 ± 0.022  B/op
> TestSmallCopy.segment_small_copy_fresh:·gc.count                     avgt   30  1.000          counts
> TestSmallCopy.segment_small_copy_fresh:·gc.time                      avgt   30  1.000          ms
> ```
>
> After looking at this I immediately realized what the issue could be -
> and it is, albeit indirectly, related to long loop optimizations
> (again). To speed up "small" segments (whose size fits in an integer) we
> check the segment size at creation and set a flag, using this method:
>
> ```
> static int defaultAccessModes(long size) {
>     return (enableSmallSegments && size < Integer.MAX_VALUE) ?
>             SMALL : 0;
> }
> ```
>
> Now, doing this in the middle of code where you expect escape analysis
> to kick in is plain evil: escape analysis is super sensitive to control
> flow. And, sadly, the above code uses control flow. This means that
> escape analysis would be disabled in this particular case.
Thanks for this. I had already stumbled over the SMALL segment stuff (see my comment below).
This explains why I only see memory allocations for those array wrappers and not for the many slice dupes!
> The good news is that in the case of heap segment wrappers, we know
> exactly that the size is gonna fit inside an int (it's an array after
> all!), so we can remove the check. If we remove the flag check, and just
> set the segment to always be treated as "SMALL", we get the following
> results:
Is this really true? What about a huge long[] array, which could be (2^31 - 8 maximum array length) * 8 bytes per long ≈ 16 gigabytes? (Or is the maximum array size limited by the Java specification?)
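Just to spell out the arithmetic behind my concern (my own back-of-the-envelope check, assuming the usual VM limit of roughly Integer.MAX_VALUE - 8 array elements):

```
// My own back-of-the-envelope check: the byte size of a maximum-length long[]
// does not fit into an int, even though its element count does.
// (Assumes the common VM limit of about Integer.MAX_VALUE - 8 array elements.)
public class LongArrayByteSize {
    public static void main(String[] args) {
        long maxElements = Integer.MAX_VALUE - 8L;        // ~2^31 elements
        long byteSize = maxElements * Long.BYTES;         // 17_179_869_112 bytes, ~16 GiB
        System.out.println(byteSize > Integer.MAX_VALUE); // prints: true
    }
}
```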
>
> ```
> Benchmark                                                    Mode  Cnt  Score   Error  Units
> TestSmallCopy.segment_small_copy_fresh                       avgt   30  7.039 ± 0.264  ns/op
> TestSmallCopy.segment_small_copy_fresh:·gc.alloc.rate        avgt   30  ≈ 10⁻⁴         MB/sec
> TestSmallCopy.segment_small_copy_fresh:·gc.alloc.rate.norm   avgt   30  ≈ 10⁻⁴         B/op
> TestSmallCopy.segment_small_copy_fresh:·gc.count              avgt   30  ≈ 0            counts
> ```
>
> Where the allocation profile is back to normal. Note that all the above
> benchmarks use `@CompilerControl(DONT_INLINE)`, so the results should be
> pretty independent of whether the method doing the copy is inlined or
> not (e.g. with @ForceInline).
>
> I will file a PR for Panama soon - if you could test that directly it
> would be great and we could even integrate into 17.
Thanks, this is great!
> Of course this doesn't remove the need for "easier to use" bulk copy
> methods, but I think we need to make sure that wrapping segments like
> your code does should not add any unwanted costs.
>
> Also note: even with the allocation fix, performance of the bulk copy
> with segments with few elements is still somewhat inferior compared with
> unsafe (because of the extra bounds and liveness checks). If you are only
> copying a dozen elements you might be better off with a loop (the
> byte buffer API had such an internal threshold, the memory segment API
> doesn't). What I'm saying is: as horrible as the allocation stats might
> look, we shouldn't jump to the conclusion that the slowdown is caused by
> additional allocation. At least in the benchmark I have, fixing
> allocation does NOT bring segment bulk copy performance 100% on par with
> unsafe:
>
> ```
> Benchmark                                 Mode  Cnt  Score   Error  Units
> TestSmallCopy.segment_small_copy          avgt   30  5.464 ± 0.374  ns/op
> TestSmallCopy.segment_small_copy_fresh    avgt   30  5.600 ± 0.366  ns/op
> TestSmallCopy.unsafe_small_copy           avgt   30  4.468 ± 0.322  ns/op
> ```
>
> So, while we will try to fix the code in a way that rectifies the escape
> analysis woes, I think there is still some homework left to do :-)
Yes, thanks. But this would also be a task for the MemoryCopy class in your recent pull request!
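To make concrete what I mean, something along these lines (a minimal sketch against the current incubator API; the class name, the threshold value, and the method signature are made up, only the loop-vs-bulk idea matters):

```
import jdk.incubator.foreign.MemoryAccess;
import jdk.incubator.foreign.MemorySegment;

final class CopyWithThreshold {
    // Made-up cut-off: below it we copy element-wise, above it we do a bulk copyFrom.
    // The ByteBuffer implementation uses a similar internal threshold.
    private static final int THRESHOLD_BYTES = 16;

    static void copyToSegment(byte[] src, int srcOffset,
                              MemorySegment dst, long dstOffset, int nbytes) {
        if (nbytes < THRESHOLD_BYTES) {
            // small copy: plain loop, no slice or wrapper objects involved
            for (int i = 0; i < nbytes; i++) {
                MemoryAccess.setByteAtOffset(dst, dstOffset + i, src[srcOffset + i]);
            }
        } else {
            // large copy: bulk copyFrom of two slices
            dst.asSlice(dstOffset, nbytes)
               .copyFrom(MemorySegment.ofArray(src).asSlice(srcOffset, nbytes));
        }
    }
}
```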
> There is a new class called MemoryCopy, with several static methods
> whose signatures are vaguely similar to those of the bulk methods in the
> ByteBuffer API. So, instead of calling MemorySegment::copyFrom, you
> should call MemoryCopy::copyXYZ, which is a static method annotated with
> @ForceInline (at least to test that it works as expected).
Sorry, I missed that class. I opened the pull request and only looked at the red/green changes in MemorySegment, so I overlooked the new class entirely.
The methods provided there exactly replace my three shapes in the code, and on top of that they allow byte swapping. Side note: MemoryCopy in the current PR has the swapping code commented out.
Last question: why does https://bugs.openjdk.java.net/browse/JDK-8223051 not help with the integer vs. long loop variables? Doesn't it allow removing the small vs. large segment distinction? I thought this change in HotSpot was added to support long loops, at least when the trip count is in fact "small".
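For reference, this is the kind of loop-shape distinction I mean (my own illustration, not JDK code; the method names are made up and it assumes the JDK 17 incubator API):

```
import jdk.incubator.foreign.MemoryAccess;
import jdk.incubator.foreign.MemorySegment;

final class LoopShapes {
    // int induction variable: a classic counted loop that C2 has long optimized.
    // As far as I understand, this is roughly what the "small" segment flag is about.
    // (Assumes segment.byteSize() fits in an int.)
    static long sumSmall(MemorySegment segment) {
        long sum = 0;
        for (int offset = 0; offset < (int) segment.byteSize(); offset += Long.BYTES) {
            sum += MemoryAccess.getLongAtOffset(segment, offset);
        }
        return sum;
    }

    // long induction variable: only optimized well once HotSpot learned about
    // long counted loops (the JDK-8223051 line of work).
    static long sumLarge(MemorySegment segment) {
        long sum = 0;
        for (long offset = 0; offset < segment.byteSize(); offset += Long.BYTES) {
            sum += MemoryAccess.getLongAtOffset(segment, offset);
        }
        return sum;
    }
}
```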
Uwe