Memory Segment efficient array handling
Uwe Schindler
uschindler at apache.org
Tue Jun 15 12:06:40 UTC 2021
Hi Maurizio,
> Hi Uwe,
> I'm doing more benchmarks and, while I see that for _small_ segments
> (~50 elements), the bulk copy is slower than unsafe, I don't see the
> allocation, even if I disable inlining in JMH.
>
> The reason why copying is slower for small segments is that, I think, my
> benchmark iterations run in 4-5ns and, at these levels, you are
> sensitive to the cost of the liveness checks. There's not a lot we can
> do to eliminate that - it's the price of safety.
OK, that's true. But wasn't my own Unsafe.copyMemory code, which calls MemorySegment.address().toRawLongValue(), also doing the liveness check?
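For context, the pattern I have in mind looks roughly like this (a minimal sketch, not the actual Lucene code; the class and method names and the way Unsafe is obtained are my own):

```
import java.lang.reflect.Field;
import jdk.incubator.foreign.MemorySegment;
import sun.misc.Unsafe;

final class UnsafeCopySketch {
    private static final Unsafe UNSAFE;
    static {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            UNSAFE = (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Copy nbytes from a heap byte[] into a native segment via Unsafe.
    // The destination address is still obtained through the segment, which is
    // why I assumed the liveness check happens on this path as well.
    static void copyToSegment(MemorySegment dst, long dstOffset,
                              byte[] src, int srcOffset, int nbytes) {
        long dstAddr = dst.address().toRawLongValue() + dstOffset;
        UNSAFE.copyMemory(src, Unsafe.ARRAY_BYTE_BASE_OFFSET + srcOffset,
                          null, dstAddr, nbytes);
    }
}
```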
> Now, while benchmarking, I realized that all the benchmarks I have tried
> so far were of this kind:
>
> ```
> segment.asSlice(srcOffset, nbytes)
>     .copyFrom(bytesSegment.asSlice(targetOffset, nbytes));
> ```
>
> In your case though, you create the heap segment on the fly, something
> like this:
>
> ```
> segment.asSlice(srcOffset, nbytes)
>     .copyFrom(MemorySegment.ofArray(bytes).asSlice(targetOffset, nbytes));
> ```
>
> And I can indeed see that the allocation profile for the two cases is
> different - in the former (segment_small_copy) there's no allocation,
> while in the latter (segment_small_copy_fresh) I can see some.
>
> ```
> TestSmallCopy.segment_small_copy                                     avgt   30  7.039 ± 0.268  ns/op
> TestSmallCopy.segment_small_copy:·gc.alloc.rate                      avgt   30  ≈ 10⁻⁴         MB/sec
> TestSmallCopy.segment_small_copy:·gc.alloc.rate.norm                 avgt   30  ≈ 10⁻⁴         B/op
> TestSmallCopy.segment_small_copy:·gc.count                           avgt   30  ≈ 0            counts
> TestSmallCopy.segment_small_copy_fresh                               avgt   30  6.870 ± 0.469  ns/op
> TestSmallCopy.segment_small_copy_fresh:·gc.alloc.rate                avgt   30  0.366 ± 1.337  MB/sec
> TestSmallCopy.segment_small_copy_fresh:·gc.alloc.rate.norm           avgt   30  0.003 ± 0.010  B/op
> TestSmallCopy.segment_small_copy_fresh:·gc.churn.G1_Eden_Space       avgt   30  0.799 ± 2.925  MB/sec
> TestSmallCopy.segment_small_copy_fresh:·gc.churn.G1_Eden_Space.norm  avgt   30  0.006 ± 0.022  B/op
> TestSmallCopy.segment_small_copy_fresh:·gc.count                     avgt   30  1.000          counts
> TestSmallCopy.segment_small_copy_fresh:·gc.time                      avgt   30  1.000          ms
> ```
>
> After looking at this I immediately realized what the issue could be -
> and it is, albeit indirectly, related to long loop optimizations
> (again). To speed up "small" segments (whose size fits in an integer) we
> check the segment size at creation and set a flag, using this method:
>
> ```
> static int defaultAccessModes(long size) {
>     return (enableSmallSegments && size < Integer.MAX_VALUE) ?
>             SMALL : 0;
> }
> ```
>
> Now, doing this in the middle of code where you expect escape analysis
> to kick in is plain evil: escape analysis is super sensitive to control
> flow. And, sadly, the above code uses control flow. This means that
> escape analysis would be disabled in this particular case.
Thanks for this. I had already stumbled over the SMALL segment stuff (see my comment below).
This explains why I only see memory allocations for those array wrappers and not for the many slice dupes!
> The good news is that in the case of heap segment wrappers, we know
> exactly that the size is gonna fit inside an int (it's an array after
> all!), so we can remove the check. If we remove the flag check, and just
> set the segment to always be treated as "SMALL", we get the following
> results:
Is this really true? What about a huge long[] array, which could be (2^31 - 8 maximum array length) * 8 bytes per long ≈ 16 gigabytes? (Or is the maximum array size limited by the Java specification?)
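Just to spell out the arithmetic behind my concern (my own back-of-the-envelope check, assuming the usual VM limit of roughly Integer.MAX_VALUE - 8 array elements):

```
// My own back-of-the-envelope check: the byte size of a maximum-length long[]
// does not fit into an int, even though its element count does.
// (Assumes the common VM limit of about Integer.MAX_VALUE - 8 array elements.)
public class LongArrayByteSize {
    public static void main(String[] args) {
        long maxElements = Integer.MAX_VALUE - 8L;        // ~2^31 elements
        long byteSize = maxElements * Long.BYTES;         // 17_179_869_112 bytes, ~16 GiB
        System.out.println(byteSize > Integer.MAX_VALUE); // prints: true
    }
}
```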
>
> ```
> Benchmark                                                    Mode  Cnt  Score   Error  Units
> TestSmallCopy.segment_small_copy_fresh                       avgt   30  7.039 ± 0.264  ns/op
> TestSmallCopy.segment_small_copy_fresh:·gc.alloc.rate        avgt   30  ≈ 10⁻⁴         MB/sec
> TestSmallCopy.segment_small_copy_fresh:·gc.alloc.rate.norm   avgt   30  ≈ 10⁻⁴         B/op
> TestSmallCopy.segment_small_copy_fresh:·gc.count              avgt   30  ≈ 0            counts
> ```
>
> Where the allocation profile is back to normal. Note that all the above
> benchmarks use `@CompilerControl(DONT_INLINE)`, so the results should be
> pretty independent of whether the method doing the copy is inlined or
> not (e.g. with @ForceInline).
>
> I will file a PR for Panama soon - if you could test that directly it
> would be great and we could even integrate into 17.
Thanks, this is great!
> Of course this doesn't remove the need for "easier to use" bulk copy
> methods, but I think we need to make sure that wrapping segments like
> your code does should not add any unwanted costs.
>
> Also note: even with the allocation fix, performance of the bulk copy
> with segments with few elements is still somewhat inferior compared with
> unsafe (because of the extra bounds and liveness checks). If you are only
> copying a dozen elements you might be better off with a loop (the
> byte buffer API had such an internal threshold, the memory segment API
> doesn't). What I'm saying is: as horrible as the allocation stats might
> look, we shouldn't jump to the conclusion that the slowdown is caused by
> additional allocation. At least in the benchmark I have, fixing
> allocation does NOT bring segment bulk copy performance 100% on par with
> unsafe:
>
> ```
> Benchmark                                 Mode  Cnt  Score   Error  Units
> TestSmallCopy.segment_small_copy          avgt   30  5.464 ± 0.374  ns/op
> TestSmallCopy.segment_small_copy_fresh    avgt   30  5.600 ± 0.366  ns/op
> TestSmallCopy.unsafe_small_copy           avgt   30  4.468 ± 0.322  ns/op
> ```
>
> So, while we will try to fix the code in a way that rectifies the escape
> analysis woes, I think there is still some homework left to do :-)
Yes, thanks. But this would also be a task for the MemoryCopy class in your recent pull request!
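To make concrete what I mean, something along these lines (a minimal sketch against the current incubator API; the class name, the threshold value, and the method signature are made up, only the loop-vs-bulk idea matters):

```
import jdk.incubator.foreign.MemoryAccess;
import jdk.incubator.foreign.MemorySegment;

final class CopyWithThreshold {
    // Made-up cut-off: below it we copy element-wise, above it we do a bulk copyFrom.
    // The ByteBuffer implementation uses a similar internal threshold.
    private static final int THRESHOLD_BYTES = 16;

    static void copyToSegment(byte[] src, int srcOffset,
                              MemorySegment dst, long dstOffset, int nbytes) {
        if (nbytes < THRESHOLD_BYTES) {
            // small copy: plain loop, no slice or wrapper objects involved
            for (int i = 0; i < nbytes; i++) {
                MemoryAccess.setByteAtOffset(dst, dstOffset + i, src[srcOffset + i]);
            }
        } else {
            // large copy: bulk copyFrom of two slices
            dst.asSlice(dstOffset, nbytes)
               .copyFrom(MemorySegment.ofArray(src).asSlice(srcOffset, nbytes));
        }
    }
}
```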
> There is a new class called MemoryCopy, with several static methods
> whose signatures are vaguely similar to those of the bulk methods in the
> ByteBuffer API. So, instead of calling MemorySegment::copyFrom, you
> should call MemoryCopy::copyXYZ, which is a static method annotated with
> @ForceInline (at least to test that it works as expected).
Sorry, I missed that class. I opened the pull request and only looked at the red/green changes in MemorySegment, so I overlooked the new class entirely.
The methods provided there exactly replace my three shapes in the code, and on top of that they allow byte swapping. Side note: MemoryCopy in the current PR has the swapping code commented out.
Last question: why does https://bugs.openjdk.java.net/browse/JDK-8223051 not help with the integer vs. long loop variables? Doesn't it allow removing the small vs. large segment distinction? I thought this change in HotSpot was added to support long loops, at least when the trip count is in fact "small".
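For reference, this is the kind of loop-shape distinction I mean (my own illustration, not JDK code; the method names are made up and it assumes the JDK 17 incubator API):

```
import jdk.incubator.foreign.MemoryAccess;
import jdk.incubator.foreign.MemorySegment;

final class LoopShapes {
    // int induction variable: a classic counted loop that C2 has long optimized.
    // As far as I understand, this is roughly what the "small" segment flag is about.
    // (Assumes segment.byteSize() fits in an int.)
    static long sumSmall(MemorySegment segment) {
        long sum = 0;
        for (int offset = 0; offset < (int) segment.byteSize(); offset += Long.BYTES) {
            sum += MemoryAccess.getLongAtOffset(segment, offset);
        }
        return sum;
    }

    // long induction variable: only optimized well once HotSpot learned about
    // long counted loops (the JDK-8223051 line of work).
    static long sumLarge(MemorySegment segment) {
        long sum = 0;
        for (long offset = 0; offset < segment.byteSize(); offset += Long.BYTES) {
            sum += MemoryAccess.getLongAtOffset(segment, offset);
        }
        return sum;
    }
}
```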
Uwe