[foreign-memaccess+abi] RFR: 8268743: Require a better way for copying data between MemorySegments and on-heap arrays

Maurizio Cimadamore maurizio.cimadamore at oracle.com
Wed Jun 16 08:57:20 UTC 2021


On 16/06/2021 08:49, Uwe Schindler wrote:
> On Tue, 15 Jun 2021 11:56:26 GMT, Maurizio Cimadamore <mcimadamore at openjdk.org> wrote:
>
>> After some investigation, it seems that escape analysis is defeated in cases where a new heap segment is created fresh just before performing a bulk copy.
>>
>> This is caused by the fact that, on segment creation, we perform this test:
>>
>>
>> static int defaultAccessModes(long size) {
>>          return (enableSmallSegments && size < Integer.MAX_VALUE) ?
>>                  SMALL : 0;
>>      }
>>
>>
>> To make sure that segments whose size fits in an `int` do not incur the penalties associated with the lack of bound check optimizations for long-indexed loops.
>>
>> Unfortunately, this check introduces control flow, and control flow disables escape analysis optimizations.
>>
>> For segment wrappers around byte arrays we can work around this by removing the check (all byte segments are small by definition, since there's a 1-1 mapping between logical elements and physical bytes). For other segment kinds we cannot do much.
>>
>> While it would be possible, in principle, to resort to more complex bound checks for heap segments, we believe the way forward is to eliminate the need for "small" segments, which will be possible once the PR below is completed:
>>
>> https://github.com/openjdk/jdk/pull/2045
> Hi,
>
> we have clearly seen the allocation, but let me bring another aspect into play: our segments are in a shared ResourceScope.
A shared scope shouldn't affect the allocation profile. And slicing 
with shared scopes doesn't cost extra; only closing a shared scope 
costs extra. So I think we can rule this out.
>
> When comparing the benchmarks (which contain a lot more code than just the segment copy, because we were testing query performance) there was one thing I noticed: with our patch that used the slicing/copyFrom variant, it was also obvious from the outside that something was wrong: each iteration of the whole Lucene benchmark was about 20% slower in total runtime (from startup till the end of each iteration), in combination with garbage collection activity (which was also seen by JFR). So something in our case was triggering the allocations. The difference in runtime between the unpatched and patched Lucene benchmark was worse with tiered compilation disabled (for historical reasons the Lucene benchmark runs with tiered off and -Xbatch on). See outputs on https://github.com/apache/lucene/pull/177
>
> From your latest benchmarks it looks like, in addition to the allocation we have seen, there is some other slowdown when doing the segment memory copy. From my understanding this mostly affects small segments, because the overhead of the liveness check dominates. But isn't the liveness check also done for var handle access (although optimized away after some time), so rewriting to a loop would not bring much, as the liveness check needs to be done at least at the start of the loop once optimizations kick in? What is the difference in the liveness check (on the shared memory segment) between copyMemory and a single var handle call?

So, in part this is known: the extra checks (liveness and bounds) 
performed prior to memory access inevitably cost more than a plain, 
unchecked Unsafe access. So, if you do a _single_ var handle access and 
you benchmark it against a _single_ Unsafe access, it is slower - we 
have benchmarks for this stuff. In reality (1) there's not much we can 
do (we can't compete with _all belts off_ mode) and (2) the use case of 
a single "hot" memory access is not very realistic. Typically you will 
access memory in loops, and in those cases the memory API can take 
advantage of C2 optimizations: checks are hoisted out of the loop, and 
no extra price is paid in the end. This is the same deal as with the 
ByteBuffer API.
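
To make the loop case concrete, here is a minimal sketch (assuming the 
jdk.incubator.foreign API from this time frame, with MemorySegment.ofArray 
and MemoryAccess.getByteAtOffset); once such a loop is hot, C2 can hoist 
the liveness and bounds checks out of it, so the checked access costs 
about the same as a raw one:

import jdk.incubator.foreign.MemoryAccess;
import jdk.incubator.foreign.MemorySegment;

class LoopAccessSketch {
    // sums a segment byte by byte; in a hot loop C2 can hoist the
    // liveness/bounds checks, so they are not paid on every element
    static long sum(MemorySegment segment) {
        long sum = 0;
        for (long i = 0; i < segment.byteSize(); i++) {
            sum += MemoryAccess.getByteAtOffset(segment, i);
        }
        return sum;
    }
}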

The issue with memory copy is that memory copy _is_ typically used in a 
standalone fashion, i.e. outside of a loop. I'm pretty sure that if I 
repeated the benchmark by wrapping the copy in a loop and doing, say, 
100 iterations, the difference with Unsafe would disappear. But that's 
_not_ how users will use `copyFrom`.
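
For reference, here is a rough sketch of the slicing-based pattern under 
discussion (the class and parameter names are just for illustration; the 
calls are from the incubator API) - the copy is a one-off operation, not 
something sitting inside a hot loop, so the checks on the freshly created 
heap segment and slice are paid in full:

import jdk.incubator.foreign.MemorySegment;

class SliceCopySketch {
    // copies an on-heap array into dstSegment at dstOffset using the
    // slicing-based copyFrom discussed in this thread; both the heap
    // segment and the slice are freshly allocated wrapper objects
    static void copy(byte[] src, MemorySegment dstSegment, long dstOffset) {
        MemorySegment heapSegment = MemorySegment.ofArray(src);
        dstSegment.asSlice(dstOffset, src.length).copyFrom(heapSegment);
    }
}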

What remains to be seen is whether this extra cost (even compared with 
other APIs like ByteBuffer) looks like a missing C2 optimization, or 
like something not optimal in our implementation. We plan to 
investigate more and try to narrow this down.

>
>> If that would help narrow things down, we could try adding such a static method, at least in the Panama repo, and maybe we could test that with Lucene? At least we would have taken allocations out of the equation (but I suspect that allocation is not where the performance slowdown is coming from). @uschindler let me know what would be the easiest for you to try.
> Sure, I can run the benchmarks with a Linux build of Panama! I can compile my own on this machine (though not today or tomorrow), but if you have a tar.gz file with a prebuilt binary available, that would be great.

I believe the thing that would make the most sense would be for you to 
build the bits in this branch:

https://github.com/mcimadamore/panama-foreign/tree/small_copy_benchmark/

This is like the Panama repo, but it has a new static method in 
MemorySegment to do the copy.

If using the static method instead of the slicing-based copyFrom removes 
all the allocation AND it brings the benchmarks back to where they 
were, we can consider doing something like this. The benchmark I have in 
that branch doesn't seem to suggest that the static method is any faster 
than the other one, but it is always hard to generalize microbenchmark 
results - for instance, it might be that your copy code is "cold" and is 
never touched by C2, or maybe it is touched only by one of the earlier 
compiler tiers.

So, let's rule out allocation and GC cost first - if you use the static 
method, you should no longer need to slice and dice before a copy. This 
should at least remove the allocation you see: let's see if that's 
enough to bring Lucene performance back to where it was.
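
To show what I mean, here is a sketch of how a call site could look with 
a static copy method (the method name and parameter order below are 
assumptions for illustration, not necessarily the API exposed by the 
branch above) - the point is that nothing needs to be sliced or wrapped 
at the call site, so there is nothing left to allocate:

import jdk.incubator.foreign.MemorySegment;

class StaticCopySketch {
    // hypothetical static copy helper (name and signature assumed):
    // copies a byte[] range straight into a segment, with no slice and
    // no heap segment wrapper created at the call site
    static void copy(byte[] src, MemorySegment dstSegment, long dstOffset) {
        MemorySegment.copyFromArray(src, 0, src.length, dstSegment, dstOffset);
    }
}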

Maurizio

>
> -------------
>
> PR: https://git.openjdk.java.net/panama-foreign/pull/560

