[foreign-memaccess+abi] RFR: 8268743: Require a better way for copying data between MemorySegments and on-heap arrays
Uwe Schindler
uschindler at openjdk.java.net
Wed Jun 16 07:49:46 UTC 2021
On Tue, 15 Jun 2021 11:56:26 GMT, Maurizio Cimadamore <mcimadamore at openjdk.org> wrote:
> After some investigation, it seems that escape analysis is defeated in cases where a new heap segment is created fresh just before performing a bulk copy.
>
> This is caused by the fact that, on segment creation, we perform this test:
>
>
>     static int defaultAccessModes(long size) {
>         return (enableSmallSegments && size < Integer.MAX_VALUE) ?
>                 SMALL : 0;
>     }
>
>
> This makes sure that segments whose size fits in an `int` do not incur the penalties associated with the lack of bound-check optimizations for loops with `long` bounds.
>
> Unfortunately, this logic is control flow logic, and control flow disables escape analysis optimizations.
>
> For segment wrappers around byte arrays we can work around this by removing the check (all byte segments are small by definition, since there is a 1-1 mapping between logical elements and physical bytes). For other segment kinds we cannot do much.
>
> While it would be possible, in principle, to resort to more complex bound checks for heap segments, we believe the way forward is to eliminate the need for "small" segments, which will be possible once the PR below is completed:
>
> https://github.com/openjdk/jdk/pull/2045
Hi,
we have clearly seen the allocation, but let me bring another part into the game: our segments are in a shared ResourceScope.
When comparing the benchmarks (which contain a lot more code than just the segment copy, because we were testing query performance), one thing stood out: with our patch that used the slicing/copyFrom variant, it was obvious even from the outside that something was wrong. Each iteration of the whole Lucene benchmark was about 20% slower in total runtime (from startup to the end of each iteration), combined with garbage collection activity (also visible in JFR). So something in our case was triggering the allocations. The difference in runtime between the unpatched and patched Lucene benchmark was worse with tiered compilation disabled (for historical reasons the Lucene benchmark runs with tiered compilation off and -Xbatch on). See the outputs at https://github.com/apache/lucene/pull/177
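The control-flow point from the quoted analysis can be sketched in plain Java. This is a minimal illustration, not the JDK code: the class name, fields, and `wrapAndQuery` method are made up, but the shape mirrors `defaultAccessModes` above, where a fresh wrapper allocated just before a bulk copy carries a branch in its initialization path, and that control-flow merge is what can prevent C2's escape analysis from scalar-replacing the allocation.

```java
// Hypothetical sketch of the pattern described in the quoted analysis.
// Names are assumptions; only the branch-in-constructor shape matters.
class SmallSegmentSketch {
    static final boolean ENABLE_SMALL_SEGMENTS = true;
    static final int SMALL = 1;

    final int accessModes;

    SmallSegmentSketch(long size) {
        // Branch on a runtime value, as in defaultAccessModes(): this
        // introduces a control-flow merge before the field store.
        this.accessModes = (ENABLE_SMALL_SEGMENTS && size < Integer.MAX_VALUE)
                ? SMALL : 0;
    }

    // Models "create a fresh heap segment just before a bulk copy":
    // ideally the wrapper is scalar-replaced and never really allocated,
    // but the branch above can defeat that optimization.
    static int wrapAndQuery(byte[] src) {
        SmallSegmentSketch segment = new SmallSegmentSketch(src.length);
        return segment.accessModes;
    }

    public static void main(String[] args) {
        System.out.println(wrapAndQuery(new byte[16])); // prints 1
    }
}
```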
From your latest benchmarks it looks like, in addition to the allocation we have seen, there is some other slowdown when doing the segment memory copy. From my understanding this mostly affects small segments, because the overhead of the liveness check outweighs the copy itself. But isn't the liveness check also done for var handle access (although optimized away after some time)? Rewriting to a loop would then not bring much, as the liveness check still needs to be done at least at the start of the loop once optimizations kick in. What is the difference in the liveness check (on the shared memory segment) between copyMemory and a single var handle call?
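The bulk-vs-element distinction being asked about can be sketched like this. The class and method names (`checkLive`, `bulkCopy`, `elementCopy`) are stand-ins for the real ResourceScope/MemorySegment internals, not the actual JDK implementation: the bulk path pays one liveness check per copy, while the element path conceptually pays one per access until the JIT hoists it out of the loop.

```java
// Hypothetical sketch contrasting one liveness check per bulk copy with
// one check per element access. Not the JDK code; names are assumptions.
class LivenessSketch {
    volatile boolean alive = true;

    // Stand-in for the shared-scope liveness test.
    void checkLive() {
        if (!alive) throw new IllegalStateException("Already closed");
    }

    // Bulk path (copyMemory-style): a single liveness check guards
    // the whole copy, so its relative cost grows for small segments.
    void bulkCopy(byte[] dst, byte[] src) {
        checkLive();
        System.arraycopy(src, 0, dst, 0, src.length);
    }

    // Element path (var-handle-style loop): conceptually one check per
    // access; the JIT can usually hoist it out of the loop, leaving at
    // least one check before the loop starts.
    void elementCopy(byte[] dst, byte[] src) {
        for (int i = 0; i < src.length; i++) {
            checkLive();
            dst[i] = src[i];
        }
    }

    public static void main(String[] args) {
        byte[] src = {1, 2, 3, 4};
        byte[] dst = new byte[4];
        new LivenessSketch().bulkCopy(dst, src);
        System.out.println(dst[3]); // prints 4
    }
}
```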
> If that would help narrow things down, we could try adding such a static method at least in the Panama repo and maybe we could test that with Lucene? At least we would have removed allocations out of the equation (but I suspect that allocation is not where the performance slowdown is coming from). @uschindler let me know what would be the easiest for you to try.
Sure, I can run the benchmarks with a Linux build of Panama! I can compile my own on this machine (though not today or tomorrow), but if you have a tar.gz with a prebuilt binary available, that would be great.
-------------
PR: https://git.openjdk.java.net/panama-foreign/pull/560
More information about the panama-dev mailing list