Memory Segment efficient array handling

Maurizio Cimadamore maurizio.cimadamore at oracle.com
Tue Jun 15 11:19:44 UTC 2021


Hi Uwe,
I'm doing some more benchmarks and, while I see that for _small_ segments 
(~50 elements) the bulk copy is slower than unsafe, I don't see the 
allocation, even if I disable inlining in JMH.

The reason copying is slower for small segments is, I think, that my 
benchmark iterations run in 4-5ns and, at that scale, you become 
sensitive to the cost of the liveness checks. There's not a lot we can 
do to eliminate that - it's the price of safety.

Now, while benchmarking, I realized that all the benchmarks I had tried 
so far were of this kind:

```
segment.asSlice(srcOffset, nbytes)
       .copyFrom(bytesSegment.asSlice(targetOffset, nbytes));
```

In your case though, you create the heap segment on the fly, something 
like this:

```
segment.asSlice(srcOffset, nbytes)
       .copyFrom(MemorySegment.ofArray(bytes).asSlice(targetOffset, nbytes));
```

And I can indeed see that the allocation profile for the two cases is 
different - in the former (segment_small_copy) there's no allocation, 
while in the latter (segment_small_copy_fresh) I can see some:

```
Benchmark                                                             Mode  Cnt   Score    Error   Units
TestSmallCopy.segment_small_copy                                      avgt   30   7.039 ±  0.268   ns/op
TestSmallCopy.segment_small_copy:·gc.alloc.rate                       avgt   30  ≈ 10⁻⁴           MB/sec
TestSmallCopy.segment_small_copy:·gc.alloc.rate.norm                  avgt   30  ≈ 10⁻⁴             B/op
TestSmallCopy.segment_small_copy:·gc.count                            avgt   30     ≈ 0           counts
TestSmallCopy.segment_small_copy_fresh                                avgt   30   6.870 ±  0.469   ns/op
TestSmallCopy.segment_small_copy_fresh:·gc.alloc.rate                 avgt   30   0.366 ±  1.337  MB/sec
TestSmallCopy.segment_small_copy_fresh:·gc.alloc.rate.norm            avgt   30   0.003 ±  0.010    B/op
TestSmallCopy.segment_small_copy_fresh:·gc.churn.G1_Eden_Space        avgt   30   0.799 ±  2.925  MB/sec
TestSmallCopy.segment_small_copy_fresh:·gc.churn.G1_Eden_Space.norm   avgt   30   0.006 ±  0.022    B/op
TestSmallCopy.segment_small_copy_fresh:·gc.count                      avgt   30   1.000           counts
TestSmallCopy.segment_small_copy_fresh:·gc.time                       avgt   30   1.000               ms
```
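
For reference, here is a rough JMH sketch of the two benchmark shapes discussed 
above. Field names, sizes and the setup are my assumptions, not the actual 
TestSmallCopy source:

```
// Sketch only: shapes of the two benchmarks, not the real TestSmallCopy code.
import jdk.incubator.foreign.MemorySegment;
import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
public class SmallCopySketch {
    static final int NBYTES = 50;                                       // "small" segment, ~50 elements
    final byte[] bytes = new byte[NBYTES];
    final MemorySegment segment = MemorySegment.allocateNative(NBYTES); // JDK 16 incubator overload
    final MemorySegment bytesSegment = MemorySegment.ofArray(bytes);    // heap segment wrapped once

    @Benchmark
    public void segment_small_copy() {
        copy(segment, bytesSegment, NBYTES);        // reuses the heap segment
    }

    @Benchmark
    public void segment_small_copy_fresh() {
        copyFresh(segment, bytes, NBYTES);          // wraps the array on every call
    }

    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    static void copy(MemorySegment dst, MemorySegment src, int nbytes) {
        dst.asSlice(0, nbytes).copyFrom(src.asSlice(0, nbytes));
    }

    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    static void copyFresh(MemorySegment dst, byte[] src, int nbytes) {
        dst.asSlice(0, nbytes).copyFrom(MemorySegment.ofArray(src).asSlice(0, nbytes));
    }
}
```

The numbers above come from runs of this general shape.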

After looking at those allocation numbers I immediately realized what the 
issue could be - and it is, albeit indirectly, related to long loop 
optimizations (again). To speed up "small" segments (whose size fits in an 
integer), we check the segment size at creation and set a flag, using this 
method:

```
static int defaultAccessModes(long size) {
    return (enableSmallSegments && size < Integer.MAX_VALUE) ?
            SMALL : 0;
}
```

Now, doing this in the middle of code where you expect escape analysis 
to kick in is plain evil: escape analysis is super sensitive to control 
flow, and, sadly, the above code uses control flow. This means that 
escape analysis is effectively disabled in this particular case.

The good news is that for heap segment wrappers we know the size will 
always fit in an int (it's an array, after all!), so we can remove the 
check. If we drop the flag check and simply treat the segment as always 
"SMALL", we get the following results:

```
Benchmark                                                    Mode  Cnt   Score    Error   Units
TestSmallCopy.segment_small_copy_fresh                       avgt   30   7.039 ±  0.264   ns/op
TestSmallCopy.segment_small_copy_fresh:·gc.alloc.rate        avgt   30  ≈ 10⁻⁴           MB/sec
TestSmallCopy.segment_small_copy_fresh:·gc.alloc.rate.norm   avgt   30  ≈ 10⁻⁴             B/op
TestSmallCopy.segment_small_copy_fresh:·gc.count             avgt   30     ≈ 0           counts
```

Here the allocation profile is back to normal. Note that all the above 
benchmarks use `@CompilerControl(DONT_INLINE)`, so the results should be 
largely independent of whether the method doing the copy is inlined or 
not (e.g. with @ForceInline).
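
To make the shape of the fix concrete, here is a toy sketch - illustrative 
only, the constants and method names are my assumptions, not the real JDK 
internals:

```
// Toy sketch of the idea, not the actual JDK implementation.
class AccessModes {
    static final int SMALL = 0x1;
    static final boolean enableSmallSegments = true;

    // Generic factory path: the size is an arbitrary long, so a branch is
    // needed - and that branch is what upsets escape analysis at call sites
    // that wrap an array on the fly.
    static int defaultAccessModes(long size) {
        return (enableSmallSegments && size < Integer.MAX_VALUE) ? SMALL : 0;
    }

    // Heap (array-backed) path: an array length always fits in an int, so
    // the SMALL flag can be set unconditionally - no data-dependent control flow.
    static int heapAccessModes() {
        return SMALL;
    }
}
```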

I will file a PR for Panama soon - if you could test that directly it 
would be great, and we could even integrate it into 17.

Of course this doesn't remove the need for "easier to use" bulk copy 
methods, but I think we need to make sure that wrapping segments on the 
fly, as your code does, does not add any unwanted costs.

Also note: even with the allocation fix, bulk copy performance for 
segments with few elements is still somewhat inferior to unsafe (because 
of the extra bounds and liveness checks). If you are only copying a dozen 
elements you might be better off with a loop (the byte buffer API had 
such an internal threshold; the memory segment API doesn't - see the 
sketch after the numbers below). What I'm saying is: as horrible as the 
allocation stats might look, we shouldn't jump to the conclusion that the 
slowdown is caused by the additional allocation. At least in the benchmark 
I have, fixing the allocation does NOT bring segment bulk copy performance 
100% on par with unsafe:

```
Benchmark                               Mode  Cnt  Score   Error  Units
TestSmallCopy.segment_small_copy        avgt   30  5.464 ± 0.374  ns/op
TestSmallCopy.segment_small_copy_fresh  avgt   30  5.600 ± 0.366  ns/op
TestSmallCopy.unsafe_small_copy         avgt   30  4.468 ± 0.322  ns/op
```
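
To illustrate the kind of caller-side threshold mentioned above, here is a 
hypothetical helper. The cutoff value and the helper name are assumptions, 
not part of the API, and the right threshold would need measuring:

```
import jdk.incubator.foreign.MemoryAccess;
import jdk.incubator.foreign.MemorySegment;

class SmallCopyHelper {
    // Assumed cutoff; ByteBuffer uses a similar internal threshold, but the
    // right value for segments would need to be measured.
    static final int BULK_COPY_THRESHOLD = 16;

    // Copy len bytes from a heap array into a segment: element by element
    // below the threshold, bulk copyFrom above it.
    static void copyBytes(MemorySegment dst, long dstOffset,
                          byte[] src, int srcOffset, int len) {
        if (len < BULK_COPY_THRESHOLD) {
            for (int i = 0; i < len; i++) {
                MemoryAccess.setByteAtOffset(dst, dstOffset + i, src[srcOffset + i]);
            }
        } else {
            dst.asSlice(dstOffset, len)
               .copyFrom(MemorySegment.ofArray(src).asSlice(srcOffset, len));
        }
    }
}
```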

So, while we will try to fix the code in a way that rectifies the escape 
analysis woes, I think there might still be some homework left to do :-)

Some other comments inline below.

On 15/06/2021 11:28, Uwe Schindler wrote:
> Hi,
>
>> I have been keeping an eye on that.
> Thanks
>
>> You report that there's lots of heap allocation - but that's with JFR
>> enabled, which also instantiates its own events. Do you have a run w/o
>> JFR - is the allocation rate the same w/ and w/o JFR?
> I don't know the heap allocation rate without JFR, but if you look at the statistics printed out in the issue, the top allocations are in the HeapMemorySegment$OfByte code, also MemorySegment#dup() -- which is of course related to the slices we have to create in our heap/off-heap copy method.
>
> The JFR feature was only recently added to Mike McCandless' Lucene benchmark framework (which is unfortunately not yet using JMH behind the scenes). It is very complicated to create an independent benchmark showing our problem with profile pollution, because it may have to do with the complexity of the code paths that call those I/O methods (which are the hotspot in Lucene anyway): every single query execution calls the MemorySegmentIndexInput methods millions of times, sometimes sequential reads of different data types, then bulk reads of byte[], long[] or recently also float[] for vector stuff.
>
> This is why I'd like to suggest providing those "easy-use" memory copy methods in MemoryAccess (looking similar to mine), but with inlining strictly enforced. Memory copy methods from/to plain Java arrays, like you have them in the nio.*Buffer classes, are really missing.

This is what the PR I linked does - see the MemoryCopy class it introduces.

<snip>

> Hi, no I haven't done that yet, because I had no time to compile a custom JDK (I am under time pressure, because I will present the results tomorrow in a talk at the Berlin Buzzwords conference: https://urldefense.com/v3/__https://2021.berlinbuzzwords.de/session/future-lucenes-mmapdirectory-why-use-it-and-whats-coming-java-16-and-later__;!!GqivPVa7Brio!N8KPr7fbMvwJpJDUHq4gBHKUGtIK3ZPE1zPJzB_atozEZii_1dn6-zsPb08k2qg_L_UfzOE$ )
>
> But what I don't understand in this pull request: How to pass in plain byte/long/float arrays without slicing and wrapping? Or is this just about our byte order issue? So we define a memory layout for the MemorySegment that uses e.g. little endian and then we call the new copy method, which makes sure that on a big endian platform all bytes are swapped?

There is a new class called MemoryCopy, with several static methods 
whose signatures are vaguely similar to those of the bulk methods in the 
ByteBuffer API. So, instead of calling MemorySegment::copyFrom, you 
should call MemoryCopy::copyXYZ, which is a static method annotated with 
@ForceInline (at least to test that it works as expected).
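
For illustration, the call-site change would look roughly like this (the 
method name and parameter order are my assumptions - the real signatures 
are the ones in the PR):

```
// Before: wrap and slice by hand at every call site
segment.asSlice(dstOffset, nbytes)
       .copyFrom(MemorySegment.ofArray(bytes).asSlice(srcOffset, nbytes));

// After (hypothetical shape, see the MemoryCopy class in the PR):
MemoryCopy.copyFromArray(bytes, srcOffset, nbytes, segment, dstOffset);
```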

Maurizio

>
> Thanks,
> Uwe
>
>> On 15/06/2021 10:25, Uwe Schindler wrote:
>>> Hi Maurizio,
>>>
>>> I spent a lot of time analyzing the problem. It is indeed related to the
>> wrapping of heap arrays, slicing and so on. I opened a bug report:
>>> https://bugs.openjdk.java.net/browse/JDK-8268743
>>>
>>> So please think about adding an API which is highly optimized for bulk
>> copying slices between classical on-heap arrays and MemorySegments! It looks
>> like escape analysis does not work, and during our test the heap was filled
>> with millions of HeapMemorySegment#OfByte slices! Performance degraded
>> significantly, especially due to garbage collection.
>>> Long indexes in for-loops seem not to be the issue here. We proved that
>> replacing the wrap-byte-array, slice, copyFrom code with Unsafe.copyMemory
>> solves the issue, and we have Lucene's new memory mapping implementation
>> behave similarly to the old MappedByteBuffer code. (Mapped)ByteBuffer has
>> getByte(byte[], offset, count) which is missing for memory segments, and that’s
>> the reason for our pain!
>>> You can see the discussion on our latest pull request for JDK 17:
>> https://urldefense.com/v3/__https://github.com/apache/lucene/pull/177__;!!G
>> qivPVa7Brio!LHph3VUPiAkTLYY0J9EGarZ8JCdW5uOOekM2xprUEK_KLUIPmLyCy
>> hxRmsJh5viv9TMmgFI$
>>> Uwe
>>>
>>> -----
>>> Uwe Schindler
>>> uschindler at apache.org
>>> ASF Member, Member of PMC and Committer of Apache Lucene and Apache
>> Solr
>>> Bremen, Germany
>>>
>> https://urldefense.com/v3/__https://lucene.apache.org/__;!!GqivPVa7Brio!LHp
>> h3VUPiAkTLYY0J9EGarZ8JCdW5uOOekM2xprUEK_KLUIPmLyCyhxRmsJh5viv78S
>> Jz9E$
>> https://urldefense.com/v3/__https://solr.apache.org/__;!!GqivPVa7Brio!LHph3
>> VUPiAkTLYY0J9EGarZ8JCdW5uOOekM2xprUEK_KLUIPmLyCyhxRmsJh5vivenmaL
>> YI$
>>>> -----Original Message-----
>>>> From: Maurizio Cimadamore <maurizio.cimadamore at oracle.com>
>>>> Sent: Thursday, April 1, 2021 2:36 PM
>>>> To: Uwe Schindler <uschindler at apache.org>; 'leerho' <leerho at gmail.com>;
>>>> panama-dev at openjdk.java.net
>>>> Subject: Re: Memory Segment efficient array handling
>>>>
>>>> I re-read the Lucene/Solr patch to support segments, and one thing
>>>> jumped out: in routines like readLEFloats/Longs, it seems like we do a
>>>> bulk copy if the endianness matches, but we do a loop copy if the endianness
>>>> doesn't match.
>>>>
>>>> Reading from the ByteBufferInput impl, it doesn't seem to me that the
>>>> impl is ever falling back onto a regular loop.
>>>>
>>>> https://urldefense.com/v3/__https://github.com/apache/lucene-
>> __;!!GqivPVa7Brio!LHph3VUPiAkTLYY0J9EGarZ8JCdW5uOOekM2xprUEK_KLUIP
>> mLyCyhxRmsJh5vivqhwc_nk$
>> solr/blob/d2c0be5a83f8a985710e1ffbadabc70e82c54fb1/lucene/core/src/java
>>>> /org/apache/lucene/store/ByteBufferIndexInput.java#L168
>>>>
>>>> E.g. it seems you adjust the endianness on the buffer and then use a
>>>> bulk copy.
>>>>
>>>> In other words, there might be a performance advantage in having the
>>>> bulk copy methods in MemoryAccess - that is, we can take an endianness
>>>> parameter, and copy in bulk with swap (memory segment, internally, has
>>>> the ability to copy bulk with swap, like Unsafe.copySwapMemory).
>>>>
>>>> That said, I don't think this is the root cause of the perf issues you
>>>> are seeing, since readLongs is always doing a loop (even in the buffer
>>>> world), and readLELongs should do a bulk copy most of the time (I assume
>>>> you ran the bench on a LE platform).
>>>>
>>>> Maurizio
>>>>
>>>>
>>>> On 01/04/2021 13:05, Maurizio Cimadamore wrote:
>>>>> On 01/04/2021 12:48, Uwe Schindler wrote:
>>>>>> In our investigations, we also see some slowdown in contrast to our
>>>>>> ByteBuffer implementation. It is not yet clear if it comes from loops
>>>>>> over long instead of ints or if it is caused by the number of object
>>>>>> allocations.
>>>>> It would be helpful if we could narrow this down. I suppose you refer
>>>>> to the benchmark regressions here:
>>>>>
>>>>> https://urldefense.com/v3/__https://github.com/apache/lucene-
>> solr/pull/2176*issuecomment-
>> 758175143__;Iw!!GqivPVa7Brio!LHph3VUPiAkTLYY0J9EGarZ8JCdW5uOOekM2
>> xprUEK_KLUIPmLyCyhxRmsJh5vivjSTNCYk$
>>>>> Which are probably not related to the issue of bulk copying.
>>>>>
>>>>> See my other email: having better MemoryAccess routines for bulk
>>>>> copying is mostly a usability thing. There's nothing to suggest that
>>>>> a straight unsafe call is faster than slicing and calling copyFrom, so
>>>>> I wouldn't look there to explain performance differences.
>>>>>
>>>>> Maurizio
>>>>>

