Memory Segment efficient array handling
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Tue Jun 15 09:55:34 UTC 2021
Hi Uwe,
I have been keeping an eye on that.
You report that there's lots of heap allocation - but that's with JFR
enabled, which also instantiates its own events. Do you have a run w/o
JFR - is the allocation rate the same w/ and w/o JFR?
I was about to ask if replacing the memory segment copy with a plain
Unsafe::copyMemory call worked, but it seems you have done that.
The problem could be caused by lack of inlining (as you suggest), or by
profile pollution (in case the same copy routine is called with a byte[]
memory segment, an int[] memory segment, or an off-heap memory
segment), or both.
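
To illustrate the profile pollution point: if a shared helper like the
hypothetical one below (names made up, incubator API as of JDK 16) is
reached with several different segment implementations, the type profile
at the copyFrom call site becomes polymorphic, and C2 may stop inlining
through it:

    import jdk.incubator.foreign.MemorySegment;

    class ProfilePollution {
        // one call site, several receiver implementations
        static void copyInto(MemorySegment dst, MemorySegment src, long len) {
            dst.asSlice(0, len).copyFrom(src.asSlice(0, len));
        }

        public static void main(String[] args) {
            MemorySegment src = MemorySegment.ofArray(new byte[64]);
            copyInto(MemorySegment.ofArray(new byte[64]), src, 64); // heap byte[] segment
            copyInto(MemorySegment.ofArray(new int[16]), src, 64);  // heap int[] segment
            copyInto(MemorySegment.allocateNative(64), src, 64);    // native segment
        }
    }
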
Have you tried using the wrapper methods we're experimenting with in:
https://github.com/openjdk/panama-foreign/pull/555
It's still raw, and I have not yet added the bits which take care of
profile pollution, but all the copy routines are wrapped with
@ForceInline. Can you please verify whether this helps (or provide a JMH
benchmark which we can try on our end)?
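
If it helps as a starting point, a benchmark along these lines would let
us compare the two paths directly (rough sketch, names made up; run with
--add-modules jdk.incubator.foreign):

    import java.lang.reflect.Field;
    import java.util.concurrent.TimeUnit;
    import jdk.incubator.foreign.MemorySegment;
    import org.openjdk.jmh.annotations.*;
    import sun.misc.Unsafe;

    @State(Scope.Thread)
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    public class SegmentCopyBench {
        static final Unsafe U;
        static {
            try {
                Field f = Unsafe.class.getDeclaredField("theUnsafe");
                f.setAccessible(true);
                U = (Unsafe) f.get(null);
            } catch (ReflectiveOperationException e) {
                throw new ExceptionInInitializerError(e);
            }
        }

        byte[] srcArr = new byte[4096];
        MemorySegment srcSeg = MemorySegment.ofArray(srcArr);
        byte[] dst = new byte[1024];

        @Benchmark
        public byte[] segmentCopy() {
            // wrap + slice + copyFrom, as in the Lucene patch
            MemorySegment.ofArray(dst).copyFrom(srcSeg.asSlice(1024, dst.length));
            return dst;
        }

        @Benchmark
        public byte[] unsafeCopy() {
            // straight bulk copy, no intermediate segment objects
            U.copyMemory(srcArr, Unsafe.ARRAY_BYTE_BASE_OFFSET + 1024,
                         dst, Unsafe.ARRAY_BYTE_BASE_OFFSET, dst.length);
            return dst;
        }
    }
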
Maurizio
On 15/06/2021 10:25, Uwe Schindler wrote:
> Hi Maurizio,
>
> I spent a lot of time analyzing the problem. It is indeed related to the wrapping of heap arrays, slicing and so on. I opened a bug report:
> https://bugs.openjdk.java.net/browse/JDK-8268743
>
> So please think about adding an API which is highly optimized for bulk copying slices between classical on-heap arrays and MemorySegments! It looks like escape analysis does not work, and during our tests the heap was filled with millions of HeapMemorySegment#OfByte slices! Performance degraded significantly, especially due to garbage collection.
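>
> For illustration, the copy path in question boils down to something like this (simplified; variable names here are made up, not the exact Lucene code):
>
>     MemorySegment.ofArray(b)                       // allocates a HeapMemorySegment.OfByte
>         .asSlice(offset, len)                      // slice instance #1
>         .copyFrom(curSegment.asSlice(pos, len));   // slice instance #2
>
> So every call allocates a wrapper plus two slices, none of which appear to get scalar-replaced.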
>
> Long indexes in for-loops do not seem to be the issue here. We proved that replacing the wrap-byte-array, slice, copyFrom code with Unsafe.copyMemory solves the issue, and Lucene's new memory-mapping implementation now behaves similarly to the old MappedByteBuffer code. (Mapped)ByteBuffer has get(byte[], offset, length), which is missing for memory segments, and that's the reason for our pain!
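>
> The Unsafe replacement is essentially the following (sketch; assumes a sun.misc.Unsafe instance UNSAFE obtained via the usual theUnsafe reflection trick, and an off-heap source segment):
>
>     long srcAddr = curSegment.address().toRawLongValue();
>     UNSAFE.copyMemory(null, srcAddr + pos,                      // off-heap source
>             b, Unsafe.ARRAY_BYTE_BASE_OFFSET + offset, len);    // on-heap byte[] target
>
> No wrapper or slice objects are allocated on this path.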
>
> You can see the discussion on our latest pull request for JDK 17: https://github.com/apache/lucene/pull/177
>
> Uwe
>
> -----
> Uwe Schindler
> uschindler at apache.org
> ASF Member, Member of PMC and Committer of Apache Lucene and Apache Solr
> Bremen, Germany
> https://lucene.apache.org/
> https://solr.apache.org/
>
>> -----Original Message-----
>> From: Maurizio Cimadamore <maurizio.cimadamore at oracle.com>
>> Sent: Thursday, April 1, 2021 2:36 PM
>> To: Uwe Schindler <uschindler at apache.org>; 'leerho' <leerho at gmail.com>;
>> panama-dev at openjdk.java.net
>> Subject: Re: Memory Segment efficient array handling
>>
>> I re-read the Lucene/Solr patch to support segments, and one thing
>> jumped out: in routines like readLEFloats/Longs, it seems like we do a
>> bulk copy if the endianness matches, but a loop copy if it doesn't.
>>
>> Reading the ByteBufferIndexInput impl, it doesn't seem to me that the
>> impl ever falls back onto a regular loop.
>>
>> https://github.com/apache/lucene-solr/blob/d2c0be5a83f8a985710e1ffbadabc70e82c54fb1/lucene/core/src/java/org/apache/lucene/store/ByteBufferIndexInput.java#L168
>>
>> E.g. it seems you adjust the endianness on the buffer and then use a
>> bulk copy.
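>>
>> I.e., something like this, I believe (simplified):
>>
>>     bb.order(ByteOrder.LITTLE_ENDIAN)   // set the buffer's byte order once
>>       .asLongBuffer()
>>       .get(dst, 0, count);              // bulk get, swapping on BE platforms as needed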
>>
>> In other words, there might be a performance advantage in having the
>> bulk copy methods in MemoryAccess, in that they could take an endianness
>> parameter and copy in bulk with a swap (memory segment, internally, has
>> the ability to do bulk copies with swap, like Unsafe.copySwapMemory).
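>>
>> A reference version of such a method could look like the sketch below
>> (hypothetical signature, not in the API today; a real implementation
>> would presumably bottom out in Unsafe.copySwapMemory rather than a
>> swap loop):
>>
>>     static void copyToLongArray(MemorySegment src, long srcOffset,
>>                                 long[] dst, int dstIndex, int count,
>>                                 ByteOrder order) {
>>         long bytes = (long) count * Long.BYTES;
>>         MemorySegment.ofArray(dst)
>>             .asSlice((long) dstIndex * Long.BYTES, bytes)
>>             .copyFrom(src.asSlice(srcOffset, bytes));   // raw bulk copy
>>         if (order != ByteOrder.nativeOrder()) {         // fix up endianness
>>             for (int i = 0; i < count; i++) {
>>                 dst[dstIndex + i] = Long.reverseBytes(dst[dstIndex + i]);
>>             }
>>         }
>>     }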
>>
>> That said, I don't think this is the root cause of the perf issues you
>> are seeing, since readLongs always does a loop (even in the buffer
>> world), and readLELongs should do a bulk copy most of the time (I assume
>> you ran the bench on an LE platform).
>>
>> Maurizio
>>
>>
>> On 01/04/2021 13:05, Maurizio Cimadamore wrote:
>>> On 01/04/2021 12:48, Uwe Schindler wrote:
>>>> In our investigations, we also see some slowdown in contrast to our
>>>> ByteBuffer implementation. It is not yet clear if it comes from loops
>>>> over long instead of ints or if it is caused by the number of object
>>>> allocations.
>>> It would be helpful if we could narrow this down. I suppose you refer
>>> to the benchmark regressions here:
>>>
>>> https://github.com/apache/lucene-solr/pull/2176#issuecomment-758175143
>>>
>>> Which are probably not related to the issue of bulk copying.
>>>
>>> See my other email: having better MemoryAccess routines for bulk
>>> copying is mostly a usability thing. There's nothing to suggest that
>>> a straight unsafe call is faster than slicing and calling copyFrom, so
>>> I wouldn't look there to explain performance differences.
>>>
>>> Maurizio
>>>