Memory Segment efficient array handling

Uwe Schindler uschindler at apache.org
Tue Jun 15 10:28:05 UTC 2021


Hi,

> I have been keeping an eye on that.

Thanks :-)

> You report that there's lots of heap allocation - but that's with JFR
> enabled, which also instantiates its own events. Do you have a run w/o
> JFR - is the allocation rate the same w/ and w/o JFR?

I don't know the heap allocation rate without JFR, but if you look at the statistics printed in the issue, the top allocations are in the HeapMemorySegment$OfByte code and in MemorySegment#dup() -- which is of course related to the slices we have to create in our copy between heap and off-heap memory.

The JFR feature was only recently added to Mike McCandless' Lucene benchmark framework (which unfortunately does not yet use JMH behind the scenes). It is very complicated to create an independent benchmark showing our problem with profile pollution, because it may have to do with the complexity of the code paths that call those I/O methods (which are the hotspot in Lucene anyway): every single query execution calls the MemorySegmentIndexInput methods millions of times, sometimes sequential reads of different data types, then bulk reads of byte[], long[], or recently also float[] for the vector stuff.

This is why I'd like to suggest providing those easy-to-use memory copy methods in MemoryAccess (looking similar to mine), but with inlining strictly enforced. The lack of memory copy methods from/to plain Java arrays, like the ones in the nio.*Buffer classes, is really felt.
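For reference, these are the kinds of ByteBuffer bulk copies being referred to above -- a single call straight into a plain Java array, with no wrapper objects. This is a standalone sketch (class name and sample values are made up), not Lucene code:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

// The MemorySegment API (as of JDK 16) has no direct equivalent of these
// array-plus-offset-plus-length bulk reads; it requires wrap + slice + copyFrom.
public class BulkCopyDemo {
    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocateDirect(16).order(ByteOrder.LITTLE_ENDIAN);
        for (byte i = 0; i < 16; i++) buf.put(i);
        buf.flip();

        // Bulk read into a byte[] slice -- one call, no intermediate objects.
        byte[] bytes = new byte[8];
        buf.get(bytes, 0, 8);
        System.out.println("bytes=" + Arrays.toString(bytes));

        // Bulk read into a long[] via a typed view, honoring the buffer's byte order.
        long[] longs = new long[1];
        buf.asLongBuffer().get(longs, 0, 1);
        System.out.println("long=" + Long.toHexString(longs[0]));
    }
}
```

The second read decodes bytes 8..15 as one little-endian long, which is exactly the pattern MemorySegmentIndexInput needs for its long[]/float[] bulk reads.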

> I was about to ask if replacing the memory segment copy with a plain
> Unsafe::copyMemory call worked, but it seems you have done that.

Yes, this made the problem go away completely; after that the heap statistics looked identical to our previous code -- with and without JFR! I spent yesterday evening on that. I will update the pull request to include the code we used, too.
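A minimal sketch of this kind of Unsafe.copyMemory workaround (illustrative only, not the actual patch code; the class name and sample data are made up): the copy goes from an off-heap address straight into a heap byte[], so no HeapMemorySegment$OfByte slices are allocated on the way.

```java
import java.lang.reflect.Field;
import java.util.Arrays;

public class UnsafeCopyDemo {
    public static void main(String[] args) throws Exception {
        // Obtain sun.misc.Unsafe via the usual reflection trick.
        Field f = sun.misc.Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        sun.misc.Unsafe unsafe = (sun.misc.Unsafe) f.get(null);

        // Simulate an off-heap (e.g. memory-mapped) region.
        long addr = unsafe.allocateMemory(8);
        for (int i = 0; i < 8; i++) unsafe.putByte(addr + i, (byte) (i + 1));

        // One native copy call; null srcBase means "absolute address".
        byte[] dst = new byte[8];
        unsafe.copyMemory(null, addr, dst, sun.misc.Unsafe.ARRAY_BYTE_BASE_OFFSET, 8);
        unsafe.freeMemory(addr);

        System.out.println("dst=" + Arrays.toString(dst));
    }
}
```

Unlike the segment-based copy, nothing here escapes to the heap except the destination array itself, which matches the "heap statistics looked identical" observation.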

> The problem could be caused by lack of inlining (as you suggest) - or
> profile pollution (in case same copy routine is called with a byte[]
> memory segment and an int[] memory segment, or an off-heap memory
> segment), or both.

In our case, we always pass a MappedMemorySegment to the above method. I also tried reshuffling the methods, but no success: whenever MemorySegment#ofArray() was involved, the heap allocations happened. For byte[] and long[] and float[] we have separate code paths all with sharp types.

> Have you tried using the wrapper methods we're experimenting with in:
> 
> https://github.com/openjdk/panama-foreign/pull/555
> 
> It's still raw, and I have not added in the bits which take care of
> profile pollution - but all the copy routines are wrapped with
> @ForceInline - can you please verify that this helps (or provide a JMH
> benchmark which we can try on our end) ?

Hi, no, I haven't done so yet, because I have had no time to compile a custom JDK (I am under time pressure, because tomorrow I will present the results in a talk at the BerlinBuzzwords conference: https://2021.berlinbuzzwords.de/session/future-lucenes-mmapdirectory-why-use-it-and-whats-coming-java-16-and-later).

But what I don't understand about this pull request: how do I pass in plain byte/long/float arrays without slicing and wrapping? Or is this just about our byte order issue? That is, we define a memory layout for the MemorySegment that uses e.g. little endian, and then we call the new copy method, which makes sure that on a big endian platform all bytes are swapped?
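To illustrate the byte-order part of the question: this is how the existing ByteBuffer API already handles the swap, which is presumably the behavior a copy-with-endianness method would mirror (standalone sketch, made-up values):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class EndianDemo {
    public static void main(String[] args) {
        // 0x12345678 stored in little-endian byte order, as on disk in Lucene's case.
        byte[] leData = {0x78, 0x56, 0x34, 0x12};

        // Declaring the source order makes decoding correct on every platform;
        // on a big-endian machine the bytes are swapped during the read.
        int asLE = ByteBuffer.wrap(leData).order(ByteOrder.LITTLE_ENDIAN).getInt();
        int asBE = ByteBuffer.wrap(leData).order(ByteOrder.BIG_ENDIAN).getInt();

        System.out.println("LE=" + Integer.toHexString(asLE));
        System.out.println("BE=" + Integer.toHexString(asBE));
    }
}
```

The point is that the caller only states the order the data was written in; whether an actual swap happens is decided by the platform's native order.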

Thanks, 
Uwe

> On 15/06/2021 10:25, Uwe Schindler wrote:
> > Hi Maurizio,
> >
> > I spent a lot of time analyzing the problem. It is indeed related to the
> > wrapping of heap arrays, slicing and so on. I opened a bug report:
> > https://bugs.openjdk.java.net/browse/JDK-8268743
> >
> > So please think about adding an API which is highly optimized to bulk copy
> > slices between classical on-heap arrays and MemorySegments! It looks like
> > escape analysis does not work: during our test, the heap was filled with
> > millions of HeapMemorySegment$OfByte slices! Performance degraded
> > significantly, especially due to garbage collection.
> >
> > Long indexes in for-loops seem not to be the issue here. We proved that
> > replacing the wrap-byte-array, slice, copyFrom code with Unsafe.copyMemory
> > solves the issue, and we have Lucene's new memory mapping implementation
> > behave similar to the old MappedByteBuffer code. (Mapped)ByteBuffer has
> > get(byte[], offset, length), which is missing for memory segments, and that's
> > the reason for our pain!
> >
> > You can see the discussion on our latest pull request for JDK 17:
> > https://github.com/apache/lucene/pull/177
> >
> > Uwe
> >
> > -----
> > Uwe Schindler
> > uschindler at apache.org
> > ASF Member, Member of PMC and Committer of Apache Lucene and Apache Solr
> > Bremen, Germany
> >
> > https://lucene.apache.org/
> >
> > https://solr.apache.org/
> >
> >> -----Original Message-----
> >> From: Maurizio Cimadamore <maurizio.cimadamore at oracle.com>
> >> Sent: Thursday, April 1, 2021 2:36 PM
> >> To: Uwe Schindler <uschindler at apache.org>; 'leerho' <leerho at gmail.com>;
> >> panama-dev at openjdk.java.net
> >> Subject: Re: Memory Segment efficient array handling
> >>
> >> I re-read the Lucene/Solr patch to support segments, and one thing
> >> jumped out: in routines like readLEFloats/Longs, it seems like we do a
> >> bulk copy if endianness match, but we do a loop copy if endianness
> >> doesn't match.
> >>
> >> Reading from the ByteBufferInput impl, it doesn't seem to me that the
> >> impl is ever falling back onto a regular loop.
> >>
> >> https://github.com/apache/lucene-solr/blob/d2c0be5a83f8a985710e1ffbadabc70e82c54fb1/lucene/core/src/java/org/apache/lucene/store/ByteBufferIndexInput.java#L168
> >>
> >> E.g. it seems you adjust the endianness on the buffer and then use a
> >> bulk copy.
> >>
> >> In other words, there might be a performance advantage in having the
> >> bulk copy methods in MemoryAccess - which is we can take an endianness
> >> parameter, and copy in bulk with swap (memory segment, internally, has
> >> the ability to copy bulk with swap, like Unsafe.copySwapMemory).
> >>
> >> That said, I don't think this is the root cause of the perf issues you
> >> are seeing, since readLongs is always doing a loop (even in the buffer
> >> world), and readLELongs should do bulk copy most of the times (I assume
> >> you ran the bench on a LE platform).
> >>
> >> Maurizio
> >>
> >>
> >> On 01/04/2021 13:05, Maurizio Cimadamore wrote:
> >>> On 01/04/2021 12:48, Uwe Schindler wrote:
> >>>> In our investigations, we also see some slowdown in contrast to our
> >>>> ByteBuffer implementation. It is not yet clear if it comes from loops
> >>>> over long instead of ints or if it is caused by the number of object
> >>>> allocations.
> >>> It would be helpful if we could narrow this down. I suppose you refer
> >>> to the benchmark regressions here:
> >>>
> >>> https://github.com/apache/lucene-solr/pull/2176#issuecomment-758175143
> >>>
> >>> Which are probably not related to the issue of bulk copying.
> >>>
> >>> See my other email: having better MemoryAccess routines for bulk
> >>> copying is mostly a usability thing. There's nothing to suggest that
> >>> a straight unsafe call is faster than slicing and calling copyFrom, so
> >>> I wouldn't look there to explain performance differences.
> >>>
> >>> Maurizio
> >>>


