Memory Segment efficient array handling

Uwe Schindler uschindler at apache.org
Thu Apr 1 15:32:54 UTC 2021


Hi Maurizio,

I agree with your analysis. In our internal discussion with Robert Muir we had already figured out that the problem is more likely the long indexes and not the wrapping. My main point was that the verbosity of the code is horrible.

Thanks for the hint about the endianness. I would really like to see a method that allows bulk copying between MemorySegments and arrays, with explicit endianness.
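
For illustration, this is how the only bulk path looks today, as far as I understand the JDK 16 incubator API (jdk.incubator.foreign): wrap the heap array in a segment and use copyFrom(), which takes no ByteOrder. The commented-out overload at the end is the hypothetical method I would like to see, not an existing API:

import jdk.incubator.foreign.MemorySegment;

final class BulkCopySketch {
  // What works today: wrap the destination array in a heap segment and
  // bulk-copy into it. There is no ByteOrder parameter, so this is only
  // correct when the file's byte order matches the native byte order.
  static void copyToArray(MemorySegment src, long srcOffset,
                          long[] dst, int dstOff, int len) {
    MemorySegment dstSeg = MemorySegment.ofArray(dst);
    dstSeg.asSlice((long) dstOff * Long.BYTES, (long) len * Long.BYTES)
          .copyFrom(src.asSlice(srcOffset, (long) len * Long.BYTES));
  }

  // Hypothetical convenience I would like to see (NOT an existing API):
  // static void copyToArray(MemorySegment src, long srcOffset,
  //                         long[] dst, int dstOff, int len, ByteOrder order);
}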

> That said, I don't think this is the root cause of the perf issues you
> are seeing, since readLongs is always doing a loop (even in the buffer
> world), and readLELongs should do bulk copy most of the times (I assume
> you ran the bench on a LE platform).

Yes, no issue there. Lucene recently started to change its file formats to be fully little endian, because Lucene/Solr/Elasticsearch mostly run on LE machines. We only do performance tests on LE machines at the moment.

The "super.readLELongs()" call is exactly there as fall back to loop copy code. That's part for improvement (see above). ByteBuffer is better here.

FYI, I am planning to have another version of Lucene's MMapDirectory for performance testing, which uses MappedMemorySegment just to do the mapping (so we have a single huge mapping instead of many mmaps of size 1 GiB each, as we do with MappedByteBuffer). The code would use the good old ByteBufferIndexInput (without the ByteBufferGuard), created by slicing the huge (16 GiB) MappedMemorySegment into 1 GiB slices. The performance should then be identical to the current Lucene code; the only improvement would be that it is "safe" (no crashes, because we can safely unmap). So my plan, until the performance issues are understood, would be:

Old code: Map the (huge) file into many MappedByteBuffers of 1 GiB each using FileChannel.map(). Problems: many file mappings (the Linux kernel may complain because of sysctl vm.max_map_count); no clean way to unmap -> risk of SIGBUS/SIGSEGV.
New code: Map the huge file into a few MappedMemorySegments (not too large, to prevent fragmentation issues), each sliced using MemorySegment.asSlice(...).asByteBuffer(). Access speed should be identical to the current code; we just need fewer mappings in the kernel and get safe unmapping (see the sketch below).
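
A rough sketch of the new approach (against the JDK 16 incubator API; the helper class here is hypothetical, and a real implementation would keep the mapped segment around so it can be closed to unmap):

import jdk.incubator.foreign.MemorySegment;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;

final class MMapSketch {
  private static final long CHUNK = 1L << 30; // 1 GiB per ByteBuffer slice

  // Map the whole file with a single mmap call, then slice the segment
  // into ByteBuffers that feed the existing ByteBufferIndexInput.
  static ByteBuffer[] map(Path path) throws IOException {
    long size = Files.size(path);
    MemorySegment mapped = MemorySegment
        .mapFile(path, 0L, size, FileChannel.MapMode.READ_ONLY)
        .share(); // allow access from multiple searcher threads
    int nChunks = (int) ((size + CHUNK - 1) / CHUNK);
    ByteBuffer[] buffers = new ByteBuffer[nChunks];
    for (int i = 0; i < nChunks; i++) {
      long off = (long) i * CHUNK;
      buffers[i] = mapped.asSlice(off, Math.min(CHUNK, size - off)).asByteBuffer();
    }
    return buffers; // closing 'mapped' later unmaps everything safely
  }
}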

Uwe

-----
Uwe Schindler
uschindler at apache.org 
ASF Member, Member of PMC and Committer of Apache Lucene and Apache Solr
Bremen, Germany
https://lucene.apache.org/
https://solr.apache.org/

> -----Original Message-----
> From: Maurizio Cimadamore <maurizio.cimadamore at oracle.com>
> Sent: Thursday, April 1, 2021 2:36 PM
> To: Uwe Schindler <uschindler at apache.org>; 'leerho' <leerho at gmail.com>;
> panama-dev at openjdk.java.net
> Subject: Re: Memory Segment efficient array handling
> 
> I re-read the Lucene/Solr patch to support segments, and one thing
> jumped out: in routines like readLEFloats/Longs, it seems like we do a
> bulk copy if endianness match, but we do a loop copy if endianness
> doesn't match.
> 
> Reading from the ByteBufferInput impl, it doesn't seem to me that the
> impl is ever falling back onto a regular loop.
> 
> https://github.com/apache/lucene-
> solr/blob/d2c0be5a83f8a985710e1ffbadabc70e82c54fb1/lucene/core/src/java
> /org/apache/lucene/store/ByteBufferIndexInput.java#L168
> 
> E.g. it seems you adjust the endianness on the buffer and then use a
> bulk copy.
> 
> In other words, there might be a performance advantage in having the
> bulk copy methods in MemoryAccess - that is, we can take an endianness
> parameter, and copy in bulk with swap (memory segment, internally, has
> the ability to copy in bulk with swap, like Unsafe.copySwapMemory).
> 
> That said, I don't think this is the root cause of the perf issues you
> are seeing, since readLongs is always doing a loop (even in the buffer
> world), and readLELongs should do bulk copy most of the times (I assume
> you ran the bench on a LE platform).
> 
> Maurizio
> 
> 
> On 01/04/2021 13:05, Maurizio Cimadamore wrote:
> >
> > On 01/04/2021 12:48, Uwe Schindler wrote:
> >> In our investigations, we also see some slowdown in contrast to our
> >> ByteBuffer implementation. It is not yet clear if it comes from loops
> >> over long instead of ints or if it is caused by the number of object
> >> allocations.
> >
> > It would be helpful if we could narrow this down. I suppose you refer
> > to the benchmark regressions here:
> >
> > https://github.com/apache/lucene-solr/pull/2176#issuecomment-758175143
> >
> > Which are probably not related to the issue of bulk copying.
> >
> > See my other email: having better MemoryAccess routines for bulk
> > copying is mostly a usability thing. There's nothing to suggest that
> > a straight unsafe call is faster than slicing and calling copyFrom, so
> > I wouldn't look there to explain performance differences.
> >
> > Maurizio
> >


