[foreign-memaccess+abi] RFR: 8268743: Require a better way for copying data between MemorySegments and on-heap arrays

Tue Jun 15 18:00:52 UTC 2021

On Tue, 15 Jun 2021 13:05:28 GMT, Uwe Schindler <uschindler at openjdk.org> wrote:

>>> But we have no solution for larger sizes like float[] or long[] arrays?
>> 
>> Not for now - the real solution is to fix the performance woes which make the implementation do all these workarounds for longs vs. ints. That said, I'd be curious, once this is integrated, if you could do a quick validation and re-run that benchmark where you were seeing lots of byte segment wrappers in the heap and make sure at least that behaves as expected.
> 
> For sure.
> 
> @rmuir and I already discussed that. From what we understand: `readLong(long[], offset, count)` in our case never reads more than 64 longs because of the limitations of the PFOR algorithm inside Lucene. So we would remove the specialization in MemorySegmentIndexInput and the superclass will use a loop of readLong() instead. This is also in line with your hint about the liveness-check overhead of copyMemory. The readFloat variant we have reads a maximum of 1024 float dimensions for our vector support. I will do some quick investigations later, but I tend to remove that specialization, too. In future we will use FloatVector from the vector API, and that should possibly be wrapped over the memory segment (see also https://github.com/apache/lucene/pull/18 for some quick investigations). For the longs and the PFOR algorithm we may also use a vectorized approach in future, so readFloat() and readLongs() will go away and be replaced by methods that return a FloatVector/LongVector view on the memory segment (using a ByteBuffer view, or directly once panama and vector can share APIs).
> 
> The biggest problem on our side is our implementation of `readBytes(byte[], offset, count)` which is fixed here.

Adding a static method which avoids all the slicing doesn't seem to help performance-wise. It is still 20-25% slower than an Unsafe call. If it would help narrow things down, we could try adding such a static method at least in the Panama repo, and maybe we could test that with Lucene? At least we would have taken allocations out of the equation (but I suspect that allocation is not where the performance slowdown is coming from). @uschindler let me know what would be the easiest for you to try.
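
For illustration only, here is a minimal sketch of the two copy styles being compared: the slicing approach and a static bulk copy. It is written against the finalized `java.lang.foreign` API (Java 22 or later), not the incubator API this thread was using, so the exact method shapes are an assumption relative to what the PR proposes. The class and method names (`SegmentCopySketch`, `copyViaSlices`, `copyViaStatic`) are hypothetical, and the sketch makes no performance claim.

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.util.Arrays;

// Hypothetical sketch class; not code from the PR or from Lucene.
public class SegmentCopySketch {

    // Slicing approach: wraps the array in a heap segment and creates
    // one slice of the destination and one slice of the source per call.
    public static void copyViaSlices(MemorySegment src, long srcOffset,
                                     byte[] dst, int dstOffset, int len) {
        MemorySegment.ofArray(dst)
                .asSlice(dstOffset, len)
                .copyFrom(src.asSlice(srcOffset, len));
    }

    // Static approach: a single call that takes the array directly,
    // so the caller creates no wrapper or slice objects.
    public static void copyViaStatic(MemorySegment src, long srcOffset,
                                     byte[] dst, int dstOffset, int len) {
        MemorySegment.copy(src, ValueLayout.JAVA_BYTE, srcOffset, dst, dstOffset, len);
    }

    public static void main(String[] args) {
        // A heap segment stands in for the mapped file an index input would read.
        MemorySegment src = MemorySegment.ofArray(new byte[] {10, 11, 12, 13, 14, 15});
        byte[] a = new byte[8];
        byte[] b = new byte[8];
        copyViaSlices(src, 2, a, 1, 3);
        copyViaStatic(src, 2, b, 1, 3);
        System.out.println(Arrays.toString(a));
        System.out.println(Arrays.toString(b));
    }
}
```

Both methods leave the same bytes in the destination array; they differ only in how many intermediate segment objects the caller creates, which is the allocation question raised above.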

-------------

PR: https://git.openjdk.java.net/panama-foreign/pull/560