Using MemorySegment::byteSize as a loop bound is not being hoisted
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Fri Jun 27 13:23:03 UTC 2025
Hi Chris,
your benchmark seems to point more at a performance difference
between array and memory segment vector loads.
The memory segment load has to perform an additional liveness check, but
I doubt this is your issue here.
This leaves the bound check difference -- the array load uses the int
version of the Objects.checkIndex, whereas the segment load uses the
long version of that. Both are intrinsified, so they should work
correctly, but maybe there's something going on.
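
For reference, a minimal sketch of the two Objects.checkIndex overloads
mentioned above. The class and method names (CheckIndexSketch, checkInt,
checkLong, inBounds) are made up for illustration; this only shows the
int and long signatures, not how the Vector API calls them internally.

```java
import java.util.Objects;

// Hypothetical helper class, for illustration only.
class CheckIndexSketch {

    // int overload: the form described above for array-based loads.
    static int checkInt(int index, int length) {
        return Objects.checkIndex(index, length);
    }

    // long overload (available since Java 16): the form described above
    // for memory segment loads, where offsets and sizes are long.
    static long checkLong(long index, long length) {
        return Objects.checkIndex(index, length);
    }

    // Both overloads throw IndexOutOfBoundsException unless
    // 0 <= index < length; this wrapper turns that into a boolean.
    static boolean inBounds(long index, long length) {
        try {
            Objects.checkIndex(index, length);
            return true;
        } catch (IndexOutOfBoundsException e) {
            return false;
        }
    }
}
```

Both overloads return the index unchanged when it is in range, so the
only difference visible at the Java level is the operand width.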
Have you tried using this:
-Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=0
This should completely disable the bound check from the load operation.
It would be interesting to know if that eliminates the performance delta.
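
When a benchmark harness forks its own JVMs, it is easy to pass a -D
option to the wrong process. A minimal sketch for confirming that the
property above is visible inside the JVM that runs the benchmark; the
class and method names (OobCheckFlagSketch, describe) are hypothetical,
and the code only reads the system property, it does not inspect what
the Vector API does with it.

```java
// Hypothetical helper class, for illustration only.
class OobCheckFlagSketch {

    // Property name taken verbatim from the suggestion above.
    static final String PROP = "jdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK";

    // Reports the value this JVM was started with, or notes that the
    // property was not passed at all (in which case the JDK default applies).
    static String describe() {
        String value = System.getProperty(PROP);
        return value == null
                ? PROP + " is not set"
                : PROP + "=" + value;
    }
}
```

Printing the result of describe() from a benchmark setup method would
show whether the option reached the forked JVM.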
Maurizio
On 27/06/2025 11:28, Chris Hegarty wrote:
> Hi,
>
> I've been rewriting parts of our codebase which currently uses the
> Panama Vector API to provide optimised distance comparison functions
> for vector search algorithms. We previously used float[]s, which
> necessitate a copy from our off-heap storage into the heap, so we simply
> want to use MemorySegments to avoid this, since our stored vectors
> are in a file on disk and mmap'ed.
>
> I see that using MemorySegment::byteSize as a loop bound is not as
> optimised as it could be. The bound is not getting hoisted out of the
> loop body, where it does when using array length.
>
> I created a minimal jmh benchmark that demonstrates what I see
> (some assumptions are made about unrolling and tail avoidance for
> simplicity):
> https://github.com/ChrisHegarty/memseg-vector-bench/tree/main
>
> Example output
> Benchmark                               Mode  Cnt   Score   Error  Units
> MemorySegmentBench.dotProductArray      avgt   20  61.154 ± 0.266  ns/op
> MemorySegmentBench.dotProductHeapSeg    avgt   20  98.806 ± 3.143  ns/op
> MemorySegmentBench.dotProductNativeSeg  avgt   20  95.282 ± 0.356  ns/op
>
> I would have expected memory segment to perform better than this, but
> maybe this is just not optimised yet on AArch64 (I've not tried x64
> yet). OR I'm doing something wrong?
>
> For now, I'm working around this by writing my own native
> implementation and linking through FFI, but this is quite a bit of
> effort just to avoid this bug. And my native implementation only gets
> me back to the array perf.
>
> -Chris.
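
The loop-bound pattern discussed in the quoted message can be sketched
in scalar form as follows. This is not the code from the linked
benchmark (which uses the Vector API); the class and method names are
hypothetical, it assumes a JDK where java.lang.foreign is a final API
(Java 22 or later), and whether reading byteSize() into a local changes
the generated code is not established in this thread.

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Hypothetical scalar illustration of the three loop shapes.
class LoopBoundSketch {

    // Array version: a.length is the loop bound.
    static float dotArray(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // Segment version: byteSize() is evaluated in the loop condition.
    static float dotSegment(MemorySegment a, MemorySegment b) {
        float sum = 0f;
        for (long off = 0; off < a.byteSize(); off += Float.BYTES) {
            sum += a.get(ValueLayout.JAVA_FLOAT, off)
                 * b.get(ValueLayout.JAVA_FLOAT, off);
        }
        return sum;
    }

    // Same loop with the bound read once into a local before the loop.
    static float dotSegmentHoisted(MemorySegment a, MemorySegment b) {
        final long size = a.byteSize();
        float sum = 0f;
        for (long off = 0; off < size; off += Float.BYTES) {
            sum += a.get(ValueLayout.JAVA_FLOAT, off)
                 * b.get(ValueLayout.JAVA_FLOAT, off);
        }
        return sum;
    }
}
```

All three methods compute the same dot product; for example, wrapping
two float[] inputs with MemorySegment.ofArray gives heap segments that
can be passed to the segment variants.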