Using MemorySegment::byteSize as a loop bound is not being hoisted

Fri Jun 27 13:23:03 UTC 2025

Hi Chris,
your benchmark seems to point more at a performance difference 
between array and memory segment vector loads.

The memory segment load has to perform an additional liveness check, but 
I doubt this is your issue here.

This leaves the bound check difference -- the array load uses the int 
version of the Objects.checkIndex, whereas the segment load uses the 
long version of that. Both are intrinsified, so they should work 
correctly, but maybe there's something going on.
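
To make the int vs. long distinction concrete, here is a minimal sketch of 
the two Objects.checkIndex overloads in isolation. The class and method 
names are hypothetical and only for illustration. This is not the Vector 
API's internal code path, just the public overloads it is described as 
relying on (the long overload requires Java 16 or later).

```java
import java.util.Objects;

public class BoundsCheckSketch {

    // int flavour: the kind of check an array-backed load can use,
    // because array lengths are ints.
    static int checkArrayIndex(int index, int length) {
        return Objects.checkIndex(index, length);
    }

    // long flavour: the kind of check a segment-backed load needs,
    // because MemorySegment::byteSize returns a long.
    static long checkSegmentOffset(long offset, long byteSize) {
        return Objects.checkIndex(offset, byteSize);
    }

    // Helper (hypothetical): reports whether the long check rejects the offset.
    static boolean isOutOfBounds(long offset, long byteSize) {
        try {
            Objects.checkIndex(offset, byteSize);
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        float[] a = new float[16];
        // Equivalent size in bytes, as a segment over the same data would report it.
        long byteSize = (long) a.length * Float.BYTES;

        System.out.println(checkArrayIndex(12, a.length));
        System.out.println(checkSegmentOffset(48L, byteSize));
        System.out.println(isOutOfBounds(byteSize, byteSize));
    }
}
```

Both overloads have the same contract (0 <= index < length); only the 
operand width differs, which is why any gap between them would point at 
how the JIT treats the long variant rather than at the check itself.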

Have you tried using this:

-Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=0

This should completely disable the bound check from the load operation. 
It would be interesting to know if that eliminates the performance delta.

Maurizio

On 27/06/2025 11:28, Chris Hegarty wrote:
> Hi,
>
> I've been rewriting parts of our codebase which currently uses the 
> Panama Vector API to provide optimised distance comparison functions 
> for vector search algorithms. We previously used float[]'s which 
> necessity a copy from our off-heap storage into the heap, so we simply 
> want to use a MemorySegments to avoid this - since our stored vectors 
> are in a file on-disk and mmapp'ed.
>
> I see that using MemorySegment::byteSize as a loop bound is not as 
> optimised as it could be. The bound is not getting hoisted out of the 
> loop body, whereas it is when using array length.
>
> I created a minimal jmh benchmark that demonstrates what I see
> (some assumptions are made about unrolling and tail avoidance for 
> simplicity):
> https://github.com/ChrisHegarty/memseg-vector-bench/tree/main
>
> Example output
> Benchmark                               Mode  Cnt   Score   Error Units
> MemorySegmentBench.dotProductArray      avgt   20  61.154 ± 0.266 ns/op
> MemorySegmentBench.dotProductHeapSeg    avgt   20  98.806 ± 3.143 ns/op
> MemorySegmentBench.dotProductNativeSeg  avgt   20  95.282 ± 0.356 ns/op
>
> I would have expected memory segment to perform better than this, but 
> maybe this is just not optimised yet on AArch64 (I've not tried x64 
> yet). Or am I doing something wrong?
>
> For now, I'm working around this by writing my own native 
> implementation and linking through FFI, but this is quite a bit of 
> effort just to avoid this bug. And my native implementation only gets 
> me back to the array perf.
>
> -Chris.