Using MemorySegment::byteSize as a loop bound is not being hoisted
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Fri Jun 27 13:49:46 UTC 2025
For the record, I'm seeing similar numbers on x64.
Setting VECTOR_ACCESS_OOB_CHECK to 0 helps a bit (in my case I go from
~90 ns/op to ~70 ns/op -- the plain array is still a bit faster).
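(For reference, with JMH the property has to reach the forked JVM;
something along these lines should work, assuming the usual
benchmarks.jar layout:)
```
java -jar target/benchmarks.jar MemorySegmentBench \
    -jvmArgsAppend "--add-modules=jdk.incubator.vector -Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=0"
```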
I also tried running with this patch:
https://github.com/openjdk/jdk/pull/21630
which should help with short-running memory segment loops, but no luck --
there's a slight improvement of 3-4 ns/op, but no more than that.
I also tried uncommenting your code to hardcode the limit. On its own,
that doesn't seem to help, but if I do this:
```
final int limit = 4096; // see how much can be got by just hardcoding the limit
a = a.asSlice(0, limit);
b = b.asSlice(0, limit);
```
Then I get this:
```
Benchmark                                Mode  Cnt   Score   Error  Units
MemorySegmentBench.dotProductArray       avgt   10  75.547 ± 0.898  ns/op
MemorySegmentBench.dotProductHeapSeg     avgt   10  78.485 ± 0.360  ns/op
MemorySegmentBench.dotProductNativeSeg   avgt   10  72.580 ± 0.305  ns/op
```
So, re-asserting the bounds of a memory segment can sometimes have a
positive effect. That said, it looks as if the JVM should be able to do
a better job here, but something seems to be failing (a single check
hoisted out of the loop shouldn't cost 20 ns/op). Maybe the issue is
with the vector load intrinsic, or a bad interaction between that
intrinsic and the memory segment bound check optimizations?
Vladimir, Quan, could you please take a look?
Maurizio
On 27/06/2025 14:23, Maurizio Cimadamore wrote:
> Hi Chris,
> your benchmark seems to point more at a performance difference
> between array and memory segment vector loads.
>
> The memory segment load has to perform an additional liveness check,
> but I doubt this is your issue here.
>
> This leaves the bound check difference -- the array load uses the int
> version of Objects.checkIndex, whereas the segment load uses the long
> version. Both are intrinsified, so they should work correctly, but
> maybe there's something going on.
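>
> To illustrate, the two shapes of check are roughly as follows (a
> sketch, not the actual VarHandle code paths):
>
> ```
> // int variant, as for array-backed vector loads
> int idx = java.util.Objects.checkIndex(i, array.length);
> // long variant, as for segment-backed vector loads
> long off = java.util.Objects.checkIndex(offset, segment.byteSize());
> ```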
>
> Have you tried using this:
>
> -Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=0
>
> This should completely disable the bound check performed by the load
> operation. It would be interesting to know if that eliminates the
> performance delta.
>
> Maurizio
>
>
> On 27/06/2025 11:28, Chris Hegarty wrote:
>> Hi,
>>
>> I've been rewriting parts of our codebase that currently use the
>> Panama Vector API to provide optimised distance comparison functions
>> for vector search algorithms. We previously used float[]s, which
>> necessitates a copy from our off-heap storage onto the heap, so we
>> simply want to use MemorySegments to avoid this - since our stored
>> vectors are in a file on disk and mmapped.
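>>
>> For illustration, a minimal sketch of that mapping with the FFM API
>> (file name hypothetical):
>>
>> ```
>> import java.lang.foreign.Arena;
>> import java.lang.foreign.MemorySegment;
>> import java.nio.channels.FileChannel;
>> import java.nio.file.Path;
>> import java.nio.file.StandardOpenOption;
>>
>> try (var arena = Arena.ofConfined();
>>      var channel = FileChannel.open(Path.of("vectors.bin"), StandardOpenOption.READ)) {
>>     // map the stored vectors straight into a segment -- no on-heap copy
>>     MemorySegment vectors = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size(), arena);
>>     // ... compute distances directly over `vectors` (or slices of it)
>> }
>> ```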
>>
>> I see that using MemorySegment::byteSize as a loop bound is not as
>> optimised as it could be. The bound is not hoisted out of the loop
>> body, whereas it is when using an array's length.
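>>
>> A minimal sketch of the problematic shape (simplified; the actual
>> code is in the benchmark linked below, and assumes byteSize() is a
>> multiple of the vector byte size):
>>
>> ```
>> import java.lang.foreign.MemorySegment;
>> import java.nio.ByteOrder;
>> import jdk.incubator.vector.FloatVector;
>> import jdk.incubator.vector.VectorOperators;
>> import jdk.incubator.vector.VectorSpecies;
>>
>> static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;
>>
>> // byteSize() as the loop bound -- this is the check that is not hoisted
>> static float dotProduct(MemorySegment a, MemorySegment b) {
>>     var acc = FloatVector.zero(SPECIES);
>>     for (long i = 0; i < a.byteSize(); i += SPECIES.vectorByteSize()) {
>>         var va = FloatVector.fromMemorySegment(SPECIES, a, i, ByteOrder.nativeOrder());
>>         var vb = FloatVector.fromMemorySegment(SPECIES, b, i, ByteOrder.nativeOrder());
>>         acc = acc.add(va.mul(vb));
>>     }
>>     return acc.reduceLanes(VectorOperators.ADD);
>> }
>> ```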
>>
>> I created a minimal JMH benchmark that demonstrates what I see
>> (some assumptions are made about unrolling and tail avoidance, for
>> simplicity):
>> https://github.com/ChrisHegarty/memseg-vector-bench/tree/main
>>
>> Example output
>> Benchmark                                Mode  Cnt   Score   Error  Units
>> MemorySegmentBench.dotProductArray       avgt   20  61.154 ± 0.266  ns/op
>> MemorySegmentBench.dotProductHeapSeg     avgt   20  98.806 ± 3.143  ns/op
>> MemorySegmentBench.dotProductNativeSeg   avgt   20  95.282 ± 0.356  ns/op
>>
>> I would have expected memory segments to perform better than this,
>> but maybe this is just not optimised yet on AArch64 (I've not tried
>> x64 yet). Or am I doing something wrong?
>>
>> For now, I'm working around this by writing my own native
>> implementation and linking to it through FFI, but this is quite a bit
>> of effort just to avoid this bug. And my native implementation only
>> gets me back to the array performance.
>>
>> -Chris.