Using MemorySegment::byteSize as a loop bound is not being hoisted
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Fri Jun 27 13:49:46 UTC 2025
For the record, I'm seeing similar numbers on x64.
Setting VECTOR_ACCESS_OOB_CHECK to 0 helps a bit (in my case I go from
~90 ns/op to ~70 ns/op -- the plain array is still a bit faster).
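(For reference, with JMH the property has to reach the forked JVM;
something along these lines should work, assuming the usual
benchmarks.jar layout:)
```
java -jar target/benchmarks.jar MemorySegmentBench \
    -jvmArgsAppend "--add-modules=jdk.incubator.vector -Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=0"
```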
I also tried running with this patch:
https://github.com/openjdk/jdk/pull/21630
which should help with short-running memory segment loops, but no luck --
there's a slight improvement of 3-4 ns/op, but no more than that.
I also tried uncommenting your code to hardcode the limit. On its own,
that doesn't seem to help, but if I do this:
```
final int limit = 4096; // see how much can be got by just hardcoding the limit
a = a.asSlice(0, limit);
b = b.asSlice(0, limit);
```
Then I get this:
```
Benchmark                                Mode  Cnt   Score   Error  Units
MemorySegmentBench.dotProductArray       avgt   10  75.547 ± 0.898  ns/op
MemorySegmentBench.dotProductHeapSeg     avgt   10  78.485 ± 0.360  ns/op
MemorySegmentBench.dotProductNativeSeg   avgt   10  72.580 ± 0.305  ns/op
```
So, re-asserting the bounds of a memory segment can sometimes have a
positive effect. That said, it looks as if the JVM should be able to do
a better job here, but something seems to be failing (a single check
hoisted out of the loop shouldn't cost 20 ns/op). Maybe the issue is
with the vector load intrinsic, or a bad interaction between that
intrinsic and the memory segment bound check optimizations?
Vladimir, Quan, could you please take a look?
Maurizio
On 27/06/2025 14:23, Maurizio Cimadamore wrote:
> Hi Chris,
> your benchmark seems to point more at a performance difference
> between array and memory segment vector loads.
>
> The memory segment load has to perform an additional liveness check,
> but I doubt this is your issue here.
>
> This leaves the bound check difference -- the array load uses the int
> version of Objects.checkIndex, whereas the segment load uses the long
> version. Both are intrinsified, so they should work correctly, but
> maybe there's something going on.
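>
> To illustrate, the two shapes of check are roughly as follows (a
> sketch, not the actual VarHandle code paths):
>
> ```
> // int variant, as for array-backed vector loads
> int idx = java.util.Objects.checkIndex(i, array.length);
> // long variant, as for segment-backed vector loads
> long off = java.util.Objects.checkIndex(offset, segment.byteSize());
> ```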
>
> Have you tried using this:
>
> -Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=0
>
> This should completely disable the bound check performed by the load
> operation. It would be interesting to know if that eliminates the
> performance delta.
>
> Maurizio
>
>
> On 27/06/2025 11:28, Chris Hegarty wrote:
>> Hi,
>>
>> I've been rewriting parts of our codebase that currently use the
>> Panama Vector API to provide optimised distance comparison functions
>> for vector search algorithms. We previously used float[]s, which
>> necessitates a copy from our off-heap storage onto the heap, so we
>> simply want to use MemorySegments to avoid this - since our stored
>> vectors are in a file on disk and mmapped.
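>>
>> For illustration, a minimal sketch of that mapping with the FFM API
>> (file name hypothetical):
>>
>> ```
>> import java.lang.foreign.Arena;
>> import java.lang.foreign.MemorySegment;
>> import java.nio.channels.FileChannel;
>> import java.nio.file.Path;
>> import java.nio.file.StandardOpenOption;
>>
>> try (var arena = Arena.ofConfined();
>>      var channel = FileChannel.open(Path.of("vectors.bin"), StandardOpenOption.READ)) {
>>     // map the stored vectors straight into a segment -- no on-heap copy
>>     MemorySegment vectors = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size(), arena);
>>     // ... compute distances directly over `vectors` (or slices of it)
>> }
>> ```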
>>
>> I see that using MemorySegment::byteSize as a loop bound is not as
>> optimised as it could be. The bound is not hoisted out of the loop
>> body, whereas it is when using an array's length.
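>>
>> A minimal sketch of the problematic shape (simplified; the actual
>> code is in the benchmark linked below, and assumes byteSize() is a
>> multiple of the vector byte size):
>>
>> ```
>> import java.lang.foreign.MemorySegment;
>> import java.nio.ByteOrder;
>> import jdk.incubator.vector.FloatVector;
>> import jdk.incubator.vector.VectorOperators;
>> import jdk.incubator.vector.VectorSpecies;
>>
>> static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;
>>
>> // byteSize() as the loop bound -- this is the check that is not hoisted
>> static float dotProduct(MemorySegment a, MemorySegment b) {
>>     var acc = FloatVector.zero(SPECIES);
>>     for (long i = 0; i < a.byteSize(); i += SPECIES.vectorByteSize()) {
>>         var va = FloatVector.fromMemorySegment(SPECIES, a, i, ByteOrder.nativeOrder());
>>         var vb = FloatVector.fromMemorySegment(SPECIES, b, i, ByteOrder.nativeOrder());
>>         acc = acc.add(va.mul(vb));
>>     }
>>     return acc.reduceLanes(VectorOperators.ADD);
>> }
>> ```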
>>
>> I created a minimal JMH benchmark that demonstrates what I see
>> (some assumptions are made about unrolling and tail avoidance, for
>> simplicity):
>> https://github.com/ChrisHegarty/memseg-vector-bench/tree/main
>>
>> Example output
>> Benchmark                                Mode  Cnt   Score   Error  Units
>> MemorySegmentBench.dotProductArray       avgt   20  61.154 ± 0.266  ns/op
>> MemorySegmentBench.dotProductHeapSeg     avgt   20  98.806 ± 3.143  ns/op
>> MemorySegmentBench.dotProductNativeSeg   avgt   20  95.282 ± 0.356  ns/op
>>
>> I would have expected memory segments to perform better than this,
>> but maybe this is just not optimised yet on AArch64 (I've not tried
>> x64 yet). Or am I doing something wrong?
>>
>> For now, I'm working around this by writing my own native
>> implementation and linking to it through FFI, but this is quite a bit
>> of effort just to avoid this bug. And my native implementation only
>> gets me back to the array performance.
>>
>> -Chris.