Using MemorySegment::byteSize as a loop bound is not being hoisted

Fri Jun 27 14:27:03 UTC 2025

The trick I described can be generalized a little by creating two
fixed-size slices inside the loop and using them in place of the
original segments.

```
for (; i < limit; i += FLOAT_SPECIES.vectorByteSize() * 4) {
    MemorySegment a2 = a.asSlice(i, FLOAT_SPECIES.vectorByteSize() * 4);
    MemorySegment b2 = b.asSlice(i, FLOAT_SPECIES.vectorByteSize() * 4);

    FloatVector va1 = FloatVector.fromMemorySegment(FLOAT_SPECIES, a2, 0, LE);
    FloatVector vb1 = FloatVector.fromMemorySegment(FLOAT_SPECIES, b2, 0, LE);
    acc1 = va1.fma(vb1, acc1);

    FloatVector va2 = FloatVector.fromMemorySegment(FLOAT_SPECIES, a2, FLOAT_SPECIES.vectorByteSize(), LE);
    FloatVector vb2 = FloatVector.fromMemorySegment(FLOAT_SPECIES, b2, FLOAT_SPECIES.vectorByteSize(), LE);
    acc2 = va2.fma(vb2, acc2);

    FloatVector va3 = FloatVector.fromMemorySegment(FLOAT_SPECIES, a2, 2 * FLOAT_SPECIES.vectorByteSize(), LE);
    FloatVector vb3 = FloatVector.fromMemorySegment(FLOAT_SPECIES, b2, 2 * FLOAT_SPECIES.vectorByteSize(), LE);
    acc3 = va3.fma(vb3, acc3);

    FloatVector va4 = FloatVector.fromMemorySegment(FLOAT_SPECIES, a2, 3 * FLOAT_SPECIES.vectorByteSize(), LE);
    FloatVector vb4 = FloatVector.fromMemorySegment(FLOAT_SPECIES, b2, 3 * FLOAT_SPECIES.vectorByteSize(), LE);
    acc4 = va4.fma(vb4, acc4);
}
```

This gives me:

```
Benchmark                               Mode  Cnt   Score   Error  Units
MemorySegmentBench.dotProductArray      avgt   10  69.247 ± 0.477  ns/op
MemorySegmentBench.dotProductHeapSeg    avgt   10  75.491 ± 1.024  ns/op
MemorySegmentBench.dotProductNativeSeg  avgt   10  76.197 ± 0.172  ns/op
```

The segment versions are still not as fast as the array version, but
quite a bit faster than the original.
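
For anyone who wants to try the slicing pattern without the benchmark
harness, here is a minimal scalar sketch of the same idea. It is not the
benchmark code: it assumes Java 22 or later (finalized java.lang.foreign
API), and the class name SliceDotProduct, the block size BLOCK_FLOATS and
the scalar inner loop are hypothetical stand-ins for the vector loads
above.

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Hypothetical illustration of the "fixed-size slice per iteration" pattern.
final class SliceDotProduct {
    // Assumed block size; the benchmark uses FLOAT_SPECIES.vectorByteSize() * 4.
    static final long BLOCK_FLOATS = 16;
    static final long BLOCK_BYTES = BLOCK_FLOATS * Float.BYTES;

    static float dot(MemorySegment a, MemorySegment b) {
        long end = Math.min(a.byteSize(), b.byteSize());
        long limit = end - (end % BLOCK_BYTES); // whole blocks only
        float acc = 0f;
        long i = 0;
        for (; i < limit; i += BLOCK_BYTES) {
            // One bounds check per segment per block: the slices have a
            // constant size, so offsets into them are known to be in range.
            MemorySegment a2 = a.asSlice(i, BLOCK_BYTES);
            MemorySegment b2 = b.asSlice(i, BLOCK_BYTES);
            for (long j = 0; j < BLOCK_BYTES; j += Float.BYTES) {
                acc += a2.get(ValueLayout.JAVA_FLOAT_UNALIGNED, j)
                     * b2.get(ValueLayout.JAVA_FLOAT_UNALIGNED, j);
            }
        }
        // Tail: remaining floats that do not fill a whole block.
        for (; i + Float.BYTES <= end; i += Float.BYTES) {
            acc += a.get(ValueLayout.JAVA_FLOAT_UNALIGNED, i)
                 * b.get(ValueLayout.JAVA_FLOAT_UNALIGNED, i);
        }
        return acc;
    }

    public static void main(String[] args) {
        float[] x = new float[20];
        float[] y = new float[20];
        for (int k = 0; k < x.length; k++) {
            x[k] = k + 1; // 1, 2, ..., 20
            y[k] = 2f;
        }
        // 2 * (1 + 2 + ... + 20) = 420
        System.out.println(dot(MemorySegment.ofArray(x), MemorySegment.ofArray(y)));
    }
}
```

Whether C2 actually removes the per-access checks on the slices depends
on the JVM build; the sketch only shows the shape of the loop and makes
no performance claim.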

Maurizio

On 27/06/2025 14:49, Maurizio Cimadamore wrote:
> I also tried uncommenting your code to hardcode the limit. Alone, that 
> doesn't seem to help, but if I do this:
>
> ```
>         final int limit = 4096; // see how much can be got by just hardcoding the limit
>         a = a.asSlice(0, limit);
>         b = b.asSlice(0, limit);
> ```
>
> Then I get this:
>
> ```
> Benchmark                               Mode  Cnt   Score   Error Units
> MemorySegmentBench.dotProductArray      avgt   10  75.547 ± 0.898 ns/op
> MemorySegmentBench.dotProductHeapSeg    avgt   10  78.485 ± 0.360 ns/op
> MemorySegmentBench.dotProductNativeSeg  avgt   10  72.580 ± 0.305 ns/op
> ```
>
> Sometimes re-asserting the bounds of a memory segment might lead to 
> some positive effects. That said, it looks as if the JVM should be 
> able to do a better job here, but something seems to be failing (a 
> single check hoisted out of the loop shouldn't cost 20ns). But maybe 
> the issue is with the vector load intrinsics, or a bad interaction 
> between that intrinsic and the memory segment bound check optimizations?
>
> Vladimir, Quan, could you please take a look?