Using MemorySegment::byteSize as a loop bound is not being hoisted
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Fri Jun 27 14:27:03 UTC 2025
The trick I described can be generalized a little, by creating two
slices of fixed size inside the loop, and then using them instead of the
original segments.
```
for (; i < limit; i += FLOAT_SPECIES.vectorByteSize() * 4) {
    MemorySegment a2 = a.asSlice(i, FLOAT_SPECIES.vectorByteSize() * 4);
    MemorySegment b2 = b.asSlice(i, FLOAT_SPECIES.vectorByteSize() * 4);
    FloatVector va1 = FloatVector.fromMemorySegment(FLOAT_SPECIES, a2, 0, LE);
    FloatVector vb1 = FloatVector.fromMemorySegment(FLOAT_SPECIES, b2, 0, LE);
    acc1 = va1.fma(vb1, acc1);
    FloatVector va2 = FloatVector.fromMemorySegment(FLOAT_SPECIES, a2,
            FLOAT_SPECIES.vectorByteSize(), LE);
    FloatVector vb2 = FloatVector.fromMemorySegment(FLOAT_SPECIES, b2,
            FLOAT_SPECIES.vectorByteSize(), LE);
    acc2 = va2.fma(vb2, acc2);
    FloatVector va3 = FloatVector.fromMemorySegment(FLOAT_SPECIES, a2,
            2 * FLOAT_SPECIES.vectorByteSize(), LE);
    FloatVector vb3 = FloatVector.fromMemorySegment(FLOAT_SPECIES, b2,
            2 * FLOAT_SPECIES.vectorByteSize(), LE);
    acc3 = va3.fma(vb3, acc3);
    FloatVector va4 = FloatVector.fromMemorySegment(FLOAT_SPECIES, a2,
            3 * FLOAT_SPECIES.vectorByteSize(), LE);
    FloatVector vb4 = FloatVector.fromMemorySegment(FLOAT_SPECIES, b2,
            3 * FLOAT_SPECIES.vectorByteSize(), LE);
    acc4 = va4.fma(vb4, acc4);
}
```
This gives me:
Benchmark                               Mode  Cnt   Score   Error  Units
MemorySegmentBench.dotProductArray      avgt   10  69.247 ± 0.477  ns/op
MemorySegmentBench.dotProductHeapSeg    avgt   10  75.491 ± 1.024  ns/op
MemorySegmentBench.dotProductNativeSeg  avgt   10  76.197 ± 0.172  ns/op
Not as fast, but quite a bit faster than the original.
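
For anyone who wants to try the slicing pattern without the incubating Vector API, here is a minimal scalar sketch of the same idea. It is not the benchmark code: the class name SliceDotProduct and the block size BLOCK_FLOATS are made up for illustration, it assumes a JDK where java.lang.foreign is final (22 or later), and it makes no claim about what C2 will do with it. It only shows the shape of the loop: take a fixed-size slice of each segment per iteration, then index into the slices with small constant-bounded offsets.

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class SliceDotProduct {
    // Hypothetical block size for illustration: 4 floats (16 bytes) per iteration.
    static final int BLOCK_FLOATS = 4;
    static final long BLOCK_BYTES = (long) BLOCK_FLOATS * Float.BYTES;

    static float dotProduct(MemorySegment a, MemorySegment b) {
        long size = Math.min(a.byteSize(), b.byteSize());
        // Largest multiple of BLOCK_BYTES that fits in both segments.
        long limit = size - (size % BLOCK_BYTES);
        float sum = 0f;
        long i = 0;
        for (; i < limit; i += BLOCK_BYTES) {
            // Fixed-size slices: their byte size is a constant, unlike the
            // byte size of the original segments.
            MemorySegment a2 = a.asSlice(i, BLOCK_BYTES);
            MemorySegment b2 = b.asSlice(i, BLOCK_BYTES);
            for (int j = 0; j < BLOCK_FLOATS; j++) {
                long off = (long) j * Float.BYTES;
                sum += a2.get(ValueLayout.JAVA_FLOAT, off)
                     * b2.get(ValueLayout.JAVA_FLOAT, off);
            }
        }
        // Tail: any remaining floats are read from the original segments.
        for (; i + Float.BYTES <= size; i += Float.BYTES) {
            sum += a.get(ValueLayout.JAVA_FLOAT, i) * b.get(ValueLayout.JAVA_FLOAT, i);
        }
        return sum;
    }

    // Convenience overload that wraps the arrays in heap segments.
    static float dotProduct(float[] a, float[] b) {
        return dotProduct(MemorySegment.ofArray(a), MemorySegment.ofArray(b));
    }

    public static void main(String[] args) {
        float[] x = {1, 2, 3, 4, 5, 6};
        float[] y = {1, 1, 1, 1, 1, 1};
        System.out.println(dotProduct(x, y));
    }
}
```

The tail loop is only there so the sketch works for lengths that are not a multiple of the block size; the snippet above leaves the tail to the surrounding benchmark code.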
Maurizio
On 27/06/2025 14:49, Maurizio Cimadamore wrote:
> I also tried uncommenting your code to hardcode the limit. Alone, that
> doesn't seem to help, but if I do this:
>
> ```
> final int limit = 4096; // see how much can be got by just hardcoding the limit
> a = a.asSlice(0, limit);
> b = b.asSlice(0, limit);
> ```
>
> Then I get this:
>
> ```
> Benchmark                               Mode  Cnt   Score   Error  Units
> MemorySegmentBench.dotProductArray      avgt   10  75.547 ± 0.898  ns/op
> MemorySegmentBench.dotProductHeapSeg    avgt   10  78.485 ± 0.360  ns/op
> MemorySegmentBench.dotProductNativeSeg  avgt   10  72.580 ± 0.305  ns/op
> ```
>
> Sometimes re-asserting the bounds of a memory segment might lead to
> some positive effects. That said, it looks as if the JVM should be
> able to do a better job here, but something seems to be failing (a
> single check hoisted out of the loop shouldn't cost 20ns). But maybe
> the issue is with the vector load intrinsics, or a bad interaction
> between that intrinsic and the memory segment bound check optimizations?
>
> Vladimir, Quan, could you please take a look?