Using MemorySegment::byteSize as a loop bound is not being hoisted
Chris Hegarty
chegar999 at gmail.com
Fri Jun 27 16:48:59 UTC 2025
Hi Maurizio,
Thanks for looking into this.
Ha, that's a cute trick! I can confirm that it helps in my benchmarks too.
Benchmark                                           (size)  Mode  Cnt   Score   Error  Units
VectorUtilBenchmark.floatDotProductVector             1024  avgt   10  61.771 ± 6.178  ns/op
VectorUtilBenchmark.floatDotProductVectorHeapSeg      1024  avgt   10  66.858 ± 3.078  ns/op
VectorUtilBenchmark.floatDotProductVectorNativeSeg    1024  avgt   10  65.696 ± 0.526  ns/op
Should I file a JIRA issue to track this, or are you already on it?
-Chris.
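
For anyone who wants to try the shape of the per-iteration slice trick without the Vector API, here is a minimal sketch. It assumes Java 22 or later (final `java.lang.foreign` API) and uses a scalar inner loop as a stand-in for the `FloatVector` loads in the quoted code. The class name `SliceDotProduct` and the constants `BLOCK_FLOATS` and `BLOCK_BYTES` are hypothetical, chosen only for illustration. It shows where the fixed-size slices go; it does not reproduce the benchmark and makes no performance claim.

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

final class SliceDotProduct {
    // Hypothetical block size: 16 floats per iteration, standing in for
    // FLOAT_SPECIES.vectorByteSize() * 4 in the vectorized loop.
    static final int BLOCK_FLOATS = 16;
    static final long BLOCK_BYTES = (long) BLOCK_FLOATS * Float.BYTES;

    // Dot product of two segments of equal byte size.
    static float dot(MemorySegment a, MemorySegment b) {
        long size = a.byteSize();
        long limit = size - (size % BLOCK_BYTES); // largest multiple of the block size
        float acc = 0f;
        long i = 0;
        for (; i < limit; i += BLOCK_BYTES) {
            // Re-slice with a constant size so the slice bounds are known
            // inside the loop body.
            MemorySegment a2 = a.asSlice(i, BLOCK_BYTES);
            MemorySegment b2 = b.asSlice(i, BLOCK_BYTES);
            for (int j = 0; j < BLOCK_FLOATS; j++) {
                acc += a2.getAtIndex(ValueLayout.JAVA_FLOAT, j)
                     * b2.getAtIndex(ValueLayout.JAVA_FLOAT, j);
            }
        }
        // Scalar tail for the bytes that do not fill a whole block.
        for (; i < size; i += Float.BYTES) {
            acc += a.get(ValueLayout.JAVA_FLOAT, i) * b.get(ValueLayout.JAVA_FLOAT, i);
        }
        return acc;
    }
}
```

Calling `SliceDotProduct.dot(MemorySegment.ofArray(x), MemorySegment.ofArray(y))` on two `float[]` arrays of the same length exercises the heap-segment path; a native segment of the same size can be passed in the same way.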
On 27/06/2025 15:27, Maurizio Cimadamore wrote:
> The trick I described can be generalized a little, by creating two
> slices of fixed size inside the loop, and then using them instead of the
> original segments.
>
> ```
> for (; i < limit; i += FLOAT_SPECIES.vectorByteSize() * 4) {
>     MemorySegment a2 = a.asSlice(i, FLOAT_SPECIES.vectorByteSize() * 4);
>     MemorySegment b2 = b.asSlice(i, FLOAT_SPECIES.vectorByteSize() * 4);
>
>     FloatVector va1 = FloatVector.fromMemorySegment(FLOAT_SPECIES, a2, 0, LE);
>     FloatVector vb1 = FloatVector.fromMemorySegment(FLOAT_SPECIES, b2, 0, LE);
>     acc1 = va1.fma(vb1, acc1);
>
>     FloatVector va2 = FloatVector.fromMemorySegment(FLOAT_SPECIES, a2,
>             FLOAT_SPECIES.vectorByteSize(), LE);
>     FloatVector vb2 = FloatVector.fromMemorySegment(FLOAT_SPECIES, b2,
>             FLOAT_SPECIES.vectorByteSize(), LE);
>     acc2 = va2.fma(vb2, acc2);
>
>     FloatVector va3 = FloatVector.fromMemorySegment(FLOAT_SPECIES, a2,
>             2 * FLOAT_SPECIES.vectorByteSize(), LE);
>     FloatVector vb3 = FloatVector.fromMemorySegment(FLOAT_SPECIES, b2,
>             2 * FLOAT_SPECIES.vectorByteSize(), LE);
>     acc3 = va3.fma(vb3, acc3);
>
>     FloatVector va4 = FloatVector.fromMemorySegment(FLOAT_SPECIES, a2,
>             3 * FLOAT_SPECIES.vectorByteSize(), LE);
>     FloatVector vb4 = FloatVector.fromMemorySegment(FLOAT_SPECIES, b2,
>             3 * FLOAT_SPECIES.vectorByteSize(), LE);
>     acc4 = va4.fma(vb4, acc4);
> }
> ```
>
> This gives me:
>
> Benchmark                               Mode  Cnt   Score   Error  Units
> MemorySegmentBench.dotProductArray      avgt   10  69.247 ± 0.477  ns/op
> MemorySegmentBench.dotProductHeapSeg    avgt   10  75.491 ± 1.024  ns/op
> MemorySegmentBench.dotProductNativeSeg  avgt   10  76.197 ± 0.172  ns/op
>
> Not as fast, but quite a bit faster than the original.
>
> Maurizio
>
> On 27/06/2025 14:49, Maurizio Cimadamore wrote:
>> I also tried uncommenting your code to hardcode the limit. Alone, that
>> doesn't seem to help, but if I do this:
>>
>> ```
>> final int limit = 4096; // see how much can be got by just hardcoding the limit
>> a = a.asSlice(0, limit);
>> b = b.asSlice(0, limit);
>> ```
>>
>> Then I get this:
>>
>> ```
>> Benchmark                               Mode  Cnt   Score   Error  Units
>> MemorySegmentBench.dotProductArray      avgt   10  75.547 ± 0.898  ns/op
>> MemorySegmentBench.dotProductHeapSeg    avgt   10  78.485 ± 0.360  ns/op
>> MemorySegmentBench.dotProductNativeSeg  avgt   10  72.580 ± 0.305  ns/op
>> ```
>>
>> Sometimes re-asserting the bounds of a memory segment might lead to
>> some positive effects. That said, it looks as if the JVM should be
>> able to do a better job here, but something seems to be failing (a
>> single check hoisted out of the loop shouldn't cost 20ns). But maybe
>> the issue is with the vector load intrinsics, or a bad interaction
>> between that intrinsic and the memory segment bound check optimizations?
>>
>> Vladimir, Quan, could you please take a look?
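
The "re-assert the bounds before the loop" idea from the oldest message above can be sketched on its own as follows. This is an illustrative sketch only, assuming Java 22 or later; the class name `ResliceBeforeLoop` and the constant `LIMIT_BYTES` are hypothetical, and the scalar loop stands in for the vectorized benchmark body. It makes no claim about what the JIT does with it.

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

final class ResliceBeforeLoop {
    // Hypothetical hardcoded limit, matching the 4096 bytes (1024 floats)
    // used in the quoted snippet.
    static final int LIMIT_BYTES = 4096;

    static float dot(MemorySegment a, MemorySegment b) {
        // Re-slice both segments to a compile-time-constant size.
        // asSlice throws IndexOutOfBoundsException if a segment is shorter.
        a = a.asSlice(0, LIMIT_BYTES);
        b = b.asSlice(0, LIMIT_BYTES);

        float acc = 0f;
        for (long i = 0; i < LIMIT_BYTES; i += Float.BYTES) {
            acc += a.get(ValueLayout.JAVA_FLOAT, i) * b.get(ValueLayout.JAVA_FLOAT, i);
        }
        return acc;
    }
}
```

The trade-off is that the size is fixed: segments shorter than `LIMIT_BYTES` are rejected by `asSlice`, and anything past the limit is ignored.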