Using MemorySegment::byteSize as a loop bound is not being hoisted

Fri Jun 27 16:48:59 UTC 2025

Hi Maurizio,

Thanks for looking into this.

Ha! That's a cute trick. I can confirm that it helps in my benchmarks too.

Benchmark                                           (size)  Mode  Cnt   Score   Error  Units
VectorUtilBenchmark.floatDotProductVector             1024  avgt   10  61.771 ± 6.178  ns/op
VectorUtilBenchmark.floatDotProductVectorHeapSeg      1024  avgt   10  66.858 ± 3.078  ns/op
VectorUtilBenchmark.floatDotProductVectorNativeSeg    1024  avgt   10  65.696 ± 0.526  ns/op
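
For anyone who wants to poke at the re-slicing idea without the incubator module, here is a minimal sketch of the same shape using plain scalar access instead of FloatVector.fromMemorySegment. The names SliceDotProduct, BLOCK_BYTES, LE_FLOAT and dotHeap are made up for illustration and are not from the benchmark. It only demonstrates the structure (fixed-size slices created inside the loop, constant offsets into them); I am not claiming it shows the same speedup.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.ByteOrder;

public class SliceDotProduct {
    // Unaligned little-endian float layout, usable on heap and native segments.
    static final ValueLayout.OfFloat LE_FLOAT =
            ValueLayout.JAVA_FLOAT_UNALIGNED.withOrder(ByteOrder.LITTLE_ENDIAN);

    // Hypothetical block size: 4 floats per iteration (stand-in for the
    // FLOAT_SPECIES.vectorByteSize() * 4 stride in the real benchmark).
    static final long BLOCK_BYTES = 4L * Float.BYTES;

    static float dot(MemorySegment a, MemorySegment b) {
        long limit = Math.min(a.byteSize(), b.byteSize());
        long blocked = limit - (limit % BLOCK_BYTES);
        float acc = 0f;
        long i = 0;
        for (; i < blocked; i += BLOCK_BYTES) {
            // Re-assert the bounds: each slice has a constant, known size,
            // so the accesses below use constant offsets into it.
            MemorySegment a2 = a.asSlice(i, BLOCK_BYTES);
            MemorySegment b2 = b.asSlice(i, BLOCK_BYTES);
            for (long off = 0; off < BLOCK_BYTES; off += Float.BYTES) {
                acc += a2.get(LE_FLOAT, off) * b2.get(LE_FLOAT, off);
            }
        }
        // Tail: whole floats that did not fill a complete block.
        for (; i + Float.BYTES <= limit; i += Float.BYTES) {
            acc += a.get(LE_FLOAT, i) * b.get(LE_FLOAT, i);
        }
        return acc;
    }

    // Writes the values into the segment using the little-endian layout.
    static void fill(MemorySegment seg, float[] values) {
        for (int k = 0; k < values.length; k++) {
            seg.set(LE_FLOAT, (long) k * Float.BYTES, values[k]);
        }
    }

    // Convenience wrapper: dot product over heap segments backed by fresh arrays.
    static float dotHeap(float[] x, float[] y) {
        MemorySegment a = MemorySegment.ofArray(new float[x.length]);
        MemorySegment b = MemorySegment.ofArray(new float[y.length]);
        fill(a, x);
        fill(b, y);
        return dot(a, b);
    }

    public static void main(String[] args) {
        float[] x = {1f, 2f, 3f, 4f, 5f, 6f};
        float[] y = {2f, 2f, 2f, 2f, 2f, 2f};
        System.out.println("heap:   " + dotHeap(x, y));
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment a = arena.allocate((long) x.length * Float.BYTES);
            MemorySegment b = arena.allocate((long) y.length * Float.BYTES);
            fill(a, x);
            fill(b, y);
            System.out.println("native: " + dot(a, b));
        }
    }
}
```

This needs the finalized FFM API (JDK 22 or later). Six floats gives one full block plus a two-element tail, so both loops are exercised.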

Should I file a JIRA issue to track this, or are you already on it?

-Chris.

On 27/06/2025 15:27, Maurizio Cimadamore wrote:
> The trick I described can be generalized a little, by creating two 
> slices of fixed size inside the loop, and then using them instead of the 
> original segments.
> 
> ```
> for (; i < limit; i += FLOAT_SPECIES.vectorByteSize() * 4) {
>     MemorySegment a2 = a.asSlice(i, FLOAT_SPECIES.vectorByteSize() * 4);
>     MemorySegment b2 = b.asSlice(i, FLOAT_SPECIES.vectorByteSize() * 4);
>
>     FloatVector va1 = FloatVector.fromMemorySegment(FLOAT_SPECIES, a2, 0, LE);
>     FloatVector vb1 = FloatVector.fromMemorySegment(FLOAT_SPECIES, b2, 0, LE);
>     acc1 = va1.fma(vb1, acc1);
>
>     FloatVector va2 = FloatVector.fromMemorySegment(FLOAT_SPECIES, a2, FLOAT_SPECIES.vectorByteSize(), LE);
>     FloatVector vb2 = FloatVector.fromMemorySegment(FLOAT_SPECIES, b2, FLOAT_SPECIES.vectorByteSize(), LE);
>     acc2 = va2.fma(vb2, acc2);
>
>     FloatVector va3 = FloatVector.fromMemorySegment(FLOAT_SPECIES, a2, 2 * FLOAT_SPECIES.vectorByteSize(), LE);
>     FloatVector vb3 = FloatVector.fromMemorySegment(FLOAT_SPECIES, b2, 2 * FLOAT_SPECIES.vectorByteSize(), LE);
>     acc3 = va3.fma(vb3, acc3);
>
>     FloatVector va4 = FloatVector.fromMemorySegment(FLOAT_SPECIES, a2, 3 * FLOAT_SPECIES.vectorByteSize(), LE);
>     FloatVector vb4 = FloatVector.fromMemorySegment(FLOAT_SPECIES, b2, 3 * FLOAT_SPECIES.vectorByteSize(), LE);
>     acc4 = va4.fma(vb4, acc4);
> }
> ```
> 
> This gives me:
> 
> Benchmark                               Mode  Cnt   Score   Error Units
> MemorySegmentBench.dotProductArray      avgt   10  69.247 ± 0.477 ns/op
> MemorySegmentBench.dotProductHeapSeg    avgt   10  75.491 ± 1.024 ns/op
> MemorySegmentBench.dotProductNativeSeg  avgt   10  76.197 ± 0.172 ns/op
> 
> Not as fast, but quite a bit faster than the original.
> 
> Maurizio
> 
> On 27/06/2025 14:49, Maurizio Cimadamore wrote:
>> I also tried uncommenting your code to hardcode the limit. Alone, that 
>> doesn't seem to help, but if I do this:
>>
>> ```
>>         final int limit = 4096; // see how much can be got by just hardcoding the limit
>>         a = a.asSlice(0, limit);
>>         b = b.asSlice(0, limit);
>> ```
>>
>> Then I get this:
>>
>> ```
>> Benchmark                               Mode  Cnt   Score   Error Units
>> MemorySegmentBench.dotProductArray      avgt   10  75.547 ± 0.898 ns/op
>> MemorySegmentBench.dotProductHeapSeg    avgt   10  78.485 ± 0.360 ns/op
>> MemorySegmentBench.dotProductNativeSeg  avgt   10  72.580 ± 0.305 ns/op
>> ```
>>
>> Sometimes re-asserting the bounds of a memory segment might lead to 
>> some positive effects. That said, it looks as if the JVM should be 
>> able to do a better job here, but something seems to be failing (a 
>> single check hoisted out of the loop shouldn't cost 20ns). But maybe 
>> the issue is with the vector load intrinsics, or a bad interaction 
>> between that intrinsic and the memory segment bound check optimizations?
>>
>> Vladimir, Quan, could you please take a look?