Using MemorySegment::byteSize as a loop bound is not being hoisted

Maurizio Cimadamore maurizio.cimadamore at oracle.com
Fri Jun 27 16:57:24 UTC 2025


We're taking a look internally too (well, people more adept at C2 than I am :-) ).

We will share any updates we have on this and, if needed, we will file
an issue. The fact that the more "general" version I suggested, using
`asSlice`, works makes me optimistic that this is probably an issue we
overlooked (since my suggestion performs exactly the same check as your
solution).
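For reference, here is a minimal sketch of the kind of loop the subject line
refers to, with the bound derived from `MemorySegment::byteSize` and the loads
going straight through the full segments. The method name and the single
accumulator are illustrative only; `FLOAT_SPECIES` and `LE` are as in the code
quoted below, and the real benchmark is unrolled:

```
// sketch only (assumes jdk.incubator.vector, java.lang.foreign.MemorySegment
// and java.nio.ByteOrder are imported)
static final VectorSpecies<Float> FLOAT_SPECIES = FloatVector.SPECIES_PREFERRED;
static final ByteOrder LE = ByteOrder.LITTLE_ENDIAN;

static float dotProduct(MemorySegment a, MemorySegment b) {
    // loop bound derived from the segment's own size
    long limit = a.byteSize() - a.byteSize() % FLOAT_SPECIES.vectorByteSize();
    FloatVector acc = FloatVector.zero(FLOAT_SPECIES);
    for (long i = 0; i < limit; i += FLOAT_SPECIES.vectorByteSize()) {
        // each load is bounds-checked against the full-size segments
        FloatVector va = FloatVector.fromMemorySegment(FLOAT_SPECIES, a, i, LE);
        FloatVector vb = FloatVector.fromMemorySegment(FLOAT_SPECIES, b, i, LE);
        acc = va.fma(vb, acc);
    }
    return acc.reduceLanes(VectorOperators.ADD);
}
```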

Cheers
Maurizio

On 27/06/2025 17:48, Chris Hegarty wrote:
> Hi Maurizio,
>
> Thanks for looking into this.
>
> ha! that's a cute trick. I can confirm that it helps in my benchmarks 
> too.
>
> Benchmark                                           (size)  Mode  Cnt   Score   Error  Units
> VectorUtilBenchmark.floatDotProductVector             1024  avgt   10  61.771 ± 6.178  ns/op
> VectorUtilBenchmark.floatDotProductVectorHeapSeg      1024  avgt   10  66.858 ± 3.078  ns/op
> VectorUtilBenchmark.floatDotProductVectorNativeSeg    1024  avgt   10  65.696 ± 0.526  ns/op
>
> Should I file a JIRA to track this? Or are you on it?
>
> -Chris.
>
> On 27/06/2025 15:27, Maurizio Cimadamore wrote:
>> The trick I described can be generalized a little, by creating two 
>> slices of fixed size inside the loop, and then using them instead of 
>> the original segments.
>>
>> ```
>> for (; i < limit; i += FLOAT_SPECIES.vectorByteSize() * 4) {
>>     MemorySegment a2 = a.asSlice(i, FLOAT_SPECIES.vectorByteSize() * 4);
>>     MemorySegment b2 = b.asSlice(i, FLOAT_SPECIES.vectorByteSize() * 4);
>>
>>     FloatVector va1 = FloatVector.fromMemorySegment(FLOAT_SPECIES, a2, 0, LE);
>>     FloatVector vb1 = FloatVector.fromMemorySegment(FLOAT_SPECIES, b2, 0, LE);
>>     acc1 = va1.fma(vb1, acc1);
>>
>>     FloatVector va2 = FloatVector.fromMemorySegment(FLOAT_SPECIES, a2, FLOAT_SPECIES.vectorByteSize(), LE);
>>     FloatVector vb2 = FloatVector.fromMemorySegment(FLOAT_SPECIES, b2, FLOAT_SPECIES.vectorByteSize(), LE);
>>     acc2 = va2.fma(vb2, acc2);
>>
>>     FloatVector va3 = FloatVector.fromMemorySegment(FLOAT_SPECIES, a2, 2 * FLOAT_SPECIES.vectorByteSize(), LE);
>>     FloatVector vb3 = FloatVector.fromMemorySegment(FLOAT_SPECIES, b2, 2 * FLOAT_SPECIES.vectorByteSize(), LE);
>>     acc3 = va3.fma(vb3, acc3);
>>
>>     FloatVector va4 = FloatVector.fromMemorySegment(FLOAT_SPECIES, a2, 3 * FLOAT_SPECIES.vectorByteSize(), LE);
>>     FloatVector vb4 = FloatVector.fromMemorySegment(FLOAT_SPECIES, b2, 3 * FLOAT_SPECIES.vectorByteSize(), LE);
>>     acc4 = va4.fma(vb4, acc4);
>> }
>> ```
>>
>> This gives me:
>>
>> Benchmark                               Mode  Cnt   Score   Error  Units
>> MemorySegmentBench.dotProductArray      avgt   10  69.247 ± 0.477  ns/op
>> MemorySegmentBench.dotProductHeapSeg    avgt   10  75.491 ± 1.024  ns/op
>> MemorySegmentBench.dotProductNativeSeg  avgt   10  76.197 ± 0.172  ns/op
>>
>> Not as fast, but quite a bit faster than the original.
>>
>> Maurizio
>>
>> On 27/06/2025 14:49, Maurizio Cimadamore wrote:
>>> I also tried uncommenting your code to hardcode the limit. Alone, 
>>> that doesn't seem to help, but if I do this:
>>>
>>> ```
>>>         final int limit = 4096; // see how much can be got by just hardcoding the limit
>>>         a = a.asSlice(0, limit);
>>>         b = b.asSlice(0, limit);
>>> ```
>>>
>>> Then I get this:
>>>
>>> ```
>>> Benchmark                               Mode  Cnt   Score   Error  Units
>>> MemorySegmentBench.dotProductArray      avgt   10  75.547 ± 0.898  ns/op
>>> MemorySegmentBench.dotProductHeapSeg    avgt   10  78.485 ± 0.360  ns/op
>>> MemorySegmentBench.dotProductNativeSeg  avgt   10  72.580 ± 0.305  ns/op
>>> ```
>>>
>>> So sometimes re-asserting the bounds of a memory segment can have a
>>> positive effect. That said, it looks as if the JVM should be able to
>>> do a better job here; something seems to be going wrong (a single
>>> check hoisted out of the loop shouldn't cost 20 ns). Maybe the issue
>>> is with the vector load intrinsic, or with a bad interaction between
>>> that intrinsic and the memory segment bounds-check optimizations?
>>>
>>> Vladimir, Quan, could you please take a look? 
>

