Foreign + Vectors - benchmarks for copying and swapping
Paul Sandoz
paul.sandoz at oracle.com
Tue Jun 22 00:35:38 UTC 2021
AFAICT the generated code for `segmentImplicitScalar` with and without a constant bound is similar. In both cases there is a core loop that is efficiently unrolled 32x but not vectorized.
However, perfasm shows that with the non-constant upper bound, most of the time is being spent in the inner post loop, implying there might be a bug in C2’s strip mining code gen.
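For reference, here is a minimal sketch of the two loop shapes being compared. It is not the benchmark source. It assumes a direct `ByteBuffer` as a stand-in for the memory segment so that it runs without incubator modules. The class name `LoopBoundSketch`, the helper `loopBound` (which mimics what `SPECIES.loopBound(length)` returns, the largest multiple of the lane count not exceeding `length`), and the lane count of 32 are all hypothetical.

```java
import java.nio.ByteBuffer;

public class LoopBoundSketch {
    // Assumed constant limit, matching the "1024 say" from the thread.
    static final int CONSTANT_BOUND = 1024;

    // Hypothetical stand-in for SPECIES.loopBound(length):
    // the largest multiple of `lanes` that is <= length.
    static int loopBound(int length, int lanes) {
        return length - (length % lanes);
    }

    // Scalar copy whose loop limit is only known at run time.
    static void copyNonConstantBound(ByteBuffer src, ByteBuffer dst, int bound) {
        for (int i = 0; i < bound; i++) {
            dst.put(i, src.get(i));
        }
    }

    // Same loop body, but the limit is a compile-time constant.
    static void copyConstantBound(ByteBuffer src, ByteBuffer dst) {
        for (int i = 0; i < CONSTANT_BOUND; i++) {
            dst.put(i, src.get(i));
        }
    }

    public static void main(String[] args) {
        ByteBuffer src = ByteBuffer.allocateDirect(CONSTANT_BOUND);
        ByteBuffer a = ByteBuffer.allocateDirect(CONSTANT_BOUND);
        ByteBuffer b = ByteBuffer.allocateDirect(CONSTANT_BOUND);
        for (int i = 0; i < CONSTANT_BOUND; i++) {
            src.put(i, (byte) i);
        }

        // Assumed lane count of 32 (e.g. bytes in a 256-bit vector).
        int bound = loopBound(src.capacity(), 32);
        copyNonConstantBound(src, a, bound);
        copyConstantBound(src, b);

        // Both variants should produce identical contents here,
        // since 1024 is already a multiple of 32.
        System.out.println("copies match: " + a.equals(b));
    }
}
```

Both methods do the same work; only the form of the loop limit differs, which is the variable being isolated in the measurements above. Timing them meaningfully would need a JMH harness rather than `main`.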
Paul.
> On Jun 21, 2021, at 2:25 PM, Paul Sandoz <paul.sandoz at oracle.com> wrote:
>
> Replacing the upper bound in `segmentImplicitScalar` with a constant (1024 say) results in a similar time to `bufferNativeScalar` without a constant bound, both of which (alas) are still slower than scalar array access (which benefits greatly from auto-vectorization).
>
> I wonder if the segment subrange checking for int value ranges is having an impact on bounds checking?
>
> Paul.
>
>> On Jun 21, 2021, at 1:56 PM, Maurizio Cimadamore <maurizio.cimadamore at oracle.com> wrote:
>>
>>
>> On 21/06/2021 20:33, Paul Sandoz wrote:
>>> - Segment scalar access is penalized compared to ByteBuffer (from allocate or allocateDirect) scalar access.
>>
>> Odd
>>
>> We have many benchmarks similar to this (see LoopOverNonConstant) and they seem to offer the same level of performance as ByteBuffers.
>>
>> I wonder if the loop limit being "SPECIES.loopBound(srcArray.length)" plays a role? Have you tried replacing that expression with a constant?
>>
>> Maurizio
>>
>