Foreign + Vectors - benchmarks for copying and swapping

Tue Jun 22 00:35:38 UTC 2021

AFAICT the generated code for `segmentImplicitScalar` with and without a constant bound is similar. In both cases there is a core loop that is efficiently unrolled 32x but not vectorized. 

However, perfasm shows that with the non-constant upper bound, most of the time is being spent in the inner post loop, implying there might be a bug in C2’s strip mining code gen.
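
For reference, the two loop shapes being compared can be sketched roughly as below. This is not the benchmark source: the class and method names are hypothetical, and a direct `ByteBuffer` stands in for the memory segment so the sketch only depends on stable JDK API. It only shows the difference between a bound the JIT sees as a constant and one it has to read at run time; it makes no claim about the code C2 generates for either.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch; names are illustrative and not taken from the benchmark.
class LoopBoundSketch {
    // The constant suggested in the thread.
    static final int CONSTANT_BOUND = 1024;

    // Upper bound arrives as a parameter, so it is not a compile-time constant.
    static void copyVariableBound(ByteBuffer src, ByteBuffer dst, int bound) {
        for (int i = 0; i < bound; i++) {
            dst.put(i, src.get(i)); // absolute access, one byte per iteration
        }
    }

    // Same loop body, but the trip count is a static final constant.
    static void copyConstantBound(ByteBuffer src, ByteBuffer dst) {
        for (int i = 0; i < CONSTANT_BOUND; i++) {
            dst.put(i, src.get(i));
        }
    }

    // Helper: direct buffer of the given size whose byte at index i is (byte) i.
    static ByteBuffer filled(int size) {
        ByteBuffer buf = ByteBuffer.allocateDirect(size);
        for (int i = 0; i < size; i++) {
            buf.put(i, (byte) i);
        }
        return buf;
    }

    public static void main(String[] args) {
        ByteBuffer src = filled(CONSTANT_BOUND);
        ByteBuffer a = ByteBuffer.allocateDirect(CONSTANT_BOUND);
        ByteBuffer b = ByteBuffer.allocateDirect(CONSTANT_BOUND);
        copyVariableBound(src, a, src.capacity());
        copyConstantBound(src, b);
        System.out.println("copies match: " + a.equals(b));
    }
}
```

Both methods do the same work; in a JMH run the only intended difference between them would be how the loop limit is known to the compiler.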

Paul.

> On Jun 21, 2021, at 2:25 PM, Paul Sandoz <paul.sandoz at oracle.com> wrote:
> 
> Replacing the upper bound in `segmentImplicitScalar` with a constant (say, 1024) results in a similar time to `bufferNativeScalar` without a constant bound, both of which (alas) are still slower than scalar array access (which benefits greatly from auto-vectorization).
> 
> I wonder if the segment subrange checking for int value ranges is having an impact on bounds checking?
> 
> Paul.
> 
>> On Jun 21, 2021, at 1:56 PM, Maurizio Cimadamore <maurizio.cimadamore at oracle.com> wrote:
>> 
>> 
>> On 21/06/2021 20:33, Paul Sandoz wrote:
>>> - Segment scalar access is penalized compared to ByteBuffer (from allocate or allocateDirect) scalar access.
>> 
>> Odd
>> 
>> We have many benchmarks similar to this (see LoopOverNonConstant) and they seem to offer the same level of performance as ByteBuffers.
>> 
>> I wonder if the loop limit being "SPECIES.loopBound(srcArray.length)" plays a role? Have you tried replacing that expression with a constant?
>> 
>> Maurizio
>> 
>