Re: Foreign + Vectors - benchmarks for copying and swapping

Radosław Smogura mail at smogura.eu
Thu Jun 24 21:38:32 UTC 2021


Hi,

That's a good point - worth checking.

For now I have found that using a debugger may not be the best approach, because of OSR (on-stack replacement).

For TestLoadStoreInts, running with -XX:-UseOnStackReplacement -jvmArgs=-XX:-UseLoopCounter does not change the results significantly. For smaller regions it is even slower (but this may be due to the benchmark logic in JMH itself; I guess there is some loop that does not get compiled).

However, for the test VectorCopySegments.copyWithVector there is a change: 25521.137 -> 23148.839 ns/op. I guess this is because the method gets fully compiled.

BR,
Rado
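
A minimal sketch of why OSR can matter for numbers like these, assuming plain HotSpot behaviour: a loop inside a method that is invoked only once can be compiled only through on-stack replacement, while a method that is invoked many times can be compiled in the normal way. The class and method names are hypothetical and are not taken from the benchmarks; whether the JMH-generated loop behaves like this is only a guess.

```java
public class OsrSketch {
    // Invoked once: only the loop's back edges get hot, so without OSR
    // this invocation would be expected to stay in the interpreter.
    static long sumOnce(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += i;
        }
        return sum;
    }

    // Invoked many times with a short loop: the invocation counter can
    // trigger a regular (non-OSR) compilation of the whole method.
    static long sumChunk(int from, int to) {
        long sum = 0;
        for (int i = from; i < to; i++) {
            sum += i;
        }
        return sum;
    }

    public static void main(String[] args) {
        long once = sumOnce(1_000_000);
        long chunked = 0;
        for (int c = 0; c < 1_000; c++) {
            chunked += sumChunk(c * 1_000, (c + 1) * 1_000);
        }
        // Both strategies compute the same value; only the way the JIT
        // reaches compiled code differs.
        System.out.println(once == chunked);
    }
}
```

Running this with and without -XX:-UseOnStackReplacement and timing the two variants separately is one way to see the effect in isolation.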
________________________________
From: Paul Sandoz <paul.sandoz at oracle.com>
Sent: Thursday, 24 June 2021 17:26
To: Radosław Smogura <mail at smogura.eu>
Cc: Maurizio Cimadamore <maurizio.cimadamore at oracle.com>; Vladimir Ivanov <vladimir.x.ivanov at oracle.com>; panama-dev at openjdk.java.net <panama-dev at openjdk.java.net>
Subject: Re: Foreign + Vectors - benchmarks for copying and swapping

Thanks for checking other types. A curious restriction! Perhaps we can gain more insight by tracing with a fast debug build? e.g. see trace usages in src/hotspot/share/opto/loopTransform.cpp

Paul.

> On Jun 23, 2021, at 3:23 PM, Radosław Smogura <mail at smogura.eu> wrote:
>
> Hi Paul,
>
> Thank you.
>
> I checked with LongVector and got the same results as with int (4x loop unroll). With a short vector I could not get the loop to unroll. I focused only on the bufferHeap method.
>
> Kind regards,
> Rado
>
> From: Paul Sandoz <paul.sandoz at oracle.com>
> Sent: Wednesday, 23 June 2021 22:32
> To: Radosław Smogura <mail at smogura.eu>
> Cc: Maurizio Cimadamore <maurizio.cimadamore at oracle.com>; Vladimir Ivanov <vladimir.x.ivanov at oracle.com>; panama-dev at openjdk.java.net <panama-dev at openjdk.java.net>
> Subject: Re: Foreign + Vectors - benchmarks for copying and swapping
>
> Here you go:
>
> https://gist.github.com/PaulSandoz/7d95a4d9b99b5f9f9c6326f65a4d77c8
>
> Details in comments at the end.
>
> Paul.
>
> > On Jun 23, 2021, at 11:41 AM, Radosław Smogura <mail at smogura.eu> wrote:
> >
> > Hi Paul,
> >
> > Can you share the code? I could not get the loop to unroll. I can only eliminate the range checks, and that's all.
> >
> > In fact it's a bit odd, as the code for loading int and byte vectors looks the same.
> >
> > I have a few suspicions about why ByteBuffer vectors can be harder to optimize:
> >        • array length is taken from constant memory
> >        • array length is non-negative
> >
> > Kind regards,
> > Rado
> >
> > From: Paul Sandoz <paul.sandoz at oracle.com>
> > Sent: Tuesday, 22 June 2021 22:29
> > To: Radosław Smogura <mail at smogura.eu>
> > Cc: Maurizio Cimadamore <maurizio.cimadamore at oracle.com>; Vladimir Ivanov <vladimir.x.ivanov at oracle.com>; panama-dev at openjdk.java.net <panama-dev at openjdk.java.net>
> > Subject: Re: Foreign + Vectors - benchmarks for copying and swapping
> >
> > In general that should be ok. Try using IntVector instead and it will unroll (with your patch removing CPU barriers)
> >
> > I wonder if this may be a limitation specific to bytes.
> >
> > Paul.
> >
> > > On Jun 21, 2021, at 4:28 PM, Radosław Smogura <mail at smogura.eu> wrote:
> > >
> > > Hi,
> > >
> > > > I think the copy case may fail to unroll because:
> > >        • loop unroll takes the range check from intoByteBuffer as the loop exit condition
> > >        • the range check uses unsigned compare, which is not supported by loop unroll
> > >
> > > I think in this code
> > >         for (int i = 0; i < bound; i += lanes) {
> > >           final var srcVector = ByteVector
> > >               .fromByteBuffer(BYTE_VECTOR_SPECIES, src, i, ByteOrder.nativeOrder());
> > >
> > >           srcVector.intoByteBuffer(dst, i, ByteOrder.nativeOrder());
> > >         }
> > > > the exit condition should be i < bound, not a range check from intoByteBuffer.
> > >
> > > Kind regards,
> > > Rado
> > >
> > > > From: Paul Sandoz <paul.sandoz at oracle.com>
> > > > Sent: Monday, 21 June 2021 23:25
> > > > To: Maurizio Cimadamore <maurizio.cimadamore at oracle.com>
> > > > Cc: Radosław Smogura <mail at smogura.eu>; Vladimir Ivanov <vladimir.x.ivanov at oracle.com>; panama-dev at openjdk.java.net <panama-dev at openjdk.java.net>
> > > > Subject: Re: Foreign + Vectors - benchmarks for copying and swapping
> > >
> > > > Replacing the upper bound in `segmentImplicitScalar` with a constant (1024, say) results in a similar time to `bufferNativeScalar` without a constant bound, both of which (alas) are still slower than scalar array access (which benefits greatly from auto-vectorization).
> > >
> > > I wonder if the segment subrange checking for int value ranges is having an impact on bounds checking?
> > >
> > > Paul.
> > >
> > > > On Jun 21, 2021, at 1:56 PM, Maurizio Cimadamore <maurizio.cimadamore at oracle.com> wrote:
> > > >
> > > >
> > > > On 21/06/2021 20:33, Paul Sandoz wrote:
> > > >> - Segment scalar access is penalized compared to ByteBuffer (from allocate or allocateDirect) scalar access.
> > > >
> > > > Odd
> > > >
> > > > We have many benchmarks similar to this (see LoopOverNonConstant) and they seem to offer same level of performance compared with ByteBuffers.
> > > >
> > > > I wonder if the loop limit being "SPECIES.loopBound(srcArray.length)" plays a role? Have you tried replacing that expression with a constant?
> > > >
> > > > Maurizio
> > > >


