Foreign + Vectors - benchmarks for copying and swapping

Thu Jul 1 21:25:43 UTC 2021

Paul,

I did a few other checks, and unrolling does not work for short arrays either.

As the maximum stride size is 8, unrolling will work for AVX 256, but int arrays with AVX 512 may not get unrolled (unfortunately I don't have a way to test it).
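
To make the arithmetic concrete, here is a rough sketch of how lane counts compare with that limit. It assumes the loop index advances by one vector's worth of elements per iteration, and that the limit is a plain "stride <= 8" check; the class and method names are made up for illustration and are not taken from HotSpot.

```java
public class StrideCheck {
    // The value 8 is the stride limit mentioned in this message.
    // How C2 actually applies it is an assumption of this sketch.
    static final int ASSUMED_MAX_STRIDE = 8;

    // Number of elements (lanes) that fit in one vector register.
    static int lanes(int vectorBits, int elementBits) {
        return vectorBits / elementBits;
    }

    // True when a loop stepping by one full vector stays within the assumed limit.
    static boolean withinStrideLimit(int vectorBits, int elementBits) {
        return lanes(vectorBits, elementBits) <= ASSUMED_MAX_STRIDE;
    }

    public static void main(String[] args) {
        int[] vectorSizes = {256, 512};
        String[] names = {"byte", "short", "int", "long"};
        int[] elementBits = {8, 16, 32, 64};
        for (int v : vectorSizes) {
            for (int k = 0; k < names.length; k++) {
                System.out.printf("%d-bit %-5s lanes=%2d withinLimit=%b%n",
                        v, names[k], lanes(v, elementBits[k]),
                        withinStrideLimit(v, elementBits[k]));
            }
        }
    }
}
```

Under these assumptions, int and long fit within the limit at 256 bits while byte and short do not, which would be consistent with what I saw; int at 512 bits gives 16 lanes and would exceed it.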

I think I'll send this change to the HotSpot team as a PR, and we'll see what they say about the stride size.

Kind regards,
Rado

________________________________
From: Radosław Smogura on behalf of Radosław Smogura <mail at smogura.eu>
Sent: Friday, 25 June 2021 03:51
To: Radosław Smogura <mail at smogura.eu>; Paul Sandoz <paul.sandoz at oracle.com>
Cc: Maurizio Cimadamore <maurizio.cimadamore at oracle.com>; Vladimir Ivanov <vladimir.x.ivanov at oracle.com>; panama-dev at openjdk.java.net <panama-dev at openjdk.java.net>
Subject: RE: Foreign + Vectors - benchmarks for copying and swapping

Hi,

I think I've found this condition. Loop unrolling is limited based on stride size.

I made a small change for fun:

https://github.com/rsmogura/panama-foreign/commit/0b991797a1e647a2f4a960cf6253708139d57975

Before
Benchmark                       (size)  Mode  Cnt      Score     Error  Units
TestLoadStoreShort.bufferHeap  1048576  avgt   10  22010.657 ± 127.671  ns/op

After MICRO="OPTIONS=-f 1 -p size=1048576 -jvmArgs=-XX:+UnlockExperimentalVMOptions -jvmArgs=-XX:+LoopUnrollAggressiveStrideLimit"
Benchmark                       (size)  Mode  Cnt      Score     Error  Units
TestLoadStoreShort.bufferHeap  1048576  avgt   10  22000.258 ± 482.501  ns/op

I'm not sure why there's a limit on stride size. I only found [1] related to this code, which made the stride limit tighter, but I'm not sure what the reasoning behind this limit is.

[1] https://github.com/openjdk/panama-foreign/commit/2683d5390bd58683ae13bdd8582127c308d8fd04

BR,
Rado
________________________________
From: panama-dev <panama-dev-retn at openjdk.java.net> on behalf of Radosław Smogura <mail at smogura.eu>
Sent: Thursday, 24 June 2021 23:38
To: Paul Sandoz <paul.sandoz at oracle.com>
Cc: Maurizio Cimadamore <maurizio.cimadamore at oracle.com>; Vladimir Ivanov <vladimir.x.ivanov at oracle.com>; panama-dev at openjdk.java.net <panama-dev at openjdk.java.net>
Subject: RE: Foreign + Vectors - benchmarks for copying and swapping

Hi,

That's a good point - worth checking.

For now, I've found that working with a debugger may not be the best approach because of OSR.

For TestLoadStoreInts, running it with -XX:-UseOnStackReplacement -jvmArgs=-XX:-UseLoopCounter does not produce significant changes. For smaller regions it's even slower (but this may be due to the benchmark logic in JMH itself; I guess there's some loop that will not get compiled).

However, for the test VectorCopySegments.copyWithVector there's a change: 25521.137 -> 23148.839 ns/op. I guess it's because the method gets fully compiled.

BR,
Rado
________________________________
From: Paul Sandoz <paul.sandoz at oracle.com>
Sent: Thursday, 24 June 2021 17:26
To: Radosław Smogura <mail at smogura.eu>
Cc: Maurizio Cimadamore <maurizio.cimadamore at oracle.com>; Vladimir Ivanov <vladimir.x.ivanov at oracle.com>; panama-dev at openjdk.java.net <panama-dev at openjdk.java.net>
Subject: Re: Foreign + Vectors - benchmarks for copying and swapping

Thanks for checking other types. A curious restriction! Perhaps we can gain more insight by tracing with a fast debug build? e.g. see trace usages in src/hotspot/share/opto/loopTransform.cpp

Paul.

> On Jun 23, 2021, at 3:23 PM, Radosław Smogura <mail at smogura.eu> wrote:
>
> Hi Paul,
>
> Thank you.
>
> I checked with LongVector and got the same results as with int (4x loop unroll). With a short vector I could not get the loop to unroll. I only focused on the bufferHeap method.
>
> Kind regards,
> Rado
>
> From: Paul Sandoz <paul.sandoz at oracle.com>
> Sent: Wednesday, 23 June 2021 22:32
> To: Radosław Smogura <mail at smogura.eu>
> Cc: Maurizio Cimadamore <maurizio.cimadamore at oracle.com>; Vladimir Ivanov <vladimir.x.ivanov at oracle.com>; panama-dev at openjdk.java.net <panama-dev at openjdk.java.net>
> Subject: Re: Foreign + Vectors - benchmarks for copying and swapping
>
> Here you go:
>
> https://gist.github.com/PaulSandoz/7d95a4d9b99b5f9f9c6326f65a4d77c8
>
> Details in comments at the end.
>
> Paul.
>
> > On Jun 23, 2021, at 11:41 AM, Radosław Smogura <mail at smogura.eu> wrote:
> >
> > Hi Paul,
> >
> > Can you share the code? I could not get the loop to unroll. I can only eliminate range checks, and that's all.
> >
> > In fact it's a bit odd, as the code for loading int and byte vectors looks the same.
> >
> > I've got a few suspicions about why ByteBuffer vectors can be harder to optimize:
> >        • array length is taken from constant memory
> >        • array length is non-negative
> >
> > Kind regards,
> > Rado
> >
> > From: Paul Sandoz <paul.sandoz at oracle.com>
> > Sent: Tuesday, 22 June 2021 22:29
> > To: Radosław Smogura <mail at smogura.eu>
> > Cc: Maurizio Cimadamore <maurizio.cimadamore at oracle.com>; Vladimir Ivanov <vladimir.x.ivanov at oracle.com>; panama-dev at openjdk.java.net <panama-dev at openjdk.java.net>
> > Subject: Re: Foreign + Vectors - benchmarks for copying and swapping
> >
> > In general that should be ok. Try using IntVector instead and it will unroll (with your patch removing CPU barriers)
> >
> > I wonder if this may be a limitation specific to bytes.
> >
> > Paul.
> >
> > > On Jun 21, 2021, at 4:28 PM, Radosław Smogura <mail at smogura.eu> wrote:
> > >
> > > Hi,
> > >
> > > I think the copy case may fail to unroll because:
> > >        • loop unroll takes the range check from intoByteBuffer as the loop exit condition
> > >        • the range check uses unsigned compare, which is not supported by loop unroll
> > >
> > > I think in this code
> > >         for (int i = 0; i < bound; i += lanes) {
> > >           final var srcVector = ByteVector
> > >               .fromByteBuffer(BYTE_VECTOR_SPECIES, src, i, ByteOrder.nativeOrder());
> > >
> > >           srcVector.intoByteBuffer(dst, i, ByteOrder.nativeOrder());
> > >         }
> > > the exit condition should be i < bound, not a range check from intoByteBuffer.
> > >
> > > Kind regards,
> > > Rado
> > >
> > > From: Paul Sandoz <paul.sandoz at oracle.com>
> > > Sent: Monday, 21 June 2021 23:25
> > > To: Maurizio Cimadamore <maurizio.cimadamore at oracle.com>
> > > Cc: Radosław Smogura <mail at smogura.eu>; Vladimir Ivanov <vladimir.x.ivanov at oracle.com>; panama-dev at openjdk.java.net <panama-dev at openjdk.java.net>
> > > Subject: Re: Foreign + Vectors - benchmarks for copying and swapping
> > >
> > > Replacing the upper bound in `segmentImplicitScalar` with a constant (1024 say) results in a similar time to `bufferNativeScalar` without a constant bound, both of which (alas) are still slower than scalar array access (which benefits greatly from auto-vectorization).
> > >
> > > I wonder if the segment subrange checking for int value ranges is having an impact on bounds checking?
> > >
> > > Paul.
> > >
> > > > On Jun 21, 2021, at 1:56 PM, Maurizio Cimadamore <maurizio.cimadamore at oracle.com> wrote:
> > > >
> > > >
> > > > On 21/06/2021 20:33, Paul Sandoz wrote:
> > > >> - Segment scalar access is penalized compared to ByteBuffer (from allocate or allocateDirect) scalar access.
> > > >
> > > > Odd
> > > >
> > > > We have many benchmarks similar to this (see LoopOverNonConstant) and they seem to offer the same level of performance compared with ByteBuffers.
> > > >
> > > > I wonder if the loop limit being "SPECIES.loopBound(srcArray.length)" plays a role? Have you tried replacing that expression with a constant?
> > > >
> > > > Maurizio
> > > >