Foreign + Vectors - benchmarks for copying and swapping
Paul Sandoz
paul.sandoz at oracle.com
Mon Jun 21 19:33:30 UTC 2021
Ah! Nice find, I forgot about those barriers [1]. We need to discuss with Vladimir; there are some subtle issues here (similar to those for Unsafe access).
I wrote my own benchmark to explore this in more detail:
https://gist.github.com/PaulSandoz/b8b72e9c837cf6462d3b744a264f23c4
Results are in the comments; run against a recent build of github.com/openjdk/jdk/.
Some points of note:
- Vector access to a ByteBuffer or segment is penalized, I think due to the placement of CPU barriers, as you have found (see the sketch after this list).
Disabling the CPU barriers in vectorIntrinsics.cpp improves the performance, but it's still slower than array vector access because the address calculations are not as efficient.
- Array vector access does not result in unrolling, which is why for large inputs `array` is slower than `arrayScalar`.
- Segment scalar access is penalized compared to ByteBuffer (from allocate or allocateDirect) scalar access.
- There is some odd interaction going on between tiered compilation and confined segment access: with tiered compilation it is faster than without.
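As a concrete reference, the two access shapes being compared are roughly the following (a minimal sketch, not the code in the gist; the class and method names are illustrative, and it needs --add-modules jdk.incubator.vector):

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorSpecies;

class CopyShapes {
    static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_PREFERRED;

    // Array vector access: cheap address math, but (per the note above) no unrolling today.
    static void copyArray(byte[] src, byte[] dst) {
        for (int i = 0; i < SPECIES.loopBound(src.length); i += SPECIES.length()) {
            ByteVector.fromArray(SPECIES, src, i).intoArray(dst, i);
        }
    }

    // ByteBuffer vector access: the same loop shape, but subject to the barrier placement
    // and the more expensive address calculations noted above.
    static void copyBuffer(ByteBuffer src, ByteBuffer dst) {
        for (int i = 0; i < SPECIES.loopBound(src.capacity()); i += SPECIES.length()) {
            ByteVector.fromByteBuffer(SPECIES, src, i, ByteOrder.nativeOrder())
                      .intoByteBuffer(dst, i, ByteOrder.nativeOrder());
        }
    }
}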
—
Separately, I suspect we need to enhance the Vector API byte buffer access to ensure the access is scoped?
Paul.
[1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/gc/shared/c2/barrierSetC2.cpp#L64
> On Jun 20, 2021, at 6:48 PM, Radosław Smogura <mail at smogura.eu> wrote:
>
> Hi,
>
> I think I found what's going on.
>
> https://github.com/rsmogura/panama-foreign/commit/64bf7a66a6c3cf5c4ee4f3d0b0e29128bc8e1321
>
> Another approach would be to modify the loop optimizations to find the back control through a chain of projections and mem barriers (or maybe both?)
>
> This idea makes the copy much better (and, when unrolled, even faster than the native one):
>
> VectorCopySegments.copyWithNative 1024 avgt 10 20.293 ± 0.436 ns/op
> VectorCopySegments.copyWithNative 1048576 avgt 10 22270.840 ± 579.533 ns/op
> VectorCopySegments.copyWithNativeShared 1024 avgt 10 15.854 ± 0.061 ns/op
> VectorCopySegments.copyWithNativeShared 1048576 avgt 10 21948.236 ± 43.981 ns/op
> VectorCopySegments.copyWithNativeToArray 1024 avgt 10 20.318 ± 0.347 ns/op
> VectorCopySegments.copyWithNativeToArray 1048576 avgt 10 22142.499 ± 305.501 ns/op
> VectorCopySegments.copyWithVector 1024 avgt 10 31.240 ± 0.333 ns/op
> VectorCopySegments.copyWithVector 1048576 avgt 10 25320.898 ± 118.397 ns/op
> VectorCopySegments.copyWithVectorDirectBuffer 1024 avgt 10 21.605 ± 0.210 ns/op
> VectorCopySegments.copyWithVectorDirectBuffer 1048576 avgt 10 23613.272 ± 1030.153 ns/op
> VectorCopySegments.copyWithVectorShared 1024 avgt 10 19.897 ± 0.485 ns/op
> VectorCopySegments.copyWithVectorShared 1048576 avgt 10 24719.767 ± 453.725 ns/op
> VectorCopySegments.copyWithVectorShuffle 1024 avgt 10 36.364 ± 0.669 ns/op
> VectorCopySegments.copyWithVectorShuffle 1048576 avgt 10 29730.528 ± 339.100 ns/op
> VectorCopySegments.copyWithVectorToArray 1024 avgt 10 29.282 ± 0.338 ns/op
> VectorCopySegments.copyWithVectorToArray 1048576 avgt 10 28502.004 ± 593.347 ns/op
> VectorCopySegments.copyWithVectorUnroller 1024 avgt 10 36.368 ± 0.092 ns/op
> VectorCopySegments.copyWithVectorUnroller 1048576 avgt 10 21528.433 ± 303.141 ns/op
>
>
> Kind regards,
> Rado
>
> From: panama-dev <panama-dev-retn at openjdk.java.net> on behalf of Radosław Smogura <mail at smogura.eu>
> Sent: Saturday, June 19, 2021 01:12
> To: Paul Sandoz <paul.sandoz at oracle.com>; Maurizio Cimadamore <maurizio.cimadamore at oracle.com>
> Cc: panama-dev at openjdk.java.net <panama-dev at openjdk.java.net>
> Subject: RE: Foreign + Vectors - benchmarks for copying and swapping
>
> Hi all,
>
> So I have one more interesting thing: when I change loopBound to
> VectorIntrinsics.roundDown(length, laneCount) - (laneCount - 1)
> (I think it's a better optimization),
> I get these results (please take a look at the drop in avg time) - that's for the 1M size.
>
> # Warmup Iteration 2: 47990.588 ns/op
> # Warmup Iteration 3: 46073.341 ns/op
> # Warmup Iteration 4: 45593.405 ns/op
> # Warmup Iteration 5: 45525.001 ns/op
> Iteration 1: 45921.159 ns/op
> Iteration 2: 46542.631 ns/op
> Iteration 3: 45532.379 ns/op
> Iteration 4: 46862.923 ns/op
> Iteration 5: 49324.919 ns/op
> Iteration 6: 34099.454 ns/op
> Iteration 7: 22315.402 ns/op
> Iteration 8: 22495.426 ns/op
> Iteration 9: 22702.834 ns/op
> Iteration 10: 22675.853 ns/op
>
>
> Result "org.openjdk.bench.jdk.incubator.foreign.VectorCopySegments.copyWithVectorBuff":
> 35847.298 ±(99.9%) 18332.841 ns/op [Average]
> (min, avg, max) = (22315.402, 35847.298, 49324.919), stdev = 12126.039
> CI (99.9%): [17514.457, 54180.139] (assumes normal distribution)
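>
> For clarity, the bound change above amounts to something like this (just a sketch; SPECIES and length stand for the species and copy length used in the benchmark):
>
> int laneCount = SPECIES.length();                    // lanes per vector
> int standardBound = SPECIES.loopBound(length);       // length rounded down to a multiple of laneCount
> int modifiedBound = standardBound - (laneCount - 1); // the variant above
> // For "for (int i = 0; i < bound; i += laneCount)" both bounds permit exactly the same
> // iterations, but with the modified bound "i + laneCount <= length" follows directly from
> // the loop condition, without needing to know that i is a multiple of laneCount.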
>
> Kind regards,
> Rado
> ________________________________
> From: Paul Sandoz <paul.sandoz at oracle.com>
> Sent: Friday, June 18, 2021 23:37
> To: Maurizio Cimadamore <maurizio.cimadamore at oracle.com>
> Cc: Radosław Smogura <mail at smogura.eu>; panama-dev at openjdk.java.net <panama-dev at openjdk.java.net>
> Subject: Re: Foreign + Vectors - benchmarks for copying and swapping
>
>
>
> > On Jun 18, 2021, at 2:03 PM, Maurizio Cimadamore <Maurizio.Cimadamore at Oracle.COM> wrote:
> >
> >
> > On 18/06/2021 20:55, Paul Sandoz wrote:
> >> The order declared in the vector load/store overrides any order declared on the buffer (we should make the specification clearer in that respect). (In this case the source is bytes, so there is no swapping.)
> > Doh - right!
> >>
> >> —
> >>
> >> There is something odd going on when tiered compilation is switched off: the result for copyWithVector is much worse for smaller sizes.
> >
> > Is this what Uwe is seeing I wonder?
> >
> > https://github.com/apache/lucene/pull/177#issuecomment-861265227
> >
>
> Possibly.
>
>
> >>
> >> With larger sizes, with and without tiered, similar results are observed with similar generated code (of lower quality than with tiered for smaller sizes, oddly enough).
> >>
> >> Whether tiered is enabled or not there is no loop unrolling.
> >>
> >> I think something may have regressed, although we have previously focused more on array access than buffer access.
> > Is the vector implementation performing a bulk copy into a byte array IIRC? If so, maybe there's an issue with bulk copy - which would be the same issue we're seeing on the memory access front?
>
>
> No, the intrinsic byte vector access to a byte buffer works similarly to intrinsic byte vector access to a byte array, using the buffer’s base and offset (to calculate the address relative to the base).
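>
> Conceptually (this is not the actual JDK code, just the shape the address computation takes in each case):
>
> // heap byte[]:        base = the array,          offset = Unsafe.ARRAY_BYTE_BASE_OFFSET + i
> // heap ByteBuffer:    base = the backing byte[], offset = Unsafe.ARRAY_BYTE_BASE_OFFSET + bufferOffset + i
> // direct ByteBuffer:  base = null,               offset = bufferAddress + i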
>
> Paul.