Foreign + Vectors - benchmarks for copying and swapping
Paul Sandoz
paul.sandoz at oracle.com
Mon Jun 21 19:33:30 UTC 2021
Ah! Nice find, I forgot about those barriers [1]. We need to discuss with Vladimir; there are some subtle issues here (similar to those for Unsafe access).
I wrote my own benchmark to explore this in more detail:
https://gist.github.com/PaulSandoz/b8b72e9c837cf6462d3b744a264f23c4
Results are in the comments; run against a recent build of github.com/openjdk/jdk/.
Some points of note:
- Vector access to a ByteBuffer or segment is penalized, I think due to the placement of CPU barriers, as you have found (see the sketch after this list).
Disabling the CPU barriers in vectorIntrinsics.cpp improves the performance, but it's still slower than array vector access because the address calculations are not as efficient.
- Array vector access does not result in unrolling, which is why for large inputs `array` is slower than `arrayScalar`.
- Segment scalar access is penalized compared to ByteBuffer (from allocate or allocateDirect) scalar access.
- There is some odd interaction going on between tiered compilation and confined segment access: with tiered compilation it is faster than without.
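As a concrete reference, the two access shapes being compared are roughly the following (a minimal sketch, not the code in the gist; the class and method names are illustrative, and it needs --add-modules jdk.incubator.vector):

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorSpecies;

class CopyShapes {
    static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_PREFERRED;

    // Array vector access: cheap address math, but (per the note above) no unrolling today.
    static void copyArray(byte[] src, byte[] dst) {
        for (int i = 0; i < SPECIES.loopBound(src.length); i += SPECIES.length()) {
            ByteVector.fromArray(SPECIES, src, i).intoArray(dst, i);
        }
    }

    // ByteBuffer vector access: the same loop shape, but subject to the barrier placement
    // and the more expensive address calculations noted above.
    static void copyBuffer(ByteBuffer src, ByteBuffer dst) {
        for (int i = 0; i < SPECIES.loopBound(src.capacity()); i += SPECIES.length()) {
            ByteVector.fromByteBuffer(SPECIES, src, i, ByteOrder.nativeOrder())
                      .intoByteBuffer(dst, i, ByteOrder.nativeOrder());
        }
    }
}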
—
Separately, I suspect we need to enhance the Vector API byte buffer access to ensure the access is scoped?
Paul.
[1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/gc/shared/c2/barrierSetC2.cpp#L64
> On Jun 20, 2021, at 6:48 PM, Radosław Smogura <mail at smogura.eu> wrote:
>
> Hi,
>
> I think I found what's going on.
>
> https://github.com/rsmogura/panama-foreign/commit/64bf7a66a6c3cf5c4ee4f3d0b0e29128bc8e1321
>
> Another approach would be to modify the loop optimizations to find the back control through a chain of projections and mem barriers (or maybe both?)
>
> This idea makes the copy much better (and, when unrolled, even faster than the native one):
>
> VectorCopySegments.copyWithNative 1024 avgt 10 20.293 ± 0.436 ns/op
> VectorCopySegments.copyWithNative 1048576 avgt 10 22270.840 ± 579.533 ns/op
> VectorCopySegments.copyWithNativeShared 1024 avgt 10 15.854 ± 0.061 ns/op
> VectorCopySegments.copyWithNativeShared 1048576 avgt 10 21948.236 ± 43.981 ns/op
> VectorCopySegments.copyWithNativeToArray 1024 avgt 10 20.318 ± 0.347 ns/op
> VectorCopySegments.copyWithNativeToArray 1048576 avgt 10 22142.499 ± 305.501 ns/op
> VectorCopySegments.copyWithVector 1024 avgt 10 31.240 ± 0.333 ns/op
> VectorCopySegments.copyWithVector 1048576 avgt 10 25320.898 ± 118.397 ns/op
> VectorCopySegments.copyWithVectorDirectBuffer 1024 avgt 10 21.605 ± 0.210 ns/op
> VectorCopySegments.copyWithVectorDirectBuffer 1048576 avgt 10 23613.272 ± 1030.153 ns/op
> VectorCopySegments.copyWithVectorShared 1024 avgt 10 19.897 ± 0.485 ns/op
> VectorCopySegments.copyWithVectorShared 1048576 avgt 10 24719.767 ± 453.725 ns/op
> VectorCopySegments.copyWithVectorShuffle 1024 avgt 10 36.364 ± 0.669 ns/op
> VectorCopySegments.copyWithVectorShuffle 1048576 avgt 10 29730.528 ± 339.100 ns/op
> VectorCopySegments.copyWithVectorToArray 1024 avgt 10 29.282 ± 0.338 ns/op
> VectorCopySegments.copyWithVectorToArray 1048576 avgt 10 28502.004 ± 593.347 ns/op
> VectorCopySegments.copyWithVectorUnroller 1024 avgt 10 36.368 ± 0.092 ns/op
> VectorCopySegments.copyWithVectorUnroller 1048576 avgt 10 21528.433 ± 303.141 ns/op
>
>
> Kind regards,
> Rado
>
> From: panama-dev <panama-dev-retn at openjdk.java.net> on behalf of Radosław Smogura <mail at smogura.eu>
> Sent: Saturday, June 19, 2021 01:12
> To: Paul Sandoz <paul.sandoz at oracle.com>; Maurizio Cimadamore <maurizio.cimadamore at oracle.com>
> Cc: panama-dev at openjdk.java.net <panama-dev at openjdk.java.net>
> Subject: RE: Foreign + Vectors - benchmarks for copying and swapping
>
> Hi all,
>
> So I have one more interesting thing: when I change loopBound to
> VectorIntrinsics.roundDown(length, laneCount) - (laneCount - 1)
> (I think it's a better optimization),
> I get these results (please take a look at the drop in avg time) - that's for the 1M size.
>
> # Warmup Iteration 2: 47990.588 ns/op
> # Warmup Iteration 3: 46073.341 ns/op
> # Warmup Iteration 4: 45593.405 ns/op
> # Warmup Iteration 5: 45525.001 ns/op
> Iteration 1: 45921.159 ns/op
> Iteration 2: 46542.631 ns/op
> Iteration 3: 45532.379 ns/op
> Iteration 4: 46862.923 ns/op
> Iteration 5: 49324.919 ns/op
> Iteration 6: 34099.454 ns/op
> Iteration 7: 22315.402 ns/op
> Iteration 8: 22495.426 ns/op
> Iteration 9: 22702.834 ns/op
> Iteration 10: 22675.853 ns/op
>
>
> Result "org.openjdk.bench.jdk.incubator.foreign.VectorCopySegments.copyWithVectorBuff":
> 35847.298 ±(99.9%) 18332.841 ns/op [Average]
> (min, avg, max) = (22315.402, 35847.298, 49324.919), stdev = 12126.039
> CI (99.9%): [17514.457, 54180.139] (assumes normal distribution)
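>
> For clarity, the bound change above amounts to something like this (just a sketch; SPECIES and length stand for the species and copy length used in the benchmark):
>
> int laneCount = SPECIES.length();                    // lanes per vector
> int standardBound = SPECIES.loopBound(length);       // length rounded down to a multiple of laneCount
> int modifiedBound = standardBound - (laneCount - 1); // the variant above
> // For "for (int i = 0; i < bound; i += laneCount)" both bounds permit exactly the same
> // iterations, but with the modified bound "i + laneCount <= length" follows directly from
> // the loop condition, without needing to know that i is a multiple of laneCount.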
>
> Kind regards,
> Rado
> ________________________________
> From: Paul Sandoz <paul.sandoz at oracle.com>
> Sent: Friday, June 18, 2021 23:37
> To: Maurizio Cimadamore <maurizio.cimadamore at oracle.com>
> Cc: Radosław Smogura <mail at smogura.eu>; panama-dev at openjdk.java.net <panama-dev at openjdk.java.net>
> Subject: Re: Foreign + Vectors - benchmarks for copying and swapping
>
>
>
> > On Jun 18, 2021, at 2:03 PM, Maurizio Cimadamore <Maurizio.Cimadamore at Oracle.COM> wrote:
> >
> >
> > On 18/06/2021 20:55, Paul Sandoz wrote:
> >> The order declared in the vector load/store overrides any order declared on the buffer (we should make the specification clearer in that respect). (In this case the source is bytes, so there is no swapping.)
> > Doh - right!
> >>
> >> —
> >>
> >> There is something odd going on when tiered compilation is switched off: the result for copyWithVector is much worse for smaller sizes.
> >
> > Is this what Uwe is seeing I wonder?
> >
> > https://github.com/apache/lucene/pull/177#issuecomment-861265227
> >
>
> Possibly.
>
>
> >>
> >> With larger sizes, with and without tiered, similar results are observed with similar generated code (of lower quality than with tiered for smaller sizes, oddly enough).
> >>
> >> Whether tiered is enabled or not there is no loop unrolling.
> >>
> >> I think something may have regressed, although we have previously focused more on array access than buffer access.
> > Is the vector implementation performing a bulk copy into a byte array IIRC? If so, maybe there's an issue with bulk copy - which would be the same issue we're seeing on the memory access front?
>
>
> No, the intrinsic byte vector access to a byte buffer works similarly to intrinsic byte vector access to a byte array, using the buffer’s base and offset (to calculate the address relative to the base).
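>
> Conceptually (this is not the actual JDK code, just the shape the address computation takes in each case):
>
> // heap byte[]:        base = the array,          offset = Unsafe.ARRAY_BYTE_BASE_OFFSET + i
> // heap ByteBuffer:    base = the backing byte[], offset = Unsafe.ARRAY_BYTE_BASE_OFFSET + bufferOffset + i
> // direct ByteBuffer:  base = null,               offset = bufferAddress + i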
>
> Paul.