Re: Foreign + Vectors - benchmarks for copying and swapping
Radosław Smogura
mail at smogura.eu
Mon Jun 21 01:48:59 UTC 2021
Hi,
I think I found what's going on.
https://github.com/rsmogura/panama-foreign/commit/64bf7a66a6c3cf5c4ee4f3d0b0e29128bc8e1321
Commit title: "Remove mem barriers after vector ops to increase performance". Removing the memory barriers increases performance by allowing the loop to unroll.
Another approach would be to modify the loop optimizations so that they find the back control through a chain of projections and memory barriers (or maybe do both?).
This change makes the copy much better (and even faster than the native one in the unrolled variant):
VectorCopySegments.copyWithNative                 1024  avgt  10     20.293 ±    0.436  ns/op
VectorCopySegments.copyWithNative              1048576  avgt  10  22270.840 ±  579.533  ns/op
VectorCopySegments.copyWithNativeShared           1024  avgt  10     15.854 ±    0.061  ns/op
VectorCopySegments.copyWithNativeShared        1048576  avgt  10  21948.236 ±   43.981  ns/op
VectorCopySegments.copyWithNativeToArray          1024  avgt  10     20.318 ±    0.347  ns/op
VectorCopySegments.copyWithNativeToArray       1048576  avgt  10  22142.499 ±  305.501  ns/op
VectorCopySegments.copyWithVector                 1024  avgt  10     31.240 ±    0.333  ns/op
VectorCopySegments.copyWithVector              1048576  avgt  10  25320.898 ±  118.397  ns/op
VectorCopySegments.copyWithVectorDirectBuffer     1024  avgt  10     21.605 ±    0.210  ns/op
VectorCopySegments.copyWithVectorDirectBuffer  1048576  avgt  10  23613.272 ± 1030.153  ns/op
VectorCopySegments.copyWithVectorShared           1024  avgt  10     19.897 ±    0.485  ns/op
VectorCopySegments.copyWithVectorShared        1048576  avgt  10  24719.767 ±  453.725  ns/op
VectorCopySegments.copyWithVectorShuffle          1024  avgt  10     36.364 ±    0.669  ns/op
VectorCopySegments.copyWithVectorShuffle       1048576  avgt  10  29730.528 ±  339.100  ns/op
VectorCopySegments.copyWithVectorToArray          1024  avgt  10     29.282 ±    0.338  ns/op
VectorCopySegments.copyWithVectorToArray       1048576  avgt  10  28502.004 ±  593.347  ns/op
VectorCopySegments.copyWithVectorUnroller         1024  avgt  10     36.368 ±    0.092  ns/op
VectorCopySegments.copyWithVectorUnroller      1048576  avgt  10  21528.433 ±  303.141  ns/op
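For readers unfamiliar with what "unrolling" buys here, below is a rough sketch of a plain copy loop next to a manually 4x-unrolled one. It is only an illustration in plain Java: the class name, the `LANES` width, and `copyChunk` (a stand-in for one vector load plus store) are all hypothetical and are not taken from the benchmark or from the commit.

```java
import java.util.Arrays;

class UnrollSketch {
    // Assumed vector width in bytes; purely illustrative.
    static final int LANES = 32;

    // Hypothetical stand-in for one vector load followed by one vector store.
    static void copyChunk(byte[] src, byte[] dst, int offset) {
        System.arraycopy(src, offset, dst, offset, LANES);
    }

    // One chunk per iteration, then a scalar tail for the leftover bytes.
    static void copySimple(byte[] src, byte[] dst) {
        int bound = src.length - src.length % LANES;
        int i = 0;
        for (; i < bound; i += LANES) {
            copyChunk(src, dst, i);
        }
        for (; i < src.length; i++) {
            dst[i] = src[i];
        }
    }

    // Four chunks per iteration: fewer loop-control checks per byte copied.
    static void copyUnrolled4(byte[] src, byte[] dst) {
        int step = 4 * LANES;
        int bound = src.length - src.length % step;
        int i = 0;
        for (; i < bound; i += step) {
            copyChunk(src, dst, i);
            copyChunk(src, dst, i + LANES);
            copyChunk(src, dst, i + 2 * LANES);
            copyChunk(src, dst, i + 3 * LANES);
        }
        // Scalar tail handles whatever the unrolled loop did not cover.
        for (; i < src.length; i++) {
            dst[i] = src[i];
        }
    }

    public static void main(String[] args) {
        byte[] src = new byte[1000];
        for (int i = 0; i < src.length; i++) {
            src[i] = (byte) i;
        }
        byte[] a = new byte[src.length];
        byte[] b = new byte[src.length];
        copySimple(src, a);
        copyUnrolled4(src, b);
        System.out.println(Arrays.equals(a, b));
    }
}
```

Normally C2 performs this transformation itself. As far as I understand the commit, the point is that the barriers were preventing it from doing so.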
Kind regards,
Rado
________________________________
From: panama-dev <panama-dev-retn at openjdk.java.net> on behalf of Radosław Smogura <mail at smogura.eu>
Sent: Saturday, 19 June 2021 01:12
To: Paul Sandoz <paul.sandoz at oracle.com>; Maurizio Cimadamore <maurizio.cimadamore at oracle.com>
Cc: panama-dev at openjdk.java.net <panama-dev at openjdk.java.net>
Subject: Re: Foreign + Vectors - benchmarks for copying and swapping
Hi all,
There is one more interesting thing. When I change loopBound to
VectorIntrinsics.roundDown(length, laneCount) - (laneCount - 1)
(I think it optimizes better), I get the results below. Please note the drop in average time from iteration 6 onward; this is for the 1M size.
# Warmup Iteration 2: 47990.588 ns/op
# Warmup Iteration 3: 46073.341 ns/op
# Warmup Iteration 4: 45593.405 ns/op
# Warmup Iteration 5: 45525.001 ns/op
Iteration 1: 45921.159 ns/op
Iteration 2: 46542.631 ns/op
Iteration 3: 45532.379 ns/op
Iteration 4: 46862.923 ns/op
Iteration 5: 49324.919 ns/op
Iteration 6: 34099.454 ns/op
Iteration 7: 22315.402 ns/op
Iteration 8: 22495.426 ns/op
Iteration 9: 22702.834 ns/op
Iteration 10: 22675.853 ns/op
Result "org.openjdk.bench.jdk.incubator.foreign.VectorCopySegments.copyWithVectorBuff":
35847.298 ±(99.9%) 18332.841 ns/op [Average]
(min, avg, max) = (22315.402, 35847.298, 49324.919), stdev = 12126.039
CI (99.9%): [17514.457, 54180.139] (assumes normal distribution)
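To make the loopBound change above concrete, here is a small sketch of the arithmetic. It assumes the unmodified bound is simply roundDown(length, laneCount) and that the loop has the usual `i < bound; i += laneCount` shape. `roundDown` below is a hypothetical plain-Java stand-in, not the JDK's VectorIntrinsics code. Under those assumptions both bounds give the same number of full-vector iterations, so the change only alters the shape of the bound the compiler sees.

```java
class LoopBoundSketch {
    // Hypothetical stand-in: largest multiple of laneCount that is <= length.
    static int roundDown(int length, int laneCount) {
        return length - Math.floorMod(length, laneCount);
    }

    // Assumed original bound.
    static int originalBound(int length, int laneCount) {
        return roundDown(length, laneCount);
    }

    // Bound from the mail: roundDown(length, laneCount) - (laneCount - 1).
    static int adjustedBound(int length, int laneCount) {
        return roundDown(length, laneCount) - (laneCount - 1);
    }

    // Counts iterations of `for (i = 0; i < bound; i += laneCount)`.
    static int iterations(int bound, int laneCount) {
        int count = 0;
        for (int i = 0; i < bound; i += laneCount) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        int laneCount = 32; // assumed lane count, for illustration only
        for (int length : new int[] {0, 1, 31, 32, 33, 1024, 1048576}) {
            int a = iterations(originalBound(length, laneCount), laneCount);
            int b = iterations(adjustedBound(length, laneCount), laneCount);
            System.out.println(length + ": " + a + " vs " + b);
        }
    }
}
```

The reason is that i is always a multiple of laneCount, so `i < R - (laneCount - 1)` and `i < R` hold for exactly the same values of i when R is itself a multiple of laneCount.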
Kind regards,
Rado
________________________________
From: Paul Sandoz <paul.sandoz at oracle.com>
Sent: Friday, 18 June 2021 23:37
To: Maurizio Cimadamore <maurizio.cimadamore at oracle.com>
Cc: Radosław Smogura <mail at smogura.eu>; panama-dev at openjdk.java.net <panama-dev at openjdk.java.net>
Subject: Re: Foreign + Vectors - benchmarks for copying and swapping
> On Jun 18, 2021, at 2:03 PM, Maurizio Cimadamore <Maurizio.Cimadamore at Oracle.COM> wrote:
>
>
> On 18/06/2021 20:55, Paul Sandoz wrote:
>> The order declared in the vector load/store overrides any order declared on the buffer (we should make the specification clearer in that respect). (In this case the source is bytes, so there is no swapping.)
> Doh - right!
>>
>> —
>>
>> There is something odd going on when tiered compilation is switched off: the result for copyWithVector is much worse for smaller sizes.
>
> Is this what Uwe is seeing I wonder?
>
> https://github.com/apache/lucene/pull/177#issuecomment-861265227
>
Possibly.
>>
>> With larger sizes, with and without tiered, similar results are observed with similar generated code (of lower quality than with tiered for smaller sizes, oddly enough).
>>
>> Whether tiered is enabled or not there is no loop unrolling.
>>
>> I think something may have regressed, although we have previously focused more on array access than buffer access.
> Is the vector implementation performing a bulk copy into a byte array IIRC? If so, maybe there's an issue with bulk copy - which would be the same issue we're seeing on the memory access front?
No, the intrinsic byte vector access to a byte buffer works similarly to intrinsic byte vector access to a byte array, using the buffer’s base and offset (to calculate the address relative to the base).
Paul.