RE: Foreign + Vectors - benchmarks for copying and swapping

Mon Jun 21 01:48:59 UTC 2021

Hi,

I think I found what's going on.

https://github.com/rsmogura/panama-foreign/commit/64bf7a66a6c3cf5c4ee4f3d0b0e29128bc8e1321
Commit: "Remove mem barriers after vector ops to increase performance". Removing the memory barriers increases performance by allowing the loop to unroll.

Another approach would be to modify the loop optimizations to find the back control through a chain of projections and mem barriers (or maybe do both?).

This change makes the copy much better (and even faster than the native one in the unrolled case):

VectorCopySegments.copyWithNative                 1024  avgt   10     20.293 ±    0.436  ns/op
VectorCopySegments.copyWithNative              1048576  avgt   10  22270.840 ±  579.533  ns/op
VectorCopySegments.copyWithNativeShared           1024  avgt   10     15.854 ±    0.061  ns/op
VectorCopySegments.copyWithNativeShared        1048576  avgt   10  21948.236 ±   43.981  ns/op
VectorCopySegments.copyWithNativeToArray          1024  avgt   10     20.318 ±    0.347  ns/op
VectorCopySegments.copyWithNativeToArray       1048576  avgt   10  22142.499 ±  305.501  ns/op
VectorCopySegments.copyWithVector                 1024  avgt   10     31.240 ±    0.333  ns/op
VectorCopySegments.copyWithVector              1048576  avgt   10  25320.898 ±  118.397  ns/op
VectorCopySegments.copyWithVectorDirectBuffer     1024  avgt   10     21.605 ±    0.210  ns/op
VectorCopySegments.copyWithVectorDirectBuffer  1048576  avgt   10  23613.272 ± 1030.153  ns/op
VectorCopySegments.copyWithVectorShared           1024  avgt   10     19.897 ±    0.485  ns/op
VectorCopySegments.copyWithVectorShared        1048576  avgt   10  24719.767 ±  453.725  ns/op
VectorCopySegments.copyWithVectorShuffle          1024  avgt   10     36.364 ±    0.669  ns/op
VectorCopySegments.copyWithVectorShuffle       1048576  avgt   10  29730.528 ±  339.100  ns/op
VectorCopySegments.copyWithVectorToArray          1024  avgt   10     29.282 ±    0.338  ns/op
VectorCopySegments.copyWithVectorToArray       1048576  avgt   10  28502.004 ±  593.347  ns/op
VectorCopySegments.copyWithVectorUnroller         1024  avgt   10     36.368 ±    0.092  ns/op
VectorCopySegments.copyWithVectorUnroller      1048576  avgt   10  21528.433 ±  303.141  ns/op
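
For readers unfamiliar with the term, here is a minimal scalar sketch of what "unrolling" a copy loop means. It is not the benchmark's code (the source of copyWithVectorUnroller is not shown in this thread), and the class and method names are hypothetical. The JIT can perform this transformation automatically when it can analyze the loop, which is what the change above is meant to enable.

```java
import java.util.Arrays;

// Hypothetical illustration only; not taken from the VectorCopySegments benchmark.
final class UnrolledCopySketch {
    // Plain loop: one element copied per iteration.
    static void copySimple(long[] src, long[] dst, int length) {
        for (int i = 0; i < length; i++) {
            dst[i] = src[i];
        }
    }

    // Hand-unrolled by 4: fewer loop-control checks per element copied.
    static void copyUnrolled4(long[] src, long[] dst, int length) {
        int bound = length & ~3; // largest multiple of 4 that is <= length
        int i = 0;
        for (; i < bound; i += 4) {
            dst[i] = src[i];
            dst[i + 1] = src[i + 1];
            dst[i + 2] = src[i + 2];
            dst[i + 3] = src[i + 3];
        }
        // Tail loop for the remaining 0..3 elements.
        for (; i < length; i++) {
            dst[i] = src[i];
        }
    }

    public static void main(String[] args) {
        long[] src = new long[1027];
        for (int i = 0; i < src.length; i++) {
            src[i] = i * 31L;
        }
        long[] a = new long[src.length];
        long[] b = new long[src.length];
        copySimple(src, a, src.length);
        copyUnrolled4(src, b, src.length);
        System.out.println(Arrays.equals(a, b)); // both loops copy the same data
    }
}
```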

Kind regards,
Rado

________________________________
From: panama-dev <panama-dev-retn at openjdk.java.net> on behalf of Radosław Smogura <mail at smogura.eu>
Sent: Saturday, 19 June 2021 01:12
To: Paul Sandoz <paul.sandoz at oracle.com>; Maurizio Cimadamore <maurizio.cimadamore at oracle.com>
Cc: panama-dev at openjdk.java.net <panama-dev at openjdk.java.net>
Subject: RE: Foreign + Vectors - benchmarks for copying and swapping

Hi all,

I have one more interesting thing. When I change loopBound to
VectorIntrinsics.roundDown(length, laneCount) - (laneCount - 1)
(I think it's a better optimization),
I get the following results for the 1M size (please look at the drop in average time part-way through the measurement iterations):

# Warmup Iteration   2: 47990.588 ns/op
# Warmup Iteration   3: 46073.341 ns/op
# Warmup Iteration   4: 45593.405 ns/op
# Warmup Iteration   5: 45525.001 ns/op
Iteration   1: 45921.159 ns/op
Iteration   2: 46542.631 ns/op
Iteration   3: 45532.379 ns/op
Iteration   4: 46862.923 ns/op
Iteration   5: 49324.919 ns/op
Iteration   6: 34099.454 ns/op
Iteration   7: 22315.402 ns/op
Iteration   8: 22495.426 ns/op
Iteration   9: 22702.834 ns/op
Iteration  10: 22675.853 ns/op

Result "org.openjdk.bench.jdk.incubator.foreign.VectorCopySegments.copyWithVectorBuff":
  35847.298 ±(99.9%) 18332.841 ns/op [Average]
  (min, avg, max) = (22315.402, 35847.298, 49324.919), stdev = 12126.039
  CI (99.9%): [17514.457, 54180.139] (assumes normal distribution)
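
As a side note, the arithmetic behind the modified bound can be sketched as below. This is an illustration under stated assumptions, not the JDK's implementation: the roundDown helper here is a local stand-in that assumes laneCount is a power of two, and the loop is assumed to start at 0 and step by laneCount. Under those assumptions both bounds admit exactly the same iterations, so any timing difference would have to come from how the JIT compiles the loop rather than from doing less work.

```java
// Hypothetical sketch; names other than loopBound/roundDown/laneCount are made up.
final class LoopBoundSketch {
    // Local stand-in for rounding down to a multiple of laneCount.
    // Assumption: laneCount is a power of two and length >= 0.
    static int roundDown(int length, int laneCount) {
        return length & ~(laneCount - 1);
    }

    // The modified bound from the message above.
    static int tightenedBound(int length, int laneCount) {
        return roundDown(length, laneCount) - (laneCount - 1);
    }

    // Counts iterations of: for (i = 0; i < bound; i += laneCount)
    static int iterations(int bound, int laneCount) {
        int n = 0;
        for (int i = 0; i < bound; i += laneCount) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        int laneCount = 32; // e.g. byte lanes in a 256-bit vector
        int length = 1000;
        System.out.println(roundDown(length, laneCount));      // 992
        System.out.println(tightenedBound(length, laneCount)); // 961
        System.out.println(iterations(roundDown(length, laneCount), laneCount));      // 31
        System.out.println(iterations(tightenedBound(length, laneCount), laneCount)); // 31
    }
}
```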

Kind regards,
Rado
________________________________
From: Paul Sandoz <paul.sandoz at oracle.com>
Sent: Friday, 18 June 2021 23:37
To: Maurizio Cimadamore <maurizio.cimadamore at oracle.com>
Cc: Radosław Smogura <mail at smogura.eu>; panama-dev at openjdk.java.net <panama-dev at openjdk.java.net>
Subject: Re: Foreign + Vectors - benchmarks for copying and swapping

> On Jun 18, 2021, at 2:03 PM, Maurizio Cimadamore <Maurizio.Cimadamore at Oracle.COM> wrote:
>
>
> On 18/06/2021 20:55, Paul Sandoz wrote:
>> The order declared in the vector load/store overrides any order declared on the buffer (should make the specification clearer in that respect). (In this case the source is bytes, so there is no swapping.)
> Doh - right!
>>
>> —
>>
>> There is something odd going on when tiered compilation is switched off, the result for copyWithVector is much worse for smaller sizes.
>
> Is this what Uwe is seeing I wonder?
>
> https://github.com/apache/lucene/pull/177#issuecomment-861265227
>

Possibly.

>>
>> With larger sizes, with and without tiered, similar results are observed with similar generated code (of lower quality than with tiered for smaller sizes, oddly enough).
>>
>> Whether tiered is enabled or not there is no loop unrolling.
>>
>> I think something may have regressed, although we have previously focused more on array access than buffer access.
> Is the vector implementation performing a bulk copy into a byte array IIRC? If so, maybe there's an issue with bulk copy - which would be the same issue we're seeing on the memory access front?

No, the intrinsic byte vector access to a byte buffer works similarly to intrinsic byte vector access to a byte array, using the buffer’s base and offset (to calculate the address relative to the base).

Paul.
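
To make the "base and offset" wording above concrete, here is a small sketch using only the public java.nio API on a heap buffer. It is an illustration, not the JDK-internal intrinsic: the class and method names are hypothetical, and it only covers array-backed buffers (direct buffers have no backing array and are addressed differently).

```java
import java.nio.ByteBuffer;

// Hypothetical illustration of base + offset addressing for a heap ByteBuffer.
final class BaseOffsetSketch {
    // Reads buffer element `index` by going through the backing array (the "base")
    // plus the buffer's offset into that array.
    static byte readViaBase(ByteBuffer buf, int index) {
        byte[] base = buf.array();        // backing array of a heap buffer
        int offset = buf.arrayOffset();   // where buffer index 0 sits in that array
        return base[offset + index];
    }

    public static void main(String[] args) {
        byte[] backing = new byte[64];
        for (int i = 0; i < backing.length; i++) {
            backing[i] = (byte) i;
        }
        // A slice that starts 16 bytes into the backing array.
        ByteBuffer slice = ByteBuffer.wrap(backing, 16, 32).slice();
        System.out.println(slice.arrayOffset());    // 16
        System.out.println(slice.get(5));           // 21
        System.out.println(readViaBase(slice, 5));  // 21
    }
}
```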