Re: Foreign + Vectors - benchmarks for copying and swapping

Fri Jun 18 23:12:34 UTC 2021

Hi all,

I have one more interesting observation. When I change loopBound to
VectorIntrinsics.roundDown(length, laneCount) - (laneCount - 1)
(which I think optimizes better), I get the results below. Please note the
drop in average time from iteration 6 onward. This is for the 1m size.

# Warmup Iteration   2: 47990.588 ns/op
# Warmup Iteration   3: 46073.341 ns/op
# Warmup Iteration   4: 45593.405 ns/op
# Warmup Iteration   5: 45525.001 ns/op
Iteration   1: 45921.159 ns/op
Iteration   2: 46542.631 ns/op
Iteration   3: 45532.379 ns/op
Iteration   4: 46862.923 ns/op
Iteration   5: 49324.919 ns/op
Iteration   6: 34099.454 ns/op
Iteration   7: 22315.402 ns/op
Iteration   8: 22495.426 ns/op
Iteration   9: 22702.834 ns/op
Iteration  10: 22675.853 ns/op

Result "org.openjdk.bench.jdk.incubator.foreign.VectorCopySegments.copyWithVectorBuff":
  35847.298 ±(99.9%) 18332.841 ns/op [Average]
  (min, avg, max) = (22315.402, 35847.298, 49324.919), stdev = 12126.039
  CI (99.9%): [17514.457, 54180.139] (assumes normal distribution)
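
For reference, here is a minimal plain-Java sketch of the two loop bounds. It assumes the benchmark loop starts at 0 and advances by laneCount, which the message does not show. The names LoopBoundSketch, roundDown, defaultBound, adjustedBound and copy are hypothetical, roundDown is only a stand-in for VectorIntrinsics.roundDown, and System.arraycopy stands in for a vector load/store. Under those assumptions both bounds select exactly the same full-width chunks, so the copied data is identical either way.

```java
import java.util.Arrays;

final class LoopBoundSketch {
    // Hypothetical stand-in for VectorIntrinsics.roundDown: the largest
    // multiple of laneCount that is <= length.
    static int roundDown(int length, int laneCount) {
        return length - Math.floorMod(length, laneCount);
    }

    // Assumed original bound.
    static int defaultBound(int length, int laneCount) {
        return roundDown(length, laneCount);
    }

    // Bound from the message: roundDown(length, laneCount) - (laneCount - 1).
    static int adjustedBound(int length, int laneCount) {
        return roundDown(length, laneCount) - (laneCount - 1);
    }

    // Copies src to dst in laneCount-sized chunks while i < loopBound, then
    // finishes byte by byte. Returns the number of full-width chunks copied.
    static int copy(byte[] src, byte[] dst, int laneCount, int loopBound) {
        int i = 0;
        int chunks = 0;
        for (; i < loopBound; i += laneCount) {
            System.arraycopy(src, i, dst, i, laneCount); // models one vector load/store
            chunks++;
        }
        for (; i < src.length; i++) {
            dst[i] = src[i]; // scalar tail
        }
        return chunks;
    }

    public static void main(String[] args) {
        int laneCount = 32; // assumed: 256-bit byte vectors
        byte[] src = new byte[1_000_003];
        for (int i = 0; i < src.length; i++) {
            src[i] = (byte) (i * 31);
        }
        byte[] a = new byte[src.length];
        byte[] b = new byte[src.length];
        int chunksDefault = copy(src, a, laneCount, defaultBound(src.length, laneCount));
        int chunksAdjusted = copy(src, b, laneCount, adjustedBound(src.length, laneCount));
        System.out.println("chunks (default bound):  " + chunksDefault);
        System.out.println("chunks (adjusted bound): " + chunksAdjusted);
        System.out.println("copies identical: " + (Arrays.equals(src, a) && Arrays.equals(src, b)));
    }
}
```

Because i is always a multiple of laneCount in this loop shape, i < bound and i < bound - (laneCount - 1) hold for the same values of i.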

Kind regards,
Rado
________________________________
From: Paul Sandoz <paul.sandoz at oracle.com>
Sent: Friday, 18 June 2021 23:37
To: Maurizio Cimadamore <maurizio.cimadamore at oracle.com>
Cc: Radosław Smogura <mail at smogura.eu>; panama-dev at openjdk.java.net <panama-dev at openjdk.java.net>
Subject: Re: Foreign + Vectors - benchmarks for copying and swapping

> On Jun 18, 2021, at 2:03 PM, Maurizio Cimadamore <Maurizio.Cimadamore at Oracle.COM> wrote:
>
>
> On 18/06/2021 20:55, Paul Sandoz wrote:
>> The order declared in the vector load/store overrides any order declared on the buffer (should make the specification clearer in that respect). (In this case the source is bytes, so there is no swapping.)
> Doh - right!
>>
>> —
>>
>> There is something odd going on when tiered compilation is switched off: the result for copyWithVector is much worse for smaller sizes.
>
> Is this what Uwe is seeing I wonder?
>
> https://github.com/apache/lucene/pull/177#issuecomment-861265227
>

Possibly.

>>
>> With larger sizes, with and without tiered, similar results are observed with similar generated code (of lower quality than with tiered for smaller sizes, oddly enough).
>>
>> Whether tiered is enabled or not there is no loop unrolling.
>>
>> I think something may have regressed, although we have previously focused more on array access than buffer access.
> Is the vector implementation performing a bulk copy into a byte array IIRC? If so, maybe there's an issue with bulk copy - which would be the same issue we're seeing on the memory access front?

No, the intrinsic byte vector access to a byte buffer works similarly to intrinsic byte vector access to a byte array, using the buffer’s base and offset (to calculate the address relative to the base).
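
As a loose illustration of base-plus-offset addressing, here is a sketch using only the public ByteBuffer API. This is not the intrinsic itself, which is not shown in this thread. The class BufferBaseOffsetSketch and the method readViaBase are hypothetical, and the sketch assumes a heap buffer, where the backing array acts as the base.

```java
import java.nio.ByteBuffer;

final class BufferBaseOffsetSketch {
    // Reads element `index` of a heap buffer directly from its backing array
    // (the "base") at arrayOffset() + index (the "offset").
    // Assumes buffer.hasArray() is true; direct buffers have no backing array.
    static byte readViaBase(ByteBuffer heapBuffer, int index) {
        byte[] base = heapBuffer.array();
        int offset = heapBuffer.arrayOffset() + index;
        return base[offset];
    }

    public static void main(String[] args) {
        byte[] backing = new byte[64];
        for (int i = 0; i < backing.length; i++) {
            backing[i] = (byte) i;
        }
        // A slice whose element 0 is backing[16], so arrayOffset() is 16.
        ByteBuffer slice = ByteBuffer.wrap(backing, 16, 32).slice();
        for (int i = 0; i < slice.capacity(); i++) {
            if (slice.get(i) != readViaBase(slice, i)) {
                throw new AssertionError("mismatch at index " + i);
            }
        }
        System.out.println("arrayOffset = " + slice.arrayOffset());
    }
}
```

The same index arithmetic applies to a plain byte array with an offset of zero, which is the sense in which the two access paths are similar.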

Paul.