Re: Foreign + Vectors - benchmarks for copying and swapping

Thu Jun 17 22:14:34 UTC 2021

Hi Maurizio & Paul,

I pushed the updated benchmarks and added tests with different copy sizes.

The direct byte buffer gives higher times (I guess because the size is not a compile-time constant), and I can't find a good way to make the tests similar. I think the higher times for the shared buffer may be the same case.

As for endianness, I'm not sure how to check it, but I don't think it is the issue: I try to use native order and operate on bytes.

Additionally,

in the copyWithVectorUnroller test, when I uncomment these lines before entering the loop
      final var v = ByteVector.fromByteBuffer(BYTE_VECTOR_SPECIES, src, i + 0 * lanes, ByteOrder.nativeOrder());
      v.intoByteBuffer(dst, i + 0 * lanes, ByteOrder.nativeOrder());

the results are much better for the 1M data copy case:

Benchmark                                   (size)  Mode  Cnt      Score     Error  Units
VectorCopySegments.copyWithVectorUnroller     1024  avgt   10    168.602 ±   1.260  ns/op
VectorCopySegments.copyWithVectorUnroller  1048576  avgt   10  60381.901 ± 374.981  ns/op

(Is loop peeling missed because the loop is too big?)

And it's a pity that shuffle can't reverse the byte order within an int in a performant way...
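
For comparison, the scalar form of that swap is short. This is only a sketch, not code from the benchmark: the class and method names are made up for illustration, and it assumes the source length is a multiple of 4 and that the destination has room for it.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

// Hypothetical helper, not part of VectorCopySegments.
class SwapCopy {
    /**
     * Copies src into dst, reversing the byte order inside every 4-byte int.
     * Works on duplicates with absolute get/put, so the positions of the
     * caller's buffers are left untouched.
     * Assumption: src.remaining() is a multiple of 4 and dst is large enough.
     */
    static void copySwappingInts(ByteBuffer src, ByteBuffer dst) {
        // Use the same order on both sides; the in-memory result is then
        // byte-reversed regardless of which order that is.
        ByteBuffer in = src.duplicate().order(ByteOrder.nativeOrder());
        ByteBuffer out = dst.duplicate().order(ByteOrder.nativeOrder());
        int inBase = in.position();
        int outBase = out.position();
        int length = in.remaining();
        for (int off = 0; off < length; off += Integer.BYTES) {
            int value = in.getInt(inBase + off);
            out.putInt(outBase + off, Integer.reverseBytes(value));
        }
    }

    public static void main(String[] args) {
        ByteBuffer src = ByteBuffer.wrap(new byte[] {1, 2, 3, 4, 5, 6, 7, 8});
        ByteBuffer dst = ByteBuffer.allocate(8);
        copySwappingInts(src, dst);
        // prints [4, 3, 2, 1, 8, 7, 6, 5]
        System.out.println(Arrays.toString(dst.array()));
    }
}
```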

---
Benchmark                                       (size)  Mode  Cnt      Score      Error  Units
VectorCopySegments.copyWithNative                 1024  avgt   10     20.098 ±    0.392  ns/op
VectorCopySegments.copyWithNative              1048576  avgt   10  22211.497 ±  257.794  ns/op
VectorCopySegments.copyWithNativeShared           1024  avgt   10     15.728 ±    0.133  ns/op
VectorCopySegments.copyWithNativeShared        1048576  avgt   10  22066.399 ±  214.858  ns/op
VectorCopySegments.copyWithNativeToArray          1024  avgt   10     20.298 ±    0.110  ns/op
VectorCopySegments.copyWithNativeToArray       1048576  avgt   10  21862.090 ±  268.311  ns/op
VectorCopySegments.copyWithVector                 1024  avgt   10     30.841 ±    0.119  ns/op
VectorCopySegments.copyWithVector              1048576  avgt   10  44834.639 ±  746.965  ns/op
VectorCopySegments.copyWithVectorDirectBuffer     1024  avgt   10     44.977 ±    0.571  ns/op
VectorCopySegments.copyWithVectorDirectBuffer  1048576  avgt   10  49211.524 ± 2602.650  ns/op
VectorCopySegments.copyWithVectorShared           1024  avgt   10     56.082 ±    0.619  ns/op
VectorCopySegments.copyWithVectorShared        1048576  avgt   10  62226.674 ± 1690.624  ns/op
VectorCopySegments.copyWithVectorShuffle          1024  avgt   10     47.797 ±    0.672  ns/op
VectorCopySegments.copyWithVectorShuffle       1048576  avgt   10  61171.416 ± 5656.479  ns/op
VectorCopySegments.copyWithVectorToArray          1024  avgt   10     31.070 ±    0.736  ns/op
VectorCopySegments.copyWithVectorToArray       1048576  avgt   10  38653.328 ±  489.896  ns/op
VectorCopySegments.copyWithVectorUnroller         1024  avgt   10     49.522 ±    1.027  ns/op
VectorCopySegments.copyWithVectorUnroller      1048576  avgt   10  72145.653 ±  987.111  ns/op

Kind regards,
Rado
________________________________
From: Maurizio Cimadamore <maurizio.cimadamore at oracle.com>
Sent: Thursday, 17 June 2021 18:12
To: Paul Sandoz <paul.sandoz at oracle.com>; Radosław Smogura <mail at smogura.eu>
Cc: panama-dev at openjdk.java.net <panama-dev at openjdk.java.net>
Subject: Re: Foreign + Vectors - benchmarks for copying and swapping

It would be interesting to see if passing in regular byte buffers (e.g. not
derived from segments) improves things. A regular byte buffer should not
have any liveness check, so the overhead might be somewhat lower,
although it seems, as Paul says, that the benchmark is affected by
non-optimized bounds checks.

Also - maybe it's a silly comment - but did you double check the
endianness of the returned buffer? The memory segment API returns
BIG_ENDIAN buffers (as is the case for buffers allocated with
ByteBuffer.allocateDirect). Is it possible you are using mismatched
endianness?
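
A quick way to check is to compare the buffer's order() with ByteOrder.nativeOrder(). The sketch below is illustrative only: the class and helper names are made up, and it uses a buffer from ByteBuffer.allocateDirect rather than one derived from a segment, though the same order() call applies to either.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper for illustration, not part of the benchmark.
class BufferOrderCheck {
    /** True when the buffer's current byte order matches the platform's native order. */
    static boolean isNativeOrder(ByteBuffer buffer) {
        return buffer.order() == ByteOrder.nativeOrder();
    }

    public static void main(String[] args) {
        // A newly allocated byte buffer starts as BIG_ENDIAN, whatever the hardware.
        ByteBuffer direct = ByteBuffer.allocateDirect(1024);
        System.out.println("default order:  " + direct.order());
        System.out.println("native order:   " + ByteOrder.nativeOrder());
        System.out.println("matches native: " + isNativeOrder(direct));

        // order(...) modifies this buffer's byte order and returns the same buffer.
        direct.order(ByteOrder.nativeOrder());
        System.out.println("after order(nativeOrder()): " + isNativeOrder(direct));
    }
}
```

Note that views created with duplicate() or slice() start as BIG_ENDIAN again, so the order has to be set on each view that is actually passed on.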

Maurizio

On 17/06/2021 17:00, Paul Sandoz wrote:
> Hi Rado,
>
> Thanks, an interesting experiment.
>
> We would need to look at the generated code to spot issues. It is hard to compete with a specialized and highly optimized intrinsic copy. Still, it seems we should be able to do better.
>
> I suspect there might be some un-hoisted bounds checks, or non-optimal addressing of loads/stores.
>
> Something odd going on with shared access.
>
> I doubt the use of shuffle can compete in general with the byte swapping in the intrinsic copy, but we are still ironing out performance issues with shuffle so maybe there is some room for improvement. (Further, we don’t do anything special for certain constant shuffle patterns from which we might be able to select more optimal instructions.)
>
> Paul.
>
>> On Jun 16, 2021, at 5:15 PM, Radosław Smogura <mail at smogura.eu> wrote:
>>
>> Hi all,
>>
>> I could not stop myself from trying this simple experiment of gluing foreign together with vectors, at least via byte buffers for now.
>>
>> The results are not the best, but could still be interesting, as there was some interest in this.
>>
>> Please find the results and the link to the benchmark below:
>>
>> Benchmark                                 Mode  Cnt   Score   Error  Units
>> VectorCopySegments.copyWithNative         avgt   10  20.987 ± 1.819  ns/op
>> VectorCopySegments.copyWithNativeShared   avgt   10  12.528 ± 0.183  ns/op
>> VectorCopySegments.copyWithNativeToArray  avgt   10  19.800 ± 3.985  ns/op
>> VectorCopySegments.copyWithVector         avgt   10  31.151 ± 1.929  ns/op
>> VectorCopySegments.copyWithVectorShared   avgt   10  56.752 ± 1.754  ns/op
>> VectorCopySegments.copyWithVectorShuffle  avgt   10  52.409 ± 0.390  ns/op
>> VectorCopySegments.copyWithVectorToArray  avgt   10  29.573 ± 0.485  ns/op
>>
>> https://github.com/rsmogura/panama-foreign/blob/foreign_and_vectors/test/micro/org/openjdk/bench/jdk/incubator/foreign/VectorCopySegments.java
>>
>> Feedback is welcome.
>>
>> Kind regards,
>> Rado