Re: Foreign + Vectors - benchmarks for copying and swapping
Radosław Smogura
mail at smogura.eu
Thu Jun 17 22:14:34 UTC 2021
Hi Maurizio & Paul,
I pushed the updated benchmarks and added tests with different copy sizes.
The direct byte buffer gives higher times (I guess because the size is not a compile-time constant); I can't find a good way to make the tests similar. I think the higher times for the shared buffer may be the same case.
As for endianness, I'm not sure how to check it, but it may not be the cause: I try to use native order and operate on bytes.
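One way to check the endianness question is to compare the buffer's order with the platform's native order. The sketch below uses only java.nio; the class and method names (ByteOrderCheck, isNativeOrder) are made up for illustration and are not part of the benchmark.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class ByteOrderCheck {
    // Hypothetical helper: true when the buffer's current byte order
    // matches the native order of the platform the JVM runs on.
    static boolean isNativeOrder(ByteBuffer buffer) {
        return buffer.order() == ByteOrder.nativeOrder();
    }

    public static void main(String[] args) {
        ByteBuffer direct = ByteBuffer.allocateDirect(16);
        // A freshly allocated byte buffer always starts as BIG_ENDIAN.
        System.out.println("default order: " + direct.order());
        System.out.println("native order:  " + ByteOrder.nativeOrder());
        System.out.println("matches native: " + isNativeOrder(direct));

        // Switching the buffer to native order makes the check pass on any platform.
        direct.order(ByteOrder.nativeOrder());
        System.out.println("after order(nativeOrder()): " + isNativeOrder(direct));
    }
}
```

Since the copy operates on single bytes, a mismatch here would probably not change the copied data, but it could still affect which code path is taken.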
Additionally,
in the copyWithVectorUnroller test, when I uncomment these lines before entering the loop
final var v = ByteVector.fromByteBuffer(BYTE_VECTOR_SPECIES, src, i + 0 * lanes, ByteOrder.nativeOrder());
v.intoByteBuffer(dst, i + 0 * lanes, ByteOrder.nativeOrder());
the results are much better for the 1M data copy case:
Benchmark                                   (size)  Mode  Cnt      Score     Error  Units
VectorCopySegments.copyWithVectorUnroller     1024  avgt   10    168.602 ±   1.260  ns/op
VectorCopySegments.copyWithVectorUnroller  1048576  avgt   10  60381.901 ± 374.981  ns/op
(A missed loop peeling because the loop is too big?)
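To make the shape of that change concrete, here is a scalar sketch of a copy loop whose first iteration is peeled by hand. It is only an analogy: it uses 8-byte getLong/putLong accesses as a stand-in for the vector loads and stores, and the names (PeeledCopy, copyPeeled, CHUNK) are invented for illustration rather than taken from the benchmark.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class PeeledCopy {
    // Stand-in for the vector length in bytes (assumption: 8-byte chunks).
    static final int CHUNK = Long.BYTES;

    // Copies `size` bytes from src to dst using absolute accesses.
    // Assumes size is a multiple of CHUNK and fits in both buffers.
    static void copyPeeled(ByteBuffer src, ByteBuffer dst, int size) {
        if (size == 0) {
            return;
        }
        // Work on duplicates so the callers' byte order is left untouched.
        ByteBuffer s = src.duplicate().order(ByteOrder.nativeOrder());
        ByteBuffer d = dst.duplicate().order(ByteOrder.nativeOrder());

        // Peeled first iteration: done once, before entering the loop.
        d.putLong(0, s.getLong(0));

        // Remaining iterations start at the second chunk.
        for (int i = CHUNK; i < size; i += CHUNK) {
            d.putLong(i, s.getLong(i));
        }
    }

    public static void main(String[] args) {
        ByteBuffer src = ByteBuffer.allocateDirect(1024);
        ByteBuffer dst = ByteBuffer.allocateDirect(1024);
        for (int i = 0; i < 1024; i++) {
            src.put(i, (byte) i);
        }
        copyPeeled(src, dst, 1024);
        System.out.println("last byte copied: " + (dst.get(1023) == src.get(1023)));
    }
}
```

Whether the JIT performs this peeling on its own for the vector loop is exactly the open question; only the generated code would tell.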
And it's a pity that shuffle can't reverse the byte order within an int in a performant way...
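For reference, the byte reversal in question can be written as an index pattern over byte lanes: 3,2,1,0, 7,6,5,4, and so on. The scalar sketch below builds that pattern and checks it against Integer.reverseBytes; the class and method names are hypothetical, and it deliberately avoids the incubating Vector API so that it runs on any JDK.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class ByteSwapSketch {
    // Index pattern that reverses the bytes inside each 4-byte group.
    // Assumes `lanes` is a multiple of 4.
    static int[] swapIndexes(int lanes) {
        int[] idx = new int[lanes];
        for (int j = 0; j < lanes; j++) {
            idx[j] = (j & ~3) | (3 - (j & 3));
        }
        return idx;
    }

    // Scalar reference permutation: out[j] = in[idx[j]].
    static byte[] permute(byte[] in, int[] idx) {
        byte[] out = new byte[in.length];
        for (int j = 0; j < in.length; j++) {
            out[j] = in[idx[j]];
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] data = {1, 2, 3, 4, 5, 6, 7, 8};
        byte[] swapped = permute(data, swapIndexes(data.length));

        // Read the first int from both arrays with the same byte order and
        // compare against the JDK's scalar byte swap.
        int before = ByteBuffer.wrap(data).order(ByteOrder.BIG_ENDIAN).getInt(0);
        int after = ByteBuffer.wrap(swapped).order(ByteOrder.BIG_ENDIAN).getInt(0);
        System.out.println(after == Integer.reverseBytes(before));
    }
}
```

An index array of this shape is what a byte shuffle would need as its source indexes; the scalar version is only meant to pin down the expected result.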
---
Benchmark                                      (size)  Mode  Cnt      Score      Error  Units
VectorCopySegments.copyWithNative                1024  avgt   10     20.098 ±    0.392  ns/op
VectorCopySegments.copyWithNative             1048576  avgt   10  22211.497 ±  257.794  ns/op
VectorCopySegments.copyWithNativeShared          1024  avgt   10     15.728 ±    0.133  ns/op
VectorCopySegments.copyWithNativeShared       1048576  avgt   10  22066.399 ±  214.858  ns/op
VectorCopySegments.copyWithNativeToArray         1024  avgt   10     20.298 ±    0.110  ns/op
VectorCopySegments.copyWithNativeToArray      1048576  avgt   10  21862.090 ±  268.311  ns/op
VectorCopySegments.copyWithVector                1024  avgt   10     30.841 ±    0.119  ns/op
VectorCopySegments.copyWithVector             1048576  avgt   10  44834.639 ±  746.965  ns/op
VectorCopySegments.copyWithVectorDirectBuffer    1024  avgt   10     44.977 ±    0.571  ns/op
VectorCopySegments.copyWithVectorDirectBuffer 1048576  avgt   10  49211.524 ± 2602.650  ns/op
VectorCopySegments.copyWithVectorShared          1024  avgt   10     56.082 ±    0.619  ns/op
VectorCopySegments.copyWithVectorShared       1048576  avgt   10  62226.674 ± 1690.624  ns/op
VectorCopySegments.copyWithVectorShuffle         1024  avgt   10     47.797 ±    0.672  ns/op
VectorCopySegments.copyWithVectorShuffle      1048576  avgt   10  61171.416 ± 5656.479  ns/op
VectorCopySegments.copyWithVectorToArray         1024  avgt   10     31.070 ±    0.736  ns/op
VectorCopySegments.copyWithVectorToArray      1048576  avgt   10  38653.328 ±  489.896  ns/op
VectorCopySegments.copyWithVectorUnroller        1024  avgt   10     49.522 ±    1.027  ns/op
VectorCopySegments.copyWithVectorUnroller     1048576  avgt   10  72145.653 ±  987.111  ns/op
Kind regards,
Rado
________________________________
From: Maurizio Cimadamore <maurizio.cimadamore at oracle.com>
Sent: Thursday, 17 June 2021 18:12
To: Paul Sandoz <paul.sandoz at oracle.com>; Radosław Smogura <mail at smogura.eu>
Cc: panama-dev at openjdk.java.net <panama-dev at openjdk.java.net>
Subject: Re: Foreign + Vectors - benchmarks for copying and swapping
Would be interesting to see if passing in regular byte buffers (e.g. not
derived from segments) improves things. A regular byte buffer should not
have any liveness check, so the overhead might be somewhat lower,
although it seems, as Paul says, that the benchmark is affected by
non-optimized bound checks.
Also - maybe it's a silly comment - but did you double check the
endianness of the returned buffer? The memory segment API returns
BIG_ENDIAN buffers (as is the case for buffers allocated with
ByteBuffer.allocateDirect). Is it possible you are using mismatched
endianness?
Maurizio
On 17/06/2021 17:00, Paul Sandoz wrote:
> Hi Rado,
>
> Thanks, an interesting experiment.
>
> We would need to look at the generated code to spot issues. It is hard to compete with a specialized and highly optimized intrinsic copy. Still, it seems we should be able to do better.
>
> I suspect there might be some un-hoisted bounds checks, or non-optimal addressing of loads/stores.
>
> Something odd going on with shared access.
>
> I doubt the use of shuffle can compete in general with the byte swapping in the intrinsic copy, but we are still ironing out performance issues with shuffle, so maybe there is some room for improvement. (Further, we don’t do anything special for certain constant shuffle patterns from which we might be able to select more optimal instructions.)
>
> Paul.
>
>> On Jun 16, 2021, at 5:15 PM, Radosław Smogura <mail at smogura.eu> wrote:
>>
>> Hi all,
>>
>> I could not stop myself from trying this simple experiment of gluing together foreign with vectors, at least via byte buffers for now.
>>
>> The results are not the best, but they could still be interesting, as there was some interest in this.
>>
>> Please find the results and the link to the benchmark below:
>>
>> Benchmark                                 Mode  Cnt   Score   Error  Units
>> VectorCopySegments.copyWithNative         avgt   10  20.987 ± 1.819  ns/op
>> VectorCopySegments.copyWithNativeShared   avgt   10  12.528 ± 0.183  ns/op
>> VectorCopySegments.copyWithNativeToArray  avgt   10  19.800 ± 3.985  ns/op
>> VectorCopySegments.copyWithVector         avgt   10  31.151 ± 1.929  ns/op
>> VectorCopySegments.copyWithVectorShared   avgt   10  56.752 ± 1.754  ns/op
>> VectorCopySegments.copyWithVectorShuffle  avgt   10  52.409 ± 0.390  ns/op
>> VectorCopySegments.copyWithVectorToArray  avgt   10  29.573 ± 0.485  ns/op
>>
>> https://github.com/rsmogura/panama-foreign/blob/foreign_and_vectors/test/micro/org/openjdk/bench/jdk/incubator/foreign/VectorCopySegments.java
>>
>> Feedback is welcome.
>>
>> Kind regards,
>> Rado