RFR: JDK-8270147: Increase stride size allowing unrolling more loops [v6]
Radoslaw Smogura
github.com+7535718+rsmogura at openjdk.java.net
Wed Jul 14 00:01:16 UTC 2021
On Tue, 13 Jul 2021 23:48:03 GMT, Radoslaw Smogura <github.com+7535718+rsmogura at openjdk.org> wrote:
>> # Description
>>
>> Increase the allowed stride size for loop unrolling to the maximum vector size of the runtime platform.
>>
>> The motivation for this change is the discussion and research about unrolling vector (SIMD) loops. For vector code, the stride size depends on the vector element type and the platform vector size. For AVX256 and int, the stride size is 8 and the loop is unrolled. However, short and byte loops could not be unrolled (stride sizes 16 and 32):
>>
>> for (int i = 0; i < SPECIES.loopBound(longSize); i += SPECIES.length() /* 8 for int, 16 for short */) {
>>     var v = ShortVector.fromByteBuffer(SPECIES, srcBufferHeap, i, ByteOrder.nativeOrder());
>>     v.intoByteBuffer(dstBufferHeap, i, ByteOrder.nativeOrder());
>> }
>>
>> After this change, the maximum stride that still allows loops to unroll will depend on the maximum byte size of the vector registers (AVX256: 32, AVX512: 64, SVE: up to 256).
>>
>> # Notes
>> The stride size was decreased some time ago: https://github.com/openjdk/panama-foreign/commit/2683d5390bd58683ae13bdd8582127c308d8fd04
>>
>> The exact reasons for this are not known to me (over-unrolling of some loops?).
>>
>> Original thread: https://mail.openjdk.java.net/pipermail/panama-dev/2021-June/014310.html
>
> Radoslaw Smogura has updated the pull request incrementally with one additional commit since the last revision:
>
> Adding micro benchmarks
>
> # Optimized
> Benchmark                                  (size)  Mode  Cnt       Score      Error  Units
> TestLoadStoreShort.array                  1048576  avgt   30   20729.206 ±  113.531  ns/op
> TestLoadStoreShort.array                    16384  avgt   30     274.495 ±   20.187  ns/op
> TestLoadStoreShort.arrayAdd               1048576  avgt   30   21257.633 ±  212.117  ns/op
> TestLoadStoreShort.arrayAdd                 16384  avgt   30     261.173 ±    6.402  ns/op
> TestLoadStoreShort.bufferHeap             1048576  avgt   30   78329.120 ±  222.094  ns/op
> TestLoadStoreShort.bufferHeap               16384  avgt   30    1200.676 ±   14.305  ns/op
> TestLoadStoreShort.bufferNative           1048576  avgt   30   78474.449 ±  262.780  ns/op
> TestLoadStoreShort.bufferNative             16384  avgt   30    1207.160 ±    2.784  ns/op
> TestLoadStoreShort.bufferNativeAdd        1048576  avgt   30   80076.777 ±  586.137  ns/op
> TestLoadStoreShort.bufferNativeAdd          16384  avgt   30    1207.525 ±    7.332  ns/op
> TestLoadStoreShort.bufferSegmentConfined  1048576  avgt   30  100749.706 ±  591.570  ns/op
> TestLoadStoreShort.bufferSegmentConfined    16384  avgt   30    1113.044 ±    7.862  ns/op
> TestLoadStoreShort.bufferSegmentImplicit  1048576  avgt   30  112926.546 ±  460.734  ns/op
> TestLoadStoreShort.bufferSegmentImplicit    16384  avgt   30    1712.764 ±    9.556  ns/op
> TestLoadStoreShort.vectAdd1               1048576  avgt   30   60954.285 ±  643.489  ns/op
> TestLoadStoreShort.vectAdd1                 16384  avgt   30     783.505 ±   47.268  ns/op
> TestLoadStoreShort.vectAdd2               1048576  avgt   30   62970.011 ±  392.856  ns/op
> TestLoadStoreShort.vectAdd2                 16384  avgt   30     818.670 ±   37.000  ns/op
>
> Benchmark                                  (size)  Mode  Cnt       Score      Error  Units
> TestLoadStoreBytes.array                  1048576  avgt   30   25628.013 ±  585.134  ns/op
> TestLoadStoreBytes.array                    16384  avgt   30     313.763 ±    4.118  ns/op
> TestLoadStoreBytes.array2                 1048576  avgt   30   28210.376 ±  889.006  ns/op
> TestLoadStoreBytes.array2                   16384  avgt   30     374.070 ±    3.979  ns/op
> TestLoadStoreBytes.arrayAdd               1048576  avgt   30   26766.715 ±  569.497  ns/op
> TestLoadStoreBytes.arrayAdd                 16384  avgt   30     356.223 ±    5.461  ns/op
> TestLoadStoreBytes.arrayScalar            1048576  avgt   30   21411.246 ±  215.435  ns/op
> TestLoadStoreBytes.arrayScalar              16384  avgt   30     202.638 ±    2.371  ns/op
> TestLoadStoreBytes.bufferHeap             1048576  avgt   30   85093.456 ±  141.605  ns/op
> TestLoadStoreBytes.bufferHeap               16384  avgt   30    1452.955 ±  181.239  ns/op
> TestLoadStoreBytes.bufferHeapScalar       1048576  avgt   30  239887.128 ± 1157.807  ns/op
> TestLoadStoreBytes.bufferHeapScalar         16384  avgt   30    3726.556 ±   14.778  ns/op
> TestLoadStoreBytes.bufferNative           1048576  avgt   30   89906.578 ± 4178.711  ns/op
> TestLoadStoreBytes.bufferNative             16384  avgt   30    1320.245 ±    5.761  ns/op
> TestLoadStoreBytes.bufferNativeScalar     1048576  avgt   30  242911.915 ± 1036.925  ns/op
> TestLoadStoreBytes.bufferNativeScalar       16384  avgt   30    3784.892 ±    9.545  ns/op
> TestLoadStoreBytes.bufferSegmentConfined  1048576  avgt   30  112232.229 ±  333.270  ns/op
> TestLoadStoreBytes.bufferSegmentConfined    16384  avgt   30    1717.749 ±  175.997  ns/op
> TestLoadStoreBytes.bufferSegmentImplicit  1048576  avgt   30  116308.291 ±  771.860  ns/op
> TestLoadStoreBytes.bufferSegmentImplicit    16384  avgt   30    1692.686 ±    7.616  ns/op
> TestLoadStoreBytes.segmentImplicitScalar  1048576  avgt   30  733283.905 ± 3691.582  ns/op
> TestLoadStoreBytes.segmentImplicitScalar    16384  avgt   30   11440.098 ±   55.731  ns/op
> TestLoadStoreBytes.vectAdd1               1048576  avgt   30   34902.208 ±  639.553  ns/op
> TestLoadStoreBytes.vectAdd1                 16384  avgt   30     542.248 ±   30.560  ns/op
> TestLoadStoreBytes.vectAdd2               1048576  avgt   30   36448.084 ± 1032.608  ns/op
> TestLoadStoreBytes.vectAdd2                 16384  avgt   30     509.069 ±   12.677  ns/op
>
> # Max stride 8
>
> Benchmark                                  (size)  Mode  Cnt       Score      Error  Units
> TestLoadStoreShort.array                  1048576  avgt   30   21924.266 ±  260.754  ns/op
> TestLoadStoreShort.array                    16384  avgt   30     308.362 ±   24.404  ns/op
> TestLoadStoreShort.arrayAdd               1048576  avgt   30   21600.363 ±  284.365  ns/op
> TestLoadStoreShort.arrayAdd                 16384  avgt   30     262.476 ±    3.419  ns/op
> TestLoadStoreShort.bufferHeap             1048576  avgt   30   77870.222 ±  506.600  ns/op
> TestLoadStoreShort.bufferHeap               16384  avgt   30    1162.587 ±    6.296  ns/op
> TestLoadStoreShort.bufferNative           1048576  avgt   30   79973.889 ±  676.345  ns/op
> TestLoadStoreShort.bufferNative             16384  avgt   30    1210.141 ±   11.058  ns/op
> TestLoadStoreShort.bufferNativeAdd        1048576  avgt   30   79608.287 ±  552.371  ns/op
> TestLoadStoreShort.bufferNativeAdd          16384  avgt   30    1215.755 ±    3.436  ns/op
> TestLoadStoreShort.bufferSegmentConfined  1048576  avgt   30  100683.242 ±  553.136  ns/op
> TestLoadStoreShort.bufferSegmentConfined    16384  avgt   30    1205.342 ±   51.870  ns/op
> TestLoadStoreShort.bufferSegmentImplicit  1048576  avgt   30  112555.011 ±  542.466  ns/op
> TestLoadStoreShort.bufferSegmentImplicit    16384  avgt   30    1738.978 ±   44.425  ns/op
> TestLoadStoreShort.vectAdd1               1048576  avgt   30   62262.555 ±  531.741  ns/op
> TestLoadStoreShort.vectAdd1                 16384  avgt   30     840.467 ±   21.841  ns/op
> TestLoadStoreShort.vectAdd2               1048576  avgt   30   62643.137 ±  727.039  ns/op
> TestLoadStoreShort.vectAdd2                 16384  avgt   30     798.146 ±   64.926  ns/op
>
> Benchmark                                  (size)  Mode  Cnt       Score      Error  Units
> TestLoadStoreBytes.array                  1048576  avgt   30   28146.073 ±  655.025  ns/op
> TestLoadStoreBytes.array                    16384  avgt   30     374.979 ±    5.568  ns/op
> TestLoadStoreBytes.array2                 1048576  avgt   30   29526.235 ±  643.623  ns/op
> TestLoadStoreBytes.array2                   16384  avgt   30     372.197 ±    2.318  ns/op
> TestLoadStoreBytes.arrayAdd               1048576  avgt   30   29102.706 ±  337.768  ns/op
> TestLoadStoreBytes.arrayAdd                 16384  avgt   30     371.534 ±    5.630  ns/op
> TestLoadStoreBytes.arrayScalar            1048576  avgt   30   21157.252 ±  153.367  ns/op
> TestLoadStoreBytes.arrayScalar              16384  avgt   30     198.908 ±    1.664  ns/op
> TestLoadStoreBytes.bufferHeap             1048576  avgt   30   85498.846 ±  401.317  ns/op
> TestLoadStoreBytes.bufferHeap               16384  avgt   30    1285.696 ±    7.873  ns/op
> TestLoadStoreBytes.bufferHeapScalar       1048576  avgt   30  240052.206 ± 1020.145  ns/op
> TestLoadStoreBytes.bufferHeapScalar         16384  avgt   30    3752.597 ±   12.535  ns/op
> TestLoadStoreBytes.bufferNative           1048576  avgt   30   85093.972 ±  244.327  ns/op
> TestLoadStoreBytes.bufferNative             16384  avgt   30    1296.797 ±    6.493  ns/op
> TestLoadStoreBytes.bufferNativeScalar     1048576  avgt   30  238522.752 ±  571.675  ns/op
> TestLoadStoreBytes.bufferNativeScalar       16384  avgt   30    3713.942 ±   13.707  ns/op
> TestLoadStoreBytes.bufferSegmentConfined  1048576  avgt   30  109515.096 ±  536.842  ns/op
> TestLoadStoreBytes.bufferSegmentConfined    16384  avgt   30    1444.239 ±   76.509  ns/op
> TestLoadStoreBytes.bufferSegmentImplicit  1048576  avgt   30  114683.710 ±  763.426  ns/op
> TestLoadStoreBytes.bufferSegmentImplicit    16384  avgt   30    1710.655 ±   68.046  ns/op
> TestLoadStoreBytes.segmentImplicitScalar  1048576  avgt   30  722084.119 ± 4226.589  ns/op
> TestLoadStoreBytes.segmentImplicitScalar    16384  avgt   30   11511.461 ±  418.509  ns/op
> TestLoadStoreBytes.vectAdd1               1048576  avgt   30   36393.430 ±  865.503  ns/op
> TestLoadStoreBytes.vectAdd1                 16384  avgt   30     712.802 ±    5.025  ns/op
> TestLoadStoreBytes.vectAdd2               1048576  avgt   30   36823.597 ±  841.554  ns/op
> TestLoadStoreBytes.vectAdd2                 16384  avgt   30     500.464 ±    4.657  ns/op
Thank you, Vladimir.
Benchmarks added.
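
To read the two runs quoted above side by side, the relative change between a "Max stride 8" score and the matching "Optimized" score can be computed as below. This is only a sketch of the arithmetic: the class and method names are made up for illustration, the scores are copied from the tables, and the result says nothing about statistical significance (the error columns still have to be considered).

```java
public class RelativeChange {
    /** Relative change of `after` versus `before`, in percent; negative means less time per op. */
    static double percentChange(double before, double after) {
        return (after - before) / before * 100.0;
    }

    public static void main(String[] args) {
        // Scores in ns/op, copied from the quoted tables (size 16384).
        // TestLoadStoreBytes.vectAdd1: "Max stride 8" vs "Optimized"
        System.out.printf("vectAdd1:   %.1f%%%n", percentChange(712.802, 542.248));
        // TestLoadStoreBytes.bufferHeap: "Max stride 8" vs "Optimized"
        // (note the large error of 181.239 ns/op on the optimized score)
        System.out.printf("bufferHeap: %.1f%%%n", percentChange(1285.696, 1452.955));
    }
}
```

The two rows were picked to show one result in each direction; any other pair of rows from the tables can be substituted.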
> Good. I am testing v04.
>
> Do you have a benchmark to verify the fix (vectorization for `byte` and `short` arrays)? Consider adding it to `test/micro/org/openjdk/bench/jdk/incubator/foreign/`
>
> You need a second review.
I pushed the benchmarks in a new commit (I can't take much credit for them). As a side note, the ByteBuffer improvements are a separate question / PR.
I hope this change will not unroll loops that should not get unrolled.
Thank you for the review, and sorry for the issues with the Windows build.
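
For reference, the stride sizes mentioned in the description follow from dividing the vector register size by the element size. The sketch below is a simplified model only: `lanes` and `strideAllowed` are hypothetical helpers, and the real C2 unrolling policy involves more conditions than a single comparison against a limit.

```java
public class StrideSketch {
    /** Lanes per vector: register size in bytes divided by element size in bytes. */
    static int lanes(int vectorBytes, int elementBytes) {
        return vectorBytes / elementBytes;
    }

    /** Simplified model (an assumption): a loop may unroll if its stride does not exceed the limit. */
    static boolean strideAllowed(int stride, int maxStride) {
        return stride <= maxStride;
    }

    public static void main(String[] args) {
        int vectorBytes = 32;  // AVX256, as in the description
        String[] names = {"int", "short", "byte"};
        int[] sizes = {Integer.BYTES, Short.BYTES, Byte.BYTES};
        for (int k = 0; k < names.length; k++) {
            int stride = lanes(vectorBytes, sizes[k]);
            System.out.printf("%-5s stride=%2d  limit 8: %b  limit %d: %b%n",
                    names[k], stride,
                    strideAllowed(stride, 8),             // old limit ("Max stride 8")
                    vectorBytes,
                    strideAllowed(stride, vectorBytes));  // limit tied to vector byte size
        }
    }
}
```

Under this model, only the int loop (stride 8) passes the old limit on AVX256, while int, short and byte loops all pass a limit equal to the vector byte size.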
-------------
PR: https://git.openjdk.java.net/jdk/pull/4658