RFR: JDK-8270147: Increase stride size allowing unrolling more loops [v6]

Wed Jul 14 00:01:16 UTC 2021

On Tue, 13 Jul 2021 23:48:03 GMT, Radoslaw Smogura <github.com+7535718+rsmogura at openjdk.org> wrote:

>> # Description
>> 
>> Increase the allowed stride size for loop unrolling to the maximum vector size on the runtime platform.
>> 
>> The motivation for this change is the discussion and research on unrolling vector (SIMD) loops. For vector code, the stride size depends on the vector element type and the platform vector size. With AVX256 and `int` elements the stride is 8, and the loop is unrolled. Loops over `short` and `byte` elements (stride 16 and 32), however, are not unrolled:
>> 
>>     for (int i = 0; i < SPECIES.loopBound(longSize); i += SPECIES.length() /* 8 for int, 16 for short */ ) { 
>>       var v = ShortVector.fromByteBuffer(SPECIES, srcBufferHeap, i, ByteOrder.nativeOrder()); 
>>       v.intoByteBuffer(dstBufferHeap, i, ByteOrder.nativeOrder()); 
>>     } 
>> 
>> After this change, the maximum stride that still allows a loop to unroll depends on the maximum size in bytes of the vector registers (AVX256: 32, AVX512: 64, SVE: up to 256).
>> 
>> # Notes
>> The stride size was decreased some time ago: https://github.com/openjdk/panama-foreign/commit/2683d5390bd58683ae13bdd8582127c308d8fd04
>> 
>> The exact reasons for this are not known to me (over-unrolling of some loops?).
>> 
>> Original thread: https://mail.openjdk.java.net/pipermail/panama-dev/2021-June/014310.html
>
> Radoslaw Smogura has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Adding micro benchmarks
>   
>   # Optimized
>   Benchmark                                  (size)  Mode  Cnt       Score     Error  Units
>   TestLoadStoreShort.array                  1048576  avgt   30   20729.206 ± 113.531  ns/op
>   TestLoadStoreShort.array                    16384  avgt   30     274.495 ±  20.187  ns/op
>   TestLoadStoreShort.arrayAdd               1048576  avgt   30   21257.633 ± 212.117  ns/op
>   TestLoadStoreShort.arrayAdd                 16384  avgt   30     261.173 ±   6.402  ns/op
>   TestLoadStoreShort.bufferHeap             1048576  avgt   30   78329.120 ± 222.094  ns/op
>   TestLoadStoreShort.bufferHeap               16384  avgt   30    1200.676 ±  14.305  ns/op
>   TestLoadStoreShort.bufferNative           1048576  avgt   30   78474.449 ± 262.780  ns/op
>   TestLoadStoreShort.bufferNative             16384  avgt   30    1207.160 ±   2.784  ns/op
>   TestLoadStoreShort.bufferNativeAdd        1048576  avgt   30   80076.777 ± 586.137  ns/op
>   TestLoadStoreShort.bufferNativeAdd          16384  avgt   30    1207.525 ±   7.332  ns/op
>   TestLoadStoreShort.bufferSegmentConfined  1048576  avgt   30  100749.706 ± 591.570  ns/op
>   TestLoadStoreShort.bufferSegmentConfined    16384  avgt   30    1113.044 ±   7.862  ns/op
>   TestLoadStoreShort.bufferSegmentImplicit  1048576  avgt   30  112926.546 ± 460.734  ns/op
>   TestLoadStoreShort.bufferSegmentImplicit    16384  avgt   30    1712.764 ±   9.556  ns/op
>   TestLoadStoreShort.vectAdd1               1048576  avgt   30   60954.285 ± 643.489  ns/op
>   TestLoadStoreShort.vectAdd1                 16384  avgt   30     783.505 ±  47.268  ns/op
>   TestLoadStoreShort.vectAdd2               1048576  avgt   30   62970.011 ± 392.856  ns/op
>   TestLoadStoreShort.vectAdd2                 16384  avgt   30     818.670 ±  37.000  ns/op
>   
>   Benchmark                                  (size)  Mode  Cnt       Score      Error  Units
>   TestLoadStoreBytes.array                  1048576  avgt   30   25628.013 ±  585.134  ns/op
>   TestLoadStoreBytes.array                    16384  avgt   30     313.763 ±    4.118  ns/op
>   TestLoadStoreBytes.array2                 1048576  avgt   30   28210.376 ±  889.006  ns/op
>   TestLoadStoreBytes.array2                   16384  avgt   30     374.070 ±    3.979  ns/op
>   TestLoadStoreBytes.arrayAdd               1048576  avgt   30   26766.715 ±  569.497  ns/op
>   TestLoadStoreBytes.arrayAdd                 16384  avgt   30     356.223 ±    5.461  ns/op
>   TestLoadStoreBytes.arrayScalar            1048576  avgt   30   21411.246 ±  215.435  ns/op
>   TestLoadStoreBytes.arrayScalar              16384  avgt   30     202.638 ±    2.371  ns/op
>   TestLoadStoreBytes.bufferHeap             1048576  avgt   30   85093.456 ±  141.605  ns/op
>   TestLoadStoreBytes.bufferHeap               16384  avgt   30    1452.955 ±  181.239  ns/op
>   TestLoadStoreBytes.bufferHeapScalar       1048576  avgt   30  239887.128 ± 1157.807  ns/op
>   TestLoadStoreBytes.bufferHeapScalar         16384  avgt   30    3726.556 ±   14.778  ns/op
>   TestLoadStoreBytes.bufferNative           1048576  avgt   30   89906.578 ± 4178.711  ns/op
>   TestLoadStoreBytes.bufferNative             16384  avgt   30    1320.245 ±    5.761  ns/op
>   TestLoadStoreBytes.bufferNativeScalar     1048576  avgt   30  242911.915 ± 1036.925  ns/op
>   TestLoadStoreBytes.bufferNativeScalar       16384  avgt   30    3784.892 ±    9.545  ns/op
>   TestLoadStoreBytes.bufferSegmentConfined  1048576  avgt   30  112232.229 ±  333.270  ns/op
>   TestLoadStoreBytes.bufferSegmentConfined    16384  avgt   30    1717.749 ±  175.997  ns/op
>   TestLoadStoreBytes.bufferSegmentImplicit  1048576  avgt   30  116308.291 ±  771.860  ns/op
>   TestLoadStoreBytes.bufferSegmentImplicit    16384  avgt   30    1692.686 ±    7.616  ns/op
>   TestLoadStoreBytes.segmentImplicitScalar  1048576  avgt   30  733283.905 ± 3691.582  ns/op
>   TestLoadStoreBytes.segmentImplicitScalar    16384  avgt   30   11440.098 ±   55.731  ns/op
>   TestLoadStoreBytes.vectAdd1               1048576  avgt   30   34902.208 ±  639.553  ns/op
>   TestLoadStoreBytes.vectAdd1                 16384  avgt   30     542.248 ±   30.560  ns/op
>   TestLoadStoreBytes.vectAdd2               1048576  avgt   30   36448.084 ± 1032.608  ns/op
>   TestLoadStoreBytes.vectAdd2                 16384  avgt   30     509.069 ±   12.677  ns/op
>   
>   # Max stride 8
>   
>   Benchmark                                  (size)  Mode  Cnt       Score     Error  Units
>   TestLoadStoreShort.array                  1048576  avgt   30   21924.266 ± 260.754  ns/op
>   TestLoadStoreShort.array                    16384  avgt   30     308.362 ±  24.404  ns/op
>   TestLoadStoreShort.arrayAdd               1048576  avgt   30   21600.363 ± 284.365  ns/op
>   TestLoadStoreShort.arrayAdd                 16384  avgt   30     262.476 ±   3.419  ns/op
>   TestLoadStoreShort.bufferHeap             1048576  avgt   30   77870.222 ± 506.600  ns/op
>   TestLoadStoreShort.bufferHeap               16384  avgt   30    1162.587 ±   6.296  ns/op
>   TestLoadStoreShort.bufferNative           1048576  avgt   30   79973.889 ± 676.345  ns/op
>   TestLoadStoreShort.bufferNative             16384  avgt   30    1210.141 ±  11.058  ns/op
>   TestLoadStoreShort.bufferNativeAdd        1048576  avgt   30   79608.287 ± 552.371  ns/op
>   TestLoadStoreShort.bufferNativeAdd          16384  avgt   30    1215.755 ±   3.436  ns/op
>   TestLoadStoreShort.bufferSegmentConfined  1048576  avgt   30  100683.242 ± 553.136  ns/op
>   TestLoadStoreShort.bufferSegmentConfined    16384  avgt   30    1205.342 ±  51.870  ns/op
>   TestLoadStoreShort.bufferSegmentImplicit  1048576  avgt   30  112555.011 ± 542.466  ns/op
>   TestLoadStoreShort.bufferSegmentImplicit    16384  avgt   30    1738.978 ±  44.425  ns/op
>   TestLoadStoreShort.vectAdd1               1048576  avgt   30   62262.555 ± 531.741  ns/op
>   TestLoadStoreShort.vectAdd1                 16384  avgt   30     840.467 ±  21.841  ns/op
>   TestLoadStoreShort.vectAdd2               1048576  avgt   30   62643.137 ± 727.039  ns/op
>   TestLoadStoreShort.vectAdd2                 16384  avgt   30     798.146 ±  64.926  ns/op
>   
>   Benchmark                                  (size)  Mode  Cnt       Score      Error  Units
>   TestLoadStoreBytes.array                  1048576  avgt   30   28146.073 ±  655.025  ns/op
>   TestLoadStoreBytes.array                    16384  avgt   30     374.979 ±    5.568  ns/op
>   TestLoadStoreBytes.array2                 1048576  avgt   30   29526.235 ±  643.623  ns/op
>   TestLoadStoreBytes.array2                   16384  avgt   30     372.197 ±    2.318  ns/op
>   TestLoadStoreBytes.arrayAdd               1048576  avgt   30   29102.706 ±  337.768  ns/op
>   TestLoadStoreBytes.arrayAdd                 16384  avgt   30     371.534 ±    5.630  ns/op
>   TestLoadStoreBytes.arrayScalar            1048576  avgt   30   21157.252 ±  153.367  ns/op
>   TestLoadStoreBytes.arrayScalar              16384  avgt   30     198.908 ±    1.664  ns/op
>   TestLoadStoreBytes.bufferHeap             1048576  avgt   30   85498.846 ±  401.317  ns/op
>   TestLoadStoreBytes.bufferHeap               16384  avgt   30    1285.696 ±    7.873  ns/op
>   TestLoadStoreBytes.bufferHeapScalar       1048576  avgt   30  240052.206 ± 1020.145  ns/op
>   TestLoadStoreBytes.bufferHeapScalar         16384  avgt   30    3752.597 ±   12.535  ns/op
>   TestLoadStoreBytes.bufferNative           1048576  avgt   30   85093.972 ±  244.327  ns/op
>   TestLoadStoreBytes.bufferNative             16384  avgt   30    1296.797 ±    6.493  ns/op
>   TestLoadStoreBytes.bufferNativeScalar     1048576  avgt   30  238522.752 ±  571.675  ns/op
>   TestLoadStoreBytes.bufferNativeScalar       16384  avgt   30    3713.942 ±   13.707  ns/op
>   TestLoadStoreBytes.bufferSegmentConfined  1048576  avgt   30  109515.096 ±  536.842  ns/op
>   TestLoadStoreBytes.bufferSegmentConfined    16384  avgt   30    1444.239 ±   76.509  ns/op
>   TestLoadStoreBytes.bufferSegmentImplicit  1048576  avgt   30  114683.710 ±  763.426  ns/op
>   TestLoadStoreBytes.bufferSegmentImplicit    16384  avgt   30    1710.655 ±   68.046  ns/op
>   TestLoadStoreBytes.segmentImplicitScalar  1048576  avgt   30  722084.119 ± 4226.589  ns/op
>   TestLoadStoreBytes.segmentImplicitScalar    16384  avgt   30   11511.461 ±  418.509  ns/op
>   TestLoadStoreBytes.vectAdd1               1048576  avgt   30   36393.430 ±  865.503  ns/op
>   TestLoadStoreBytes.vectAdd1                 16384  avgt   30     712.802 ±    5.025  ns/op
>   TestLoadStoreBytes.vectAdd2               1048576  avgt   30   36823.597 ±  841.554  ns/op
>   TestLoadStoreBytes.vectAdd2                 16384  avgt   30     500.464 ±    4.657  ns/op

Thank you, Vladimir.

Benchmarks added.

> Good. I am testing v04.
> 
> Do you have benchmark to verify the fix (vectorization for `byte` and `short` arrays)? Consider adding it to `test/micro/org/openjdk/bench/jdk/incubator/foreign/`
> 
> You need second review.

I pushed the benchmarks in a new commit (I can't take much credit for them). As a side note, the ByteBuffer improvements are covered in a separate question / PR.

I hope this change will not unroll loops that should not get unrolled.
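
For reference, the stride arithmetic from the PR description can be sketched in plain Java. This is only an illustration, not the C2 code: the class `StrideSketch` and the helpers `strideFor` and `withinLimit` are hypothetical names, the old limit of 8 is taken from the "Max stride 8" baseline label, and the check models only the stride limit, not the other unrolling heuristics.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class StrideSketch {

    // Hypothetical helper: number of elements (lanes) in one vector, which is
    // also the stride of a loop that steps by SPECIES.length().
    static int strideFor(int vectorBytes, int elementBytes) {
        if (vectorBytes <= 0 || elementBytes <= 0 || vectorBytes % elementBytes != 0) {
            throw new IllegalArgumentException(
                "vector size must be a positive multiple of the element size");
        }
        return vectorBytes / elementBytes;
    }

    // Hypothetical check: the stride must not exceed the allowed maximum for
    // the loop to remain an unrolling candidate (assumed model of the limit).
    static boolean withinLimit(int stride, int maxStride) {
        return stride <= maxStride;
    }

    public static void main(String[] args) {
        int vectorBytes = 32;            // AVX256 register size in bytes
        int oldMaxStride = 8;            // assumed from the "Max stride 8" baseline
        int newMaxStride = vectorBytes;  // limit tied to the vector size in bytes

        Map<String, Integer> elementBytes = new LinkedHashMap<>();
        elementBytes.put("int", Integer.BYTES);
        elementBytes.put("short", Short.BYTES);
        elementBytes.put("byte", Byte.BYTES);

        for (Map.Entry<String, Integer> e : elementBytes.entrySet()) {
            int stride = strideFor(vectorBytes, e.getValue());
            System.out.printf("%-5s stride=%2d  old limit ok=%b  new limit ok=%b%n",
                e.getKey(), stride,
                withinLimit(stride, oldMaxStride),
                withinLimit(stride, newMaxStride));
        }
    }
}
```

Under these assumptions only the `int` loop fits the old limit, while the `short` and `byte` loops (stride 16 and 32) fit once the limit follows the vector size.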

Thank you for the review, and sorry for the issues with the Windows build.

-------------

PR: https://git.openjdk.java.net/jdk/pull/4658