RFR: 8266951: Partial in-lining for vectorized mismatch operation using AVX512 masked instructions

Mon May 17 11:44:38 UTC 2021

On Fri, 14 May 2021 19:23:40 GMT, Paul Sandoz <psandoz at openjdk.org> wrote:

>> Hi @PaulSandoz , after removal of java side changes, I still see good gains for small sizes but there is considerable penalty.
>> Will set the threshold to 0, and re-compute the numbers, seek your inputs on adding target specific THRESHOLD. Could not locate any direct public java API or internal jdk API which could be used to fetch target information.
>
> @jatin-bhateja glad the variation is small. 
> If the subsequent results without and with a zero threshold for lengths below the current threshold show increased benefits i am sure we can find a way to surface up some detail.

Hi @PaulSandoz 

I compared the performance of the partial in-lining changes with THRESHOLD (the Java-side threshold in ArraysSupport.mismatch*) set to ZERO against the existing values. The motivation for setting THRESHOLD to ZERO was to compare the Java-side scalar compare loop with the newly introduced inline sequence. The important point here is that both the scalar tail loop and the in-lined sequence are JITed and bypass the heavy vectorizedMismatch stub call.

Existing Java-side thresholds in the ArraysSupport.mismatch* routines, at or below which the scalar tail loop handles the comparison:
 Byte : 7
 Char/Short : 3
 Integer/Float : 1
 Long/Double :  0
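
To make the role of the threshold concrete, here is a minimal sketch of the dispatch pattern for the byte case. The names `ThresholdSketch`, `BYTE_THRESHOLD` and `wideMismatch` are hypothetical, and `wideMismatch` is only a plain-Java stand-in (8 bytes at a time) for the intrinsic stub, not the JDK implementation. It assumes the stub's convention of returning either the mismatch index or the bitwise complement of the number of elements left for the scalar tail.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class ThresholdSketch {
    // Hypothetical constant mirroring the byte threshold listed above.
    static final int BYTE_THRESHOLD = 7;

    // Stand-in for the vectorized path: compares 8 bytes per step.
    // Returns the index of the first mismatch, or ~remaining when every
    // full 8-byte chunk matched and 'remaining' bytes are left over.
    static int wideMismatch(byte[] a, byte[] b, int length) {
        ByteBuffer ba = ByteBuffer.wrap(a).order(ByteOrder.LITTLE_ENDIAN);
        ByteBuffer bb = ByteBuffer.wrap(b).order(ByteOrder.LITTLE_ENDIAN);
        int i = 0;
        for (; i + 8 <= length; i += 8) {
            long diff = ba.getLong(i) ^ bb.getLong(i);
            if (diff != 0) {
                // Little-endian: the lowest set bit belongs to the first differing byte.
                return i + (Long.numberOfTrailingZeros(diff) >>> 3);
            }
        }
        return ~(length - i);
    }

    // Lengths at or below the threshold go straight to the scalar loop;
    // longer inputs use the wide path first and finish with the scalar tail.
    static int mismatch(byte[] a, byte[] b, int length) {
        int i = 0;
        if (length > BYTE_THRESHOLD) {
            i = wideMismatch(a, b, length);
            if (i >= 0) {
                return i;
            }
            i = length - ~i; // resume after the prefix that already matched
        }
        for (; i < length; i++) {
            if (a[i] != b[i]) {
                return i;
            }
        }
        return -1;
    }
}
```

Setting `BYTE_THRESHOLD` to 0 in this sketch sends every non-empty comparison through the wide path first, which is the effect the THRESHOLD = ZERO experiment relies on.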

Observations:
1) The scalar loop with the existing thresholds performs better than the inline sequence for all primitive types except byte.
2) For byte, the scalar loop performs better than the new in-lined sequence for comparison lengths <= 3. For lengths > 3 and <= 7, the new in-lined sequence gives good performance.

For all the other cases listed below, which until now called the vectorizedMismatch stub, partial in-lining shows significant gains.

```
                                   (UsePartialInlineSize = 32)           (UsePartialInlineSize = 64)
             (elem cnt/bytes)   AVX3 - YMM register size = 32 bytes   AVX3 - ZMM register size = 64 bytes
Byte      =  7 (7 bytes)               25 (25 bytes)                         57 (57 bytes)
Short     =  3 (6 bytes)               13 (26 bytes)                         29 (58 bytes)
Int/Float =  1 (4 bytes)                7 (28 bytes)                         15 (60 bytes)
Long      =  0 (0 bytes)                4 (32 bytes)                          8 (64 bytes)
```
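
As a rough cross-check of the figures in the table above: they appear consistent with "elements that fit in one vector, minus the existing threshold". That reading is inferred from the numbers rather than stated in this message, and the class and method names below are hypothetical.

```java
class PartialInlineRange {
    // Element counts above the Java-side threshold that still fit in one vector
    // (assumed interpretation of the table columns).
    static int coveredCounts(int vectorBytes, int elemBytes, int threshold) {
        return vectorBytes / elemBytes - threshold;
    }

    public static void main(String[] args) {
        String[] names = {"Byte", "Short", "Int/Float", "Long"};
        int[] elemBytes = {1, 2, 4, 8};
        int[] thresholds = {7, 3, 1, 0};
        for (int k = 0; k < names.length; k++) {
            int c32 = coveredCounts(32, elemBytes[k], thresholds[k]);
            int c64 = coveredCounts(64, elemBytes[k], thresholds[k]);
            // Print the count and its size in bytes for each vector width.
            System.out.printf("%-10s %2d (%2d bytes)   %2d (%2d bytes)%n",
                    names[k], c32, c32 * elemBytes[k], c64, c64 * elemBytes[k]);
        }
    }
}
```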

Thus the only scope for tuning the existing thresholds is the byte primitive type when the compare length is > 3 and <= 7. The performance variation around these lengths is as follows.

Benchmark | SIZE | Baseline (ops/ms) | PI32 (ops/ms, Threshold=0) | PI64 (ops/ms, Threshold=0)
-- | -- | -- | -- | --
ArraysMismatchPartialInlining.testByteMatch | 3 | 209915.663 | 175700.411 | 167672.548
ArraysMismatchPartialInlining.testByteMatch | 4 | 157757.866 | 187887.81 | 178366.916
ArraysMismatchPartialInlining.testByteMatch | 5 | 181182.854 | 172835.708 | 154118.205
ArraysMismatchPartialInlining.testByteMatch | 6 | 146279.651 | 173526.975 | 151229.364
ArraysMismatchPartialInlining.testByteMatch | 7 | 139099.287 | 171715.691 | 127025.152
ArraysMismatchPartialInlining.testByteMatch | 15 | 127720.176 | 179272.779 | 161146.445
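
For readers who prefer ratios to raw throughput, the following small sketch divides each PI32 and PI64 figure from the table by its baseline. The class and method names are hypothetical and the data is copied from the rows above. A ratio above 1.0 means the partial in-lining variant was faster for that size.

```java
class RelativeThroughput {
    // Ratio > 1.0 means the candidate beat the baseline.
    static double ratio(double candidate, double baseline) {
        return candidate / baseline;
    }

    public static void main(String[] args) {
        int[] sizes = {3, 4, 5, 6, 7, 15};
        double[] baseline = {209915.663, 157757.866, 181182.854, 146279.651, 139099.287, 127720.176};
        double[] pi32 = {175700.411, 187887.81, 172835.708, 173526.975, 171715.691, 179272.779};
        double[] pi64 = {167672.548, 178366.916, 154118.205, 151229.364, 127025.152, 161146.445};
        for (int k = 0; k < sizes.length; k++) {
            System.out.printf("size %2d  PI32/base %.2f  PI64/base %.2f%n",
                    sizes[k], ratio(pi32[k], baseline[k]), ratio(pi64[k], baseline[k]));
        }
    }
}
```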

In general, it looks like we can keep the existing thresholds for the time being and take advantage of partial in-lining for the other cases where the comparison fits within one vector.

-------------

PR: https://git.openjdk.java.net/jdk/pull/3999