RFR: 8266951: Partial in-lining for vectorized mismatch operation using AVX512 masked instructions

Thu May 13 07:29:53 UTC 2021

On Wed, 12 May 2021 19:18:45 GMT, Paul Sandoz <psandoz at openjdk.org> wrote:

> My general preference is to retain the existing specification and tail loops. To do that it may be necessary to add platform specific threshold values. Can we investigate whether you can achieve such performance when threshold values are set to zero on platforms that support partial inlining of vectorizedMismatch?

Hi @PaulSandoz, we do have an AVX3Threshold value for platforms supporting the AVX512 feature; its default value is currently set to 4096 bytes. Through partial in-lining we are attempting to generate the comparison code at the call site without calling the stub.

Performance data shows gains for comparisons of sub-word types when the size is less than 32/64 bytes.
The following algorithm briefly describes the existing implementation of the ArraysSupport.mismatch routines.

ArraysSupport.mismatch() {
   if (length > THRESHOLD) {
      call ArraysSupport.vectorizedMismatch()   // This performs comparison using unsafe APIs at the granularity of 8 bytes.
   } else {
      for (i = 0; i < length; i++)
          scalar_comparison
   }
}
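
For readers who want something they can run, here is a minimal Java sketch of the same threshold dispatch. It is not the JDK implementation: the class name MismatchSketch, the fixed THRESHOLD of 7 (the byte[] value from the table) and the VarHandle-based 8-byte comparison are assumptions made for illustration, standing in for the intrinsified vectorizedMismatch stub.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

// Illustrative only: a stand-in for the threshold dispatch described above.
final class MismatchSketch {
    // Assumed threshold for byte[] (7 elements), taken from the table in this mail.
    static final int THRESHOLD = 7;

    // Reads 8 bytes of a byte[] as one long; plays the role of the 8-byte-granular stub.
    private static final VarHandle LONG_VIEW =
        MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.nativeOrder());

    /** Returns the index of the first differing byte in [0, length), or -1 if none. */
    static int mismatch(byte[] a, byte[] b, int length) {
        int i = 0;
        if (length > THRESHOLD) {
            // Wide path: skip over 8-byte chunks that are identical.
            for (; i + Long.BYTES <= length; i += Long.BYTES) {
                long x = (long) LONG_VIEW.get(a, i);
                long y = (long) LONG_VIEW.get(b, i);
                if (x != y) {
                    break; // the scalar loop below locates the exact byte
                }
            }
        }
        // Scalar path: short inputs, the differing chunk, or the tail.
        for (; i < length; i++) {
            if (a[i] != b[i]) {
                return i;
            }
        }
        return -1;
    }
}
```

Inputs of 7 bytes or fewer never reach the wide path, which is the range where the stub call overhead matters most.
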

Java THRESHOLD values for various primitive types and extra headroom which partial inlining offers:

              THRESHOLD           AVX3 - YMM register size = 32 bytes   AVX3 - ZMM register size = 64 bytes
              (elem cnt/bytes)    (UsePartialInlineSize = 32)           (UsePartialInlineSize = 64)
  Byte        7 (7 bytes)         25 (25 bytes)                         57 (57 bytes)
  Short       3 (6 bytes)         13 (26 bytes)                         29 (58 bytes)
  Int/Float   1 (4 bytes)          7 (28 bytes)                         15 (60 bytes)
  Long        0 (0 bytes)          4 (32 bytes)                          8 (64 bytes)
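
The headroom columns follow from simple arithmetic: the number of elements that fit in the register, minus the existing THRESHOLD. A small sketch that reproduces the table, where the class and method names are hypothetical and only the numbers come from the table above:

```java
// Illustrative only: reproduces the headroom columns of the table above.
final class HeadroomSketch {
    /** Extra element counts that partial inlining covers beyond the Java THRESHOLD. */
    static int headroomElements(int partialInlineSizeBytes, int elemBytes, int thresholdElems) {
        return partialInlineSizeBytes / elemBytes - thresholdElems;
    }

    public static void main(String[] args) {
        String[] names = {"Byte", "Short", "Int/Float", "Long"};
        int[] elemBytes = {1, 2, 4, 8};
        int[] thresholds = {7, 3, 1, 0}; // THRESHOLD column, in elements
        for (int size : new int[] {32, 64}) {
            for (int i = 0; i < names.length; i++) {
                int elems = headroomElements(size, elemBytes[i], thresholds[i]);
                System.out.println("UsePartialInlineSize=" + size + " " + names[i] + ": "
                    + elems + " elements (" + elems * elemBytes[i] + " bytes)");
            }
        }
    }
}
```
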

Thus, by JITing the comparison code at the call site we avoid the call overhead of the stub, which dominates the cost of small-sized compare operations.

The only penalty, which is also visible in the above performance data, is for comparison sizes above UsePartialInlineSize:
there we do an extra threshold comparison in the JITed code and probably incur a branch misprediction penalty, since the fast path is the block immediately after the comparison.

I can try to limit the patch to exploit only the extra headroom shown above, since those cases should benefit directly from partial inlining.

-------------

PR: https://git.openjdk.java.net/jdk/pull/3999