RFR: 8261553: Efficient mask generation using BMI2 BZHI instruction [v2]

Fri Feb 12 06:01:37 UTC 2021

On Thu, 11 Feb 2021 14:28:01 GMT, Claes Redestad <redestad at openjdk.org> wrote:

> > Hi Claes, this could be run-to-run variation. In general we now have fewer instructions (one shift operation saved per mask computation) compared to the previous mask generation sequence, so it should offer better execution latency.
> 
> Run-to-run variation would be easy to rule out by running more forks and more iterations to attain statistically significant results. While the instruction manuals suggest latency should be better for this instruction on all CPUs where it's supported, it would be good if there was some clear proof - such as a significant benchmark win - to motivate the added complexity.

BASELINE:
Result "org.openjdk.bench.java.lang.ArrayCopyUnalignedSrc.testLong":
  61.037 ns/op

Secondary result "org.openjdk.bench.java.lang.ArrayCopyUnalignedSrc.testLong:·perf":
Perf stats:
--------------------------------------------------

         19,739.21 msec task-clock                #    0.389 CPUs utilized          
               646      context-switches          #    0.033 K/sec                  
                12      cpu-migrations            #    0.001 K/sec                  
               150      page-faults               #    0.008 K/sec                  
   74,59,83,59,139      cycles                    #    3.779 GHz                      (30.73%)
 1,78,78,79,19,117      instructions              #    2.40  insn per cycle           (38.48%)
   24,79,81,63,651      branches                  # 1256.289 M/sec                    (38.55%)
      32,24,89,924      branch-misses             #    1.30% of all branches          (38.62%)
   52,56,88,28,472      L1-dcache-loads           # 2663.167 M/sec                    (38.65%)
         39,00,969      L1-dcache-load-misses     #    0.01% of all L1-dcache hits    (38.57%)
          3,74,131      LLC-loads                 #    0.019 M/sec                    (30.77%)
            22,315      LLC-load-misses           #    5.96% of all LL-cache hits     (30.72%)
   <not supported>      L1-icache-loads                                             
         17,49,997      L1-icache-load-misses                                         (30.72%)
   52,91,41,70,636      dTLB-loads                # 2680.663 M/sec                    (30.69%)
             3,315      dTLB-load-misses          #    0.00% of all dTLB cache hits   (30.67%)
             4,674      iTLB-loads                #    0.237 K/sec                    (30.65%)
            33,746      iTLB-load-misses          #  721.99% of all iTLB cache hits   (30.63%)
   <not supported>      L1-dcache-prefetches                                        
   <not supported>      L1-dcache-prefetch-misses                                   

      50.723759146 seconds time elapsed

      51.447054000 seconds user
       0.189949000 seconds sys

WITH OPT:
Result "org.openjdk.bench.java.lang.ArrayCopyUnalignedSrc.testLong":
  74.356 ns/op

Secondary result "org.openjdk.bench.java.lang.ArrayCopyUnalignedSrc.testLong:·perf":
Perf stats:
--------------------------------------------------

         19,741.09 msec task-clock                #    0.389 CPUs utilized          
               641      context-switches          #    0.032 K/sec                  
                17      cpu-migrations            #    0.001 K/sec                  
               164      page-faults               #    0.008 K/sec                  
   74,40,40,48,513      cycles                    #    3.769 GHz                      (30.81%)
 1,45,66,22,06,797      instructions              #    1.96  insn per cycle           (38.56%)
   20,31,28,43,577      branches                  # 1028.963 M/sec                    (38.65%)
         14,11,419      branch-misses             #    0.01% of all branches          (38.69%)
   43,07,86,33,662      L1-dcache-loads           # 2182.182 M/sec                    (38.72%)
         37,06,744      L1-dcache-load-misses     #    0.01% of all L1-dcache hits    (38.56%)
          1,34,292      LLC-loads                 #    0.007 M/sec                    (30.72%)
            30,627      LLC-load-misses           #   22.81% of all LL-cache hits     (30.68%)
   <not supported>      L1-icache-loads                                             
         14,49,145      L1-icache-load-misses                                         (30.65%)
   43,44,86,27,516      dTLB-loads                # 2200.924 M/sec                    (30.63%)
               218      dTLB-load-misses          #    0.00% of all dTLB cache hits   (30.63%)
             2,445      iTLB-loads                #    0.124 K/sec                    (30.63%)
            28,624      iTLB-load-misses          # 1170.72% of all iTLB cache hits   (30.63%)
   <not supported>      L1-dcache-prefetches                                        
   <not supported>      L1-dcache-prefetch-misses                                   

      50.716083931 seconds time elapsed

      51.467300000 seconds user
       0.200390000 seconds sys

JMH perf data for ArrayCopyUnalignedSrc.testLong with a copy length of 1200 shows a degradation in L1D accesses; it seems the benchmark got displaced from its sweet spot.

However, there is a significant reduction in instruction count, and the cycle counts are nearly identical. We are saving one instruction per mask computation: the shlx plus dec pair becomes a single bzhi.
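
To put numbers on that comparison, here is a small sketch that recomputes the relative changes from the perf output above. The class and method names (PerfDelta, parseCount, pctChange) are made up for illustration. It assumes the counts were printed with Indian-style digit grouping, so stripping the commas recovers the raw value.

```java
public class PerfDelta {
    // perf printed the counters with locale-specific (Indian) digit grouping;
    // removing the separators yields the plain count.
    static long parseCount(String s) {
        return Long.parseLong(s.replace(",", "").trim());
    }

    // Relative change from 'before' to 'after', in percent (negative = reduction).
    static double pctChange(double before, double after) {
        return (after - before) / before * 100.0;
    }

    public static void main(String[] args) {
        long insnBase = parseCount("1,78,78,79,19,117");
        long insnOpt  = parseCount("1,45,66,22,06,797");
        long cycBase  = parseCount("74,59,83,59,139");
        long cycOpt   = parseCount("74,40,40,48,513");

        // Instruction count drops by roughly 18.5%.
        System.out.printf("instructions: %+.2f%%%n", pctChange(insnBase, insnOpt));
        // Cycle count changes by well under 1%.
        System.out.printf("cycles:       %+.2f%%%n", pctChange(cycBase, cycOpt));
        // The reported JMH score (ns/op) for the same two runs.
        System.out.printf("ns/op:        %+.2f%%%n", pctChange(61.037, 74.356));
    }
}
```

The counters are multiplexed (see the percentages in the last perf column), so these ratios are approximate.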

          OLD Sequence:
              0x00007f7fc1030ead:   movabs $0x1,%rax
              0x00007f7fc1030eb7:   shlx   %r8,%rax,%rax
              0x00007f7fc1030ebc:   dec    %rax
              0x00007f7fc1030ebf:   kmovq  %rax,%k2
          NEW Sequence:
              0x00007f775d030d51:   movabs $0xffffffffffffffff,%rax
              0x00007f775d030d5b:   bzhi   %r8,%rax,%rax
              0x00007f775d030d60:   kmovq  %rax,%k2
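
For readers who want to check the two sequences against each other, here is a minimal Java model of both. This is not the HotSpot stub code. MaskSketch and its methods are hypothetical names, and bzhi() is a software emulation written from the instruction's documented behavior: the index is taken from the low 8 bits, and the source is left unchanged when the index is 64 or more.

```java
public class MaskSketch {
    // Old sequence: movabs $0x1; shlx n; dec  =>  (1 << n) - 1.
    // Like shlx on a 64-bit operand, Java's << on long uses only the low
    // 6 bits of the shift count.
    static long oldMask(int n) {
        return (1L << n) - 1;
    }

    // Software model of 64-bit BZHI: zero all bits of src at positions >= index.
    // Only the low 8 bits of the index are used; an index of 64 or more
    // leaves src unchanged.
    static long bzhi(long src, int index) {
        int idx = index & 0xFF;
        if (idx >= 64) {
            return src;
        }
        return src & ((1L << idx) - 1);
    }

    // New sequence: movabs $0xffffffffffffffff; bzhi n.
    static long newMask(int n) {
        return bzhi(-1L, n);
    }

    public static void main(String[] args) {
        // Both sequences agree for every length from 0 to 63.
        for (int n = 0; n < 64; n++) {
            if (oldMask(n) != newMask(n)) {
                throw new AssertionError("mismatch at n=" + n);
            }
        }
        System.out.println(Long.toHexString(newMask(5)));   // 1f
        // At n = 64 the two differ: the shift count wraps to 0, BZHI keeps all bits.
        System.out.println(Long.toHexString(oldMask(64)));  // 0
        System.out.println(Long.toHexString(newMask(64)));  // ffffffffffffffff
    }
}
```

In this model the two forms produce the same mask for lengths 0 through 63 and differ only at a length of exactly 64.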

-------------

PR: https://git.openjdk.java.net/jdk/pull/2522