RFR: 8261553: Efficient mask generation using BMI2 BZHI instruction [v3]

Jatin Bhateja jbhateja at openjdk.java.net
Mon Feb 15 18:55:03 UTC 2021


On Fri, 12 Feb 2021 05:59:01 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>>> Hi Claes, this could be run-to-run variation; in general we now have fewer instructions (one shift operation saved per mask computation) than the previous mask generation sequence, and thus it will always offer better execution latencies.
>> 
>> Run-to-run variation would be easy to rule out by running more forks and more iterations to attain statistically significant results. While the instruction manuals suggest latency should be better for this instruction on all CPUs where it's supported, it would be good if there was some clear proof - such as a significant benchmark win - to motivate the added complexity.
> 
> BASELINE:
> Result "org.openjdk.bench.java.lang.ArrayCopyUnalignedSrc.testLong":
>   61.037 ns/op
> 
> Secondary result "org.openjdk.bench.java.lang.ArrayCopyUnalignedSrc.testLong:·perf":
> Perf stats:
> --------------------------------------------------
> 
>          19,739.21 msec task-clock                #    0.389 CPUs utilized          
>                646      context-switches          #    0.033 K/sec                  
>                 12      cpu-migrations            #    0.001 K/sec                  
>                150      page-faults               #    0.008 K/sec                  
>    74,59,83,59,139      cycles                    #    3.779 GHz                      (30.73%)
>  1,78,78,79,19,117      instructions              #    2.40  insn per cycle           (38.48%)
>    24,79,81,63,651      branches                  # 1256.289 M/sec                    (38.55%)
>       32,24,89,924      branch-misses             #    1.30% of all branches          (38.62%)
>    52,56,88,28,472      L1-dcache-loads           # 2663.167 M/sec                    (38.65%)
>          39,00,969      L1-dcache-load-misses     #    0.01% of all L1-dcache hits    (38.57%)
>           3,74,131      LLC-loads                 #    0.019 M/sec                    (30.77%)
>             22,315      LLC-load-misses           #    5.96% of all LL-cache hits     (30.72%)
>    <not supported>      L1-icache-loads                                             
>          17,49,997      L1-icache-load-misses                                         (30.72%)
>    52,91,41,70,636      dTLB-loads                # 2680.663 M/sec                    (30.69%)
>              3,315      dTLB-load-misses          #    0.00% of all dTLB cache hits   (30.67%)
>              4,674      iTLB-loads                #    0.237 K/sec                    (30.65%)
>             33,746      iTLB-load-misses          #  721.99% of all iTLB cache hits   (30.63%)
>    <not supported>      L1-dcache-prefetches                                        
>    <not supported>      L1-dcache-prefetch-misses                                   
> 
>       50.723759146 seconds time elapsed
> 
>       51.447054000 seconds user
>        0.189949000 seconds sys
> 
> 
> WITH OPT:
> Result "org.openjdk.bench.java.lang.ArrayCopyUnalignedSrc.testLong":
>   74.356 ns/op
> 
> Secondary result "org.openjdk.bench.java.lang.ArrayCopyUnalignedSrc.testLong:·perf":
> Perf stats:
> --------------------------------------------------
> 
>          19,741.09 msec task-clock                #    0.389 CPUs utilized          
>                641      context-switches          #    0.032 K/sec                  
>                 17      cpu-migrations            #    0.001 K/sec                  
>                164      page-faults               #    0.008 K/sec                  
>    74,40,40,48,513      cycles                    #    3.769 GHz                      (30.81%)
>  1,45,66,22,06,797      instructions              #    1.96  insn per cycle           (38.56%)
>    20,31,28,43,577      branches                  # 1028.963 M/sec                    (38.65%)
>          14,11,419      branch-misses             #    0.01% of all branches          (38.69%)
>    43,07,86,33,662      L1-dcache-loads           # 2182.182 M/sec                    (38.72%)
>          37,06,744      L1-dcache-load-misses     #    0.01% of all L1-dcache hits    (38.56%)
>           1,34,292      LLC-loads                 #    0.007 M/sec                    (30.72%)
>             30,627      LLC-load-misses           #   22.81% of all LL-cache hits     (30.68%)
>    <not supported>      L1-icache-loads                                             
>          14,49,145      L1-icache-load-misses                                         (30.65%)
>    43,44,86,27,516      dTLB-loads                # 2200.924 M/sec                    (30.63%)
>                218      dTLB-load-misses          #    0.00% of all dTLB cache hits   (30.63%)
>              2,445      iTLB-loads                #    0.124 K/sec                    (30.63%)
>             28,624      iTLB-load-misses          # 1170.72% of all iTLB cache hits   (30.63%)
>    <not supported>      L1-dcache-prefetches                                        
>    <not supported>      L1-dcache-prefetch-misses                                   
> 
>       50.716083931 seconds time elapsed
> 
>       51.467300000 seconds user
>        0.200390000 seconds sys
> 
> 
> JMH perf data for ArrayCopyUnalignedSrc.testLong with a copy length of 1200 shows a degradation in LSD accesses; it seems the benchmark got displaced from its sweet spot.
> 
> However, there is a significant reduction in instruction count, and cycles are almost comparable: we are saving one shift per mask computation.
> 
>           OLD Sequence:
>               0x00007f7fc1030ead:   movabs $0x1,%rax
>               0x00007f7fc1030eb7:   shlx   %r8,%rax,%rax
>               0x00007f7fc1030ebc:   dec    %rax
>               0x00007f7fc1030ebf:   kmovq  %rax,%k2
>           NEW Sequence:
>               0x00007f775d030d51:   movabs $0xffffffffffffffff,%rax
>               0x00007f775d030d5b:   bzhi   %r8,%rax,%rax
>               0x00007f775d030d60:   kmovq  %rax,%k2
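
For reference, a minimal C sketch (my own illustration, not code from the patch) showing that the two sequences compute the same mask, using the BMI2 intrinsic:

    #include <stdint.h>
    #include <immintrin.h>   /* _bzhi_u64 (BMI2) */

    /* Old sequence: movabs $1 ; shlx ; dec -- three instructions. */
    static inline uint64_t mask_shift(unsigned n) {
        return (1ULL << n) - 1;        /* undefined for n >= 64 */
    }

    /* New sequence: movabs $-1 ; bzhi -- one shift saved. */
    static inline uint64_t mask_bzhi(unsigned n) {
        return _bzhi_u64(~0ULL, n);    /* zeroes bits [n, 63]; n >= 64 keeps all bits */
    }

Both return a mask with the low n bits set, ready to be moved into a k-register with kmovq.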

Further analysis of the perf degradation revealed that the new optimized instruction pattern disturbed the code alignment. This led to an increase in LSD misses and also reduced uop caching in the DSB.
Aligning the copy loops at a 32-byte boundary prevents any adverse impact on uop caching.
The NOPs used for padding add to the instruction count and thus may overshadow the code-size gains from the new mask generation sequence in the copy stubs.
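
As an illustration of the alignment fix (a hand-written C sketch, not the actual stub-generator change, and assuming a GNU toolchain), a .p2align directive pads up to the next 32-byte boundary with NOPs:

    #include <stddef.h>

    void copy_longs(long *dst, const long *src, size_t n) {
        /* Ask the assembler to pad to a 32-byte boundary with NOPs so
         * the loop below starts DSB/LSD-friendly.  The exact loop
         * placement still depends on the compiler. */
        __asm__ volatile (".p2align 5");
        for (size_t i = 0; i < n; i++)
            dst[i] = src[i];
    }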

Baseline:
 ArrayCopyAligned.testLong   Length : 1200    61 ns/op (approx)
 1,93,44,43,11,622      cycles                                                      
 4,59,57,99,78,727      instructions              #    2.38  insn per cycle         
 1,83,68,75,68,255      idq.dsb_uops                                                
 2,08,32,43,71,906      lsd.uops                                                    
   37,12,54,60,211      idq.mite_uops                                               

With Opt:
 ArrayCopyAligned.testLong   Length : 1200   74 ns/op (approx)
 1,93,51,25,94,766      cycles                                                      
 3,75,11,57,91,917      instructions              #    1.94  insn per cycle         
   48,67,58,25,566      idq.dsb_uops                                                
      19,46,13,236      lsd.uops                                                    
 2,87,42,95,74,280      idq.mite_uops                                               

With Opt + main loop alignment (NOP):
 ArrayCopyAligned.testLong   Length : 1200   61 ns/op (approx)
 1,93,52,15,90,080      cycles                                                      
 4,60,89,14,06,528      instructions              #    2.38  insn per cycle         
 1,78,76,10,34,991      idq.dsb_uops                                                
 2,09,16,15,84,313      lsd.uops                                                    
   46,25,31,92,101      idq.mite_uops            


While computing the mask for the partial in-lining of small copy calls (currently enabled for sub-word types with a copy length below 32/64 bytes), the new optimized sequence should always offer a lower instruction count and a shorter latency path.
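
A rough C intrinsics sketch of that idea (illustrative only; the JIT emits this directly as machine code, and the helper name is mine), assuming AVX-512BW and BMI2:

    #include <stdint.h>
    #include <immintrin.h>

    /* Copy len (< 64) bytes with one masked load/store pair; the
     * BZHI-built mask replaces the old ((1 << len) - 1) computation. */
    static void copy_small(uint8_t *dst, const uint8_t *src, unsigned len) {
        __mmask64 k = (__mmask64)_bzhi_u64(~0ULL, len); /* low 'len' bits set */
        __m512i v  = _mm512_maskz_loadu_epi8(k, src);   /* load only masked lanes */
        _mm512_mask_storeu_epi8(dst, k, v);             /* store only masked lanes */
    }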


Baseline:
ArrayCopyAligned.testByte        Length : 20  avgt    2  2.635          ns/op
 1,97,76,75,18,052      cycles                                                      
 8,96,00,37,11,803      instructions              #    4.53  insn per cycle         
    2,71,83,79,035      idq.dsb_uops                                                
 7,54,82,43,63,409      lsd.uops                                                    
    3,92,55,74,395      idq.mite_uops                                               

ArrayCopyAligned.testByte        Length : 31  avgt    2  2.635          ns/op
 1,97,79,16,56,787      cycles                                                      
 8,96,13,15,69,780      instructions              #    4.53  insn per cycle         
    2,69,07,11,691      idq.dsb_uops                                                
 7,54,95,63,77,683      lsd.uops                                                    
    3,90,19,10,747      idq.mite_uops                                               

WithOpt:
ArrayCopyAligned.testByte        Length : 20  avgt    2  2.635          ns/op
 1,97,66,64,62,541      cycles                                                      
 8,92,03,95,00,236      instructions              #    4.51  insn per cycle         
    2,72,38,56,205      idq.dsb_uops                                                
 7,50,87,50,60,591      lsd.uops                                                    
    3,89,15,02,954      idq.mite_uops                                               

ArrayCopyAligned.testByte        Length : 31  avgt    2  2.635          ns/op
 1,97,54,21,61,110      cycles                                                      
 8,91,46,64,23,754      instructions              #    4.51  insn per cycle         
    2,78,12,19,544      idq.dsb_uops                                                
 7,50,35,88,95,843      lsd.uops                                                    
    3,90,41,97,276      idq.mite_uops                               
                

Following are the links to updated JMH perf data:
http://cr.openjdk.java.net/~jbhateja/8261553/JMH_PERF_CLX_BASELINE.txt
http://cr.openjdk.java.net/~jbhateja/8261553/JMH_PERF_CLX_WITH_OPTS_LOOP_ALIGN.txt

In general, the gains are not significant for the copy stubs, but the new sequence offers an optimal latency path for mask computation.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2522

