[aarch64-port-dev ] AARCH64 optimization: using TBZ instruction for bit check

Boris Ulasevich boris.ulasevich at bell-sw.com
Fri Jun 12 18:10:13 UTC 2020


Hi all,

Please review the new AArch64 instruction selection rules.
The change uses the TBZ instruction for bit checks such as
"if ((var & 16) == 16)".
This gives a 17% performance improvement on the benchmark and 5% on a
real application.
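
To illustrate the kind of check meant here (a sketch only, not part of the
webrev; the class and method names are hypothetical): when the mask has a
single bit set, comparing the masked value against the mask is the same as
testing that one bit, which is the test TBZ/TBNZ performs before branching.

```java
public class BitCheckSketch {
    // The form written in Java source: compare the masked value with the mask.
    public static boolean maskEquals(int value, int mask) {
        return (value & mask) == mask;
    }

    // Equivalent single-bit test: is bit number `bit` of `value` set?
    public static boolean bitIsSet(int value, int bit) {
        return ((value >>> bit) & 1) != 0;
    }

    public static void main(String[] args) {
        int mask = 16;                                  // exactly one bit set
        int bit = Integer.numberOfTrailingZeros(mask);  // index of that bit
        for (int value = 0; value < 64; value++) {
            if (maskEquals(value, mask) != bitIsSet(value, bit)) {
                throw new AssertionError("mismatch at " + value);
            }
        }
        System.out.println("mask " + mask + " tests bit " + bit);
    }
}
```

The equivalence holds only for single-bit masks; a mask like 48 depends on
two bits and cannot be reduced to one bit test.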

http://bugs.openjdk.java.net/browse/JDK-8247408
http://cr.openjdk.java.net/~bulasevich/8247408/webrev.00

- I excluded the far branch test from the full change because it takes a
long time to run, and I'm not sure C2 will keep its current behaviour:
http://cr.openjdk.java.net/~bulasevich/8247408/webrev.00.plus

The change was tested with jtreg in fastdebug mode: no regressions.

thanks,
Boris

========================================================================================
Benchmark                                              Mode   Cnt        Score     Error  Units        Score     Error
TBZBenchmark.cmpAndBranch2Tbz                          thrpt   25  1329060.879 ±  42.780  ops/s  1504990.708 ± 158.096
TBZBenchmark.cmpAndBranch2Tbz:CPI                      thrpt    5        0.325 ±   0.001  #/op         0.410 ±   0.001
TBZBenchmark.cmpAndBranch2Tbz:L1-dcache-load-misses    thrpt    5        0.019 ±   0.031  #/op         0.018 ±   0.025
TBZBenchmark.cmpAndBranch2Tbz:L1-dcache-loads          thrpt    5       16.811 ±   0.791  #/op        16.809 ±   0.914
TBZBenchmark.cmpAndBranch2Tbz:L1-dcache-store-misses   thrpt    5        0.016 ±   0.017  #/op         0.014 ±   0.022
TBZBenchmark.cmpAndBranch2Tbz:L1-dcache-stores         thrpt    5       16.704 ±   0.634  #/op        16.771 ±   0.539
TBZBenchmark.cmpAndBranch2Tbz:L1-icache-load-misses    thrpt    5        0.017 ±   0.027  #/op         0.016 ±   0.023
TBZBenchmark.cmpAndBranch2Tbz:L1-icache-loads          thrpt    5     1811.848 ±   3.552  #/op      1148.737 ±   2.993
TBZBenchmark.cmpAndBranch2Tbz:branch-misses            thrpt    5        1.013 ±   0.009  #/op         1.011 ±   0.018
TBZBenchmark.cmpAndBranch2Tbz:cycles                   thrpt    5     1882.193 ±   3.799  #/op      1662.994 ±   5.935
TBZBenchmark.cmpAndBranch2Tbz:dTLB-load-misses         thrpt    5        0.004 ±   0.008  #/op         0.005 ±   0.016
TBZBenchmark.cmpAndBranch2Tbz:dTLB-loads               thrpt    5       16.687 ±   0.732  #/op        16.669 ±   0.958
TBZBenchmark.cmpAndBranch2Tbz:iTLB-load-misses         thrpt    5        0.003 ±   0.009  #/op         0.003 ±   0.008
TBZBenchmark.cmpAndBranch2Tbz:iTLB-loads               thrpt    5     1586.390 ±   2.612  #/op      1353.981 ±   3.469
TBZBenchmark.cmpAndBranch2Tbz:instructions             thrpt    5     5791.824 ±  15.362  #/op      4055.443 ±  17.785
TBZBenchmark.cmpAndBranch2Tbz:stalled-cycles-backend   thrpt    5        5.279 ±   1.968  #/op        20.459 ±   5.258
TBZBenchmark.cmpAndBranch2Tbz:stalled-cycles-frontend  thrpt    5       66.808 ±   0.700  #/op        12.738 ±   1.040

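import org.openjdk.jmh.annotations.Benchmark;

public class TBZBenchmark {
    // Branches on a single bit (bit 5, mask 32) of the loop counter.
    @Benchmark
    public int cmpAndBranch2Tbz() {
        int count = 0;
        for (int value = 0; value < 1000; value++) {
            if ((value & 32) == 32) {
                count--;
            } else {
                count++;
            }
        }
        return count;
    }
}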
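
For anyone who wants to sanity-check the loop body without a JMH setup, here
is a plain-Java sketch (the class and method names are hypothetical, not part
of the webrev). It runs the same loop and compares the result with a count
derived directly from bit 5 of each value.

```java
public class TbzLoopCheck {
    // Same loop body as the benchmark method, without the JMH annotation.
    public static int cmpAndBranch() {
        int count = 0;
        for (int value = 0; value < 1000; value++) {
            if ((value & 32) == 32) {
                count--;
            } else {
                count++;
            }
        }
        return count;
    }

    // Independent derivation: count values with bit 5 set, then
    // (values with the bit clear) minus (values with the bit set).
    public static int expectedCount() {
        int set = 0;
        for (int value = 0; value < 1000; value++) {
            set += (value >>> 5) & 1;
        }
        return (1000 - set) - set;
    }

    public static void main(String[] args) {
        System.out.println(cmpAndBranch() + " " + expectedCount()); // prints "24 24"
    }
}
```

Of the 1000 values, 488 have bit 5 set and 512 do not, so both methods
return 24.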


