AARCH64 optimization: using TBZ instruction for bit check

Boris Ulasevich boris.ulasevich at bell-sw.com
Sat Jun 13 18:24:50 UTC 2020


Hi Eric,

Ok. Here is the webrev with JMH:
http://cr.openjdk.java.net/~bulasevich/8247408/webrev.01

Thank you,
Boris

On 12.06.2020 21:24, eric.caspole at oracle.com wrote:
> Hi Boris,
> Could you add the JMH to your webrev under
> test/micro/org/openjdk/bench/?
> Thanks,
> Eric
>
>
> On 6/12/20 2:10 PM, Boris Ulasevich wrote:
>> Hi all,
>>
>> Please review the new AARCH64 instruction selection rules.
>> The change uses the TBZ instruction for single-bit checks such as
>> "if ((var & 16) == 16)".
>> This gives a 17% performance improvement on the benchmark and 5% on a
>> real application.
>>
>> http://bugs.openjdk.java.net/browse/JDK-8247408
>> http://cr.openjdk.java.net/~bulasevich/8247408/webrev.00
>>
>> - I excluded the far branch test from the full change because it takes
>> a long time to run, and I'm not sure C2 will keep its current behaviour:
>> http://cr.openjdk.java.net/~bulasevich/8247408/webrev.00.plus
>>
>> The change was tested with jtreg in fastdebug mode: no regressions.
>>
>> thanks,
>> Boris
>>
>> ======================================================================================== 
>>
>> Benchmark                                              Mode  Cnt        Score     Error  Units        Score     Error
>> TBZBenchmark.cmpAndBranch2Tbz                          thrpt  25  1329060.879 ±  42.780  ops/s  1504990.708 ± 158.096
>> TBZBenchmark.cmpAndBranch2Tbz:CPI                      thrpt   5        0.325 ±   0.001  #/op         0.410 ±   0.001
>> TBZBenchmark.cmpAndBranch2Tbz:L1-dcache-load-misses    thrpt   5        0.019 ±   0.031  #/op         0.018 ±   0.025
>> TBZBenchmark.cmpAndBranch2Tbz:L1-dcache-loads          thrpt   5       16.811 ±   0.791  #/op        16.809 ±   0.914
>> TBZBenchmark.cmpAndBranch2Tbz:L1-dcache-store-misses   thrpt   5        0.016 ±   0.017  #/op         0.014 ±   0.022
>> TBZBenchmark.cmpAndBranch2Tbz:L1-dcache-stores         thrpt   5       16.704 ±   0.634  #/op        16.771 ±   0.539
>> TBZBenchmark.cmpAndBranch2Tbz:L1-icache-load-misses    thrpt   5        0.017 ±   0.027  #/op         0.016 ±   0.023
>> TBZBenchmark.cmpAndBranch2Tbz:L1-icache-loads          thrpt   5     1811.848 ±   3.552  #/op      1148.737 ±   2.993
>> TBZBenchmark.cmpAndBranch2Tbz:branch-misses            thrpt   5        1.013 ±   0.009  #/op         1.011 ±   0.018
>> TBZBenchmark.cmpAndBranch2Tbz:cycles                   thrpt   5     1882.193 ±   3.799  #/op      1662.994 ±   5.935
>> TBZBenchmark.cmpAndBranch2Tbz:dTLB-load-misses         thrpt   5        0.004 ±   0.008  #/op         0.005 ±   0.016
>> TBZBenchmark.cmpAndBranch2Tbz:dTLB-loads               thrpt   5       16.687 ±   0.732  #/op        16.669 ±   0.958
>> TBZBenchmark.cmpAndBranch2Tbz:iTLB-load-misses         thrpt   5        0.003 ±   0.009  #/op         0.003 ±   0.008
>> TBZBenchmark.cmpAndBranch2Tbz:iTLB-loads               thrpt   5     1586.390 ±   2.612  #/op      1353.981 ±   3.469
>> TBZBenchmark.cmpAndBranch2Tbz:instructions             thrpt   5     5791.824 ±  15.362  #/op      4055.443 ±  17.785
>> TBZBenchmark.cmpAndBranch2Tbz:stalled-cycles-backend   thrpt   5        5.279 ±   1.968  #/op        20.459 ±   5.258
>> TBZBenchmark.cmpAndBranch2Tbz:stalled-cycles-frontend  thrpt   5       66.808 ±   0.700  #/op        12.738 ±   1.040
>>
>> public class TBZBenchmark {
>>      @Benchmark
>>      public int cmpAndBranch2Tbz() {
>>          int count = 0;
>>          for (int value = 0; value < 1000; value++) {
>>              if ((value & 32) == 32) {
>>                  count--;
>>              } else {
>>                  count++;
>>              }
>>          }
>>          return count;
>>      }
>> }
>>
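As a rough illustration of why this pattern suits a test-bit-and-branch instruction: the mask 32 has exactly one bit set (bit 5), so "(value & 32) == 32" asks only whether that one bit is set. The sketch below is not part of the webrev; the class and method names are hypothetical, and it checks only that the two source-level forms agree. It makes no claim about which machine code C2 emits for either form.

```java
// Hypothetical illustration class; not part of the webrev under review.
class BitCheckSketch {
    // Same loop body as TBZBenchmark.cmpAndBranch2Tbz in the quoted message.
    static int maskCompareLoop() {
        int count = 0;
        for (int value = 0; value < 1000; value++) {
            if ((value & 32) == 32) {
                count--;
            } else {
                count++;
            }
        }
        return count;
    }

    // Equivalent form that tests bit 5 directly, since 32 == 1 << 5.
    static int singleBitLoop() {
        int count = 0;
        for (int value = 0; value < 1000; value++) {
            if (((value >>> 5) & 1) != 0) {
                count--;
            } else {
                count++;
            }
        }
        return count;
    }

    // A mask is a single-bit mask when exactly one of its bits is set.
    static boolean isSingleBitMask(int mask) {
        return Integer.bitCount(mask) == 1;
    }

    public static void main(String[] args) {
        // Both loops should agree because the two conditions are equivalent.
        System.out.println(maskCompareLoop() == singleBitLoop());
        // Index of the bit that a single-bit mask selects.
        System.out.println(Integer.numberOfTrailingZeros(32));
    }
}
```

A mask with more than one bit set, such as 48, would not qualify: "(value & 48) == 48" depends on two bits, so it cannot be decided by inspecting a single bit.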


