AARCH64 optimization: using TBZ instruction for bit check

Mon Jun 15 20:53:33 UTC 2020

Thanks, the JMH looks good.
Eric

On 6/13/20 2:24 PM, Boris Ulasevich wrote:
> Hi Eric,
> 
> Ok. Here is the webrev with JMH:
> http://cr.openjdk.java.net/~bulasevich/8247408/webrev.01
> 
> Thank you,
> Boris
> 
> On 12.06.2020 21:24, eric.caspole at oracle.com wrote:
>> Hi Boris,
>> Could you add the JMH to your webrev under
>> test/micro/org/openjdk/bench/?
>> Thanks,
>> Eric
>>
>>
>> On 6/12/20 2:10 PM, Boris Ulasevich wrote:
>>> Hi all,
>>>
>>> Please review the new AARCH64 instruction selection rules.
>>> The change applies the TBZ instruction to bit checks such as
>>> "if ((var&16) == 16)".
>>> This gives a 17% performance improvement on the benchmark and 5% on a
>>> real application.
>>>
>>> http://bugs.openjdk.java.net/browse/JDK-8247408
>>> http://cr.openjdk.java.net/~bulasevich/8247408/webrev.00
>>>
>>> - I excluded the far branch test from the full change because it takes
>>> a long time to run, and I am not sure C2 will keep its current behaviour:
>>> http://cr.openjdk.java.net/~bulasevich/8247408/webrev.00.plus
>>>
>>> The change was tested with jtreg in fastdebug mode: no regressions.
>>>
>>> thanks,
>>> Boris
>>>
>>> ========================================================================================
>>>
>>> Benchmark                                               Mode  Cnt        Score    Error  Units        Score     Error
>>> TBZBenchmark.cmpAndBranch2Tbz                          thrpt   25  1329060.879 ± 42.780  ops/s  1504990.708 ± 158.096
>>> TBZBenchmark.cmpAndBranch2Tbz:CPI                      thrpt    5        0.325 ±  0.001   #/op        0.410 ±   0.001
>>> TBZBenchmark.cmpAndBranch2Tbz:L1-dcache-load-misses    thrpt    5        0.019 ±  0.031   #/op        0.018 ±   0.025
>>> TBZBenchmark.cmpAndBranch2Tbz:L1-dcache-loads          thrpt    5       16.811 ±  0.791   #/op       16.809 ±   0.914
>>> TBZBenchmark.cmpAndBranch2Tbz:L1-dcache-store-misses   thrpt    5        0.016 ±  0.017   #/op        0.014 ±   0.022
>>> TBZBenchmark.cmpAndBranch2Tbz:L1-dcache-stores         thrpt    5       16.704 ±  0.634   #/op       16.771 ±   0.539
>>> TBZBenchmark.cmpAndBranch2Tbz:L1-icache-load-misses    thrpt    5        0.017 ±  0.027   #/op        0.016 ±   0.023
>>> TBZBenchmark.cmpAndBranch2Tbz:L1-icache-loads          thrpt    5     1811.848 ±  3.552   #/op     1148.737 ±   2.993
>>> TBZBenchmark.cmpAndBranch2Tbz:branch-misses            thrpt    5        1.013 ±  0.009   #/op        1.011 ±   0.018
>>> TBZBenchmark.cmpAndBranch2Tbz:cycles                   thrpt    5     1882.193 ±  3.799   #/op     1662.994 ±   5.935
>>> TBZBenchmark.cmpAndBranch2Tbz:dTLB-load-misses         thrpt    5        0.004 ±  0.008   #/op        0.005 ±   0.016
>>> TBZBenchmark.cmpAndBranch2Tbz:dTLB-loads               thrpt    5       16.687 ±  0.732   #/op       16.669 ±   0.958
>>> TBZBenchmark.cmpAndBranch2Tbz:iTLB-load-misses         thrpt    5        0.003 ±  0.009   #/op        0.003 ±   0.008
>>> TBZBenchmark.cmpAndBranch2Tbz:iTLB-loads               thrpt    5     1586.390 ±  2.612   #/op     1353.981 ±   3.469
>>> TBZBenchmark.cmpAndBranch2Tbz:instructions             thrpt    5     5791.824 ± 15.362   #/op     4055.443 ±  17.785
>>> TBZBenchmark.cmpAndBranch2Tbz:stalled-cycles-backend   thrpt    5        5.279 ±  1.968   #/op       20.459 ±   5.258
>>> TBZBenchmark.cmpAndBranch2Tbz:stalled-cycles-frontend  thrpt    5       66.808 ±  0.700   #/op       12.738 ±   1.040
>>>
>>> import org.openjdk.jmh.annotations.Benchmark;
>>>
>>> public class TBZBenchmark {
>>>      @Benchmark
>>>      public int cmpAndBranch2Tbz() {
>>>          int count = 0;
>>>          for (int value = 0; value < 1000; value++) {
>>>              if ((value & 32) == 32) {
>>>                  count--;
>>>              } else {
>>>                  count++;
>>>              }
>>>          }
>>>          return count;
>>>      }
>>> }
>>>
>
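
For readers of the archive, the bit check quoted above, "if ((var&16) == 16)", can be illustrated with a small sketch. It only shows why a compare against a single-bit mask is the same question as "is bit N set", which is the form a test-bit-and-branch instruction answers. It is not the C2 matching rule from the webrev; the class and method names below are hypothetical and chosen for illustration.

```java
public class SingleBitCheck {
    // True when the mask has exactly one bit set (16, 32, ...).
    // Only such masks reduce to a test of one bit.
    public static boolean isSingleBitMask(int mask) {
        return mask != 0 && (mask & (mask - 1)) == 0;
    }

    // Index of the only set bit: 4 for mask 16, 5 for mask 32.
    public static int bitIndex(int mask) {
        if (!isSingleBitMask(mask)) {
            throw new IllegalArgumentException("not a single-bit mask: " + mask);
        }
        return Integer.numberOfTrailingZeros(mask);
    }

    // The source-level form from the thread: (value & mask) == mask.
    public static boolean maskedCompare(int value, int mask) {
        return (value & mask) == mask;
    }

    // The equivalent "test one bit" form.
    public static boolean bitTest(int value, int bit) {
        return ((value >>> bit) & 1) != 0;
    }

    public static void main(String[] args) {
        int mask = 16;
        int bit = bitIndex(mask);
        // Check that both forms agree over a range of values, including negatives.
        for (int v = -1000; v <= 1000; v++) {
            if (maskedCompare(v, mask) != bitTest(v, bit)) {
                throw new AssertionError("forms disagree for value " + v);
            }
        }
        System.out.println("mask " + mask + " tests bit " + bit);
    }
}
```

The equivalence holds only for masks with exactly one bit set; a mask such as 48 needs two bits checked and does not fit this pattern.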