AARCH64 optimization: using TBZ instruction for bit check
eric.caspole at oracle.com
Mon Jun 15 20:53:33 UTC 2020
Thanks, the JMH looks good.
Eric
On 6/13/20 2:24 PM, Boris Ulasevich wrote:
> Hi Eric,
>
> Ok. Here is the webrev with JMH:
> http://cr.openjdk.java.net/~bulasevich/8247408/webrev.01
>
> Thank you,
> Boris
>
> On 12.06.2020 21:24, eric.caspole at oracle.com wrote:
>> Hi Boris,
>> Could you add the JMH to your webrev under
>> test/micro/org/openjdk/bench/?
>> Thanks,
>> Eric
>>
>>
>> On 6/12/20 2:10 PM, Boris Ulasevich wrote:
>>> Hi all,
>>>
>>> Please review the new AARCH64 instruction selection rules.
>>> The change applies the TBZ instruction to bit checks such as
>>> "if ((var&16) == 16)".
>>> This gives a 17% performance improvement on the benchmark and 5% on a
>>> real application.
>>>
>>> http://bugs.openjdk.java.net/browse/JDK-8247408
>>> http://cr.openjdk.java.net/~bulasevich/8247408/webrev.00
>>>
>>> - I excluded the far branch test from the full change because it takes
>>> a long time to run, and I'm not sure C2 will keep its current behaviour:
>>> http://cr.openjdk.java.net/~bulasevich/8247408/webrev.00.plus
>>>
>>> The change was tested on jtreg in fastdebug mode: no regressions.
>>>
>>> thanks,
>>> Boris
>>>
>>> ========================================================================================
>>>
>>> Benchmark                                              Mode   Cnt        Score     Error  Units        Score     Error
>>> TBZBenchmark.cmpAndBranch2Tbz                          thrpt   25  1329060.879 ±  42.780  ops/s  1504990.708 ± 158.096
>>> TBZBenchmark.cmpAndBranch2Tbz:CPI                      thrpt    5        0.325 ±   0.001  #/op         0.410 ±   0.001
>>> TBZBenchmark.cmpAndBranch2Tbz:L1-dcache-load-misses    thrpt    5        0.019 ±   0.031  #/op         0.018 ±   0.025
>>> TBZBenchmark.cmpAndBranch2Tbz:L1-dcache-loads          thrpt    5       16.811 ±   0.791  #/op        16.809 ±   0.914
>>> TBZBenchmark.cmpAndBranch2Tbz:L1-dcache-store-misses   thrpt    5        0.016 ±   0.017  #/op         0.014 ±   0.022
>>> TBZBenchmark.cmpAndBranch2Tbz:L1-dcache-stores         thrpt    5       16.704 ±   0.634  #/op        16.771 ±   0.539
>>> TBZBenchmark.cmpAndBranch2Tbz:L1-icache-load-misses    thrpt    5        0.017 ±   0.027  #/op         0.016 ±   0.023
>>> TBZBenchmark.cmpAndBranch2Tbz:L1-icache-loads          thrpt    5     1811.848 ±   3.552  #/op      1148.737 ±   2.993
>>> TBZBenchmark.cmpAndBranch2Tbz:branch-misses            thrpt    5        1.013 ±   0.009  #/op         1.011 ±   0.018
>>> TBZBenchmark.cmpAndBranch2Tbz:cycles                   thrpt    5     1882.193 ±   3.799  #/op      1662.994 ±   5.935
>>> TBZBenchmark.cmpAndBranch2Tbz:dTLB-load-misses         thrpt    5        0.004 ±   0.008  #/op         0.005 ±   0.016
>>> TBZBenchmark.cmpAndBranch2Tbz:dTLB-loads               thrpt    5       16.687 ±   0.732  #/op        16.669 ±   0.958
>>> TBZBenchmark.cmpAndBranch2Tbz:iTLB-load-misses         thrpt    5        0.003 ±   0.009  #/op         0.003 ±   0.008
>>> TBZBenchmark.cmpAndBranch2Tbz:iTLB-loads               thrpt    5     1586.390 ±   2.612  #/op      1353.981 ±   3.469
>>> TBZBenchmark.cmpAndBranch2Tbz:instructions             thrpt    5     5791.824 ±  15.362  #/op      4055.443 ±  17.785
>>> TBZBenchmark.cmpAndBranch2Tbz:stalled-cycles-backend   thrpt    5        5.279 ±   1.968  #/op        20.459 ±   5.258
>>> TBZBenchmark.cmpAndBranch2Tbz:stalled-cycles-frontend  thrpt    5       66.808 ±   0.700  #/op        12.738 ±   1.040
>>>
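>>> import org.openjdk.jmh.annotations.Benchmark;
>>>
>>> public class TBZBenchmark {
>>>     @Benchmark
>>>     public int cmpAndBranch2Tbz() {
>>>         int count = 0;
>>>         // Branch on bit 5 of the loop counter: the pattern the new rules target.
>>>         for (int value = 0; value < 1000; value++) {
>>>             if ((value & 32) == 32) {
>>>                 count--;
>>>             } else {
>>>                 count++;
>>>             }
>>>         }
>>>         return count;
>>>     }
>>> }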
>>>
>
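For readers following the thread: the check "(var & mask) == mask" with a single-bit mask is the same as testing one bit of var, which is what a test-bit-and-branch instruction such as TBZ/TBNZ does in one step. The following is a minimal source-level sketch of that equivalence only. It does not show what C2 actually emits, and the class and method names (SingleBitCheck, maskCompare, bitTest, benchmarkLoop) are hypothetical, not taken from the webrev.

```java
public class SingleBitCheck {
    // The source-level pattern quoted in the thread: (var & mask) == mask.
    static boolean maskCompare(int value, int mask) {
        return (value & mask) == mask;
    }

    // Equivalent test when mask has exactly one bit set (assumption):
    // find the bit index and test that single bit.
    static boolean bitTest(int value, int mask) {
        int bit = Integer.numberOfTrailingZeros(mask);
        return ((value >>> bit) & 1) != 0;
    }

    // Same loop body as the benchmark in the thread, without the JMH annotation.
    static int benchmarkLoop() {
        int count = 0;
        for (int value = 0; value < 1000; value++) {
            if (bitTest(value, 32)) {
                count--;
            } else {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int mask = 32;
        // Check that both forms agree on every value the benchmark uses.
        for (int value = 0; value < 1000; value++) {
            if (maskCompare(value, mask) != bitTest(value, mask)) {
                throw new AssertionError("mismatch at " + value);
            }
        }
        System.out.println("count = " + benchmarkLoop());
    }
}
```

The equivalence holds only for masks with exactly one bit set; for a multi-bit mask the comparison requires all masked bits to be set, and a single bit test no longer applies.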