RFR: 8051725: Questionable if-conversion involving SETNE [v2]

Mon Apr 17 04:37:40 UTC 2023

On Mon, 17 Apr 2023 04:32:29 GMT, Jasmine Karthikeyan <jkarthikeyan at openjdk.org> wrote:

>> Hi, I've created optimizations for the x86 lowering of `Conv2B` nodes, when followed immediately by an xor of 1. This pattern is fairly common, and can arise from both [cmov idealization](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/movenode.cpp#L241) and [diamond-phi optimization](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/cfgnode.cpp#L1571). The optimization here is using the `sete` instruction instead of always using `setne` and flipping the bit with xor afterwards. According to the Intel optimization guide (pages 3-26 and 3-27), this sequence is preferred over `cmp $0, %src` as it prevents the need to encode the constant in the assembly sequence. A similar rule exists in the PPC backend, here: https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/ppc/ppc.ad#L10462. I've attached some performance testing but I think the real world improvements will be less significant- the motivation is primarily to decrease the amount of instruc
 tions that are generated, as that can help in cases where applications are I-Cache bound.
>> 
>> 
>>                                     Baseline                           Patch              Improvement
>> Benchmark                     Mode  Cnt  Score    Error Units     Score    Error  Units
>> Conv2BRules.testEquals0        avgt   10  47.566 ± 0.346 ns/op  /  37.904 ± 1.856  ns/op  + 22.6%
>> Conv2BRules.testNotEquals0     avgt   10  37.167 ± 0.211 ns/op  /  37.352 ± 1.529  ns/op  (unchanged)
>> Conv2BRules.testEquals1        avgt   10  35.059 ± 0.280 ns/op  /  34.847 ± 0.160  ns/op  (unchanged)
>> Conv2BRules.testEqualsNull     avgt   10  56.768 ± 2.600 ns/op  /  46.916 ± 0.308  ns/op  + 19.0%
>> Conv2BRules.testNotEqualsNull  avgt   10  47.447 ± 1.193 ns/op  /  46.974 ± 0.218  ns/op  (unchanged)
>> 
>> 
>> This change also cleans up some code relating to `Assembler::set_byte_if_not_zero`, as that function duplicates behavior with `Assembler::setne`. The 32-bit only version of that method is never called as the only other usage is in the C1 LIR assembler, which is also guarded behind an 64-bit check so I opted to remove it entirely and replace usages with `Assembler::setne`. Reviews would be greatly appreciated!
>> 
>> Testing: tier1-2 on linux x64, GHA
>
> Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Re-work transform to happen in macro expansion

I've reworked the transformation to happen in macro expansion, and it seems the performance is actually *better* now!

                                    Baseline                           Patch              Improvement
Benchmark                      Mode  Cnt  Score    Error Units     Score    Error  Units
Conv2BRules.testEquals0        avgt   10  47.566 ± 0.346 ns/op  /  34.130 ± 0.177  ns/op  + 28.2%
Conv2BRules.testNotEquals0     avgt   10  37.167 ± 0.211 ns/op  /  34.185 ± 0.258  ns/op  + 8.0%
Conv2BRules.testEquals1        avgt   10  35.059 ± 0.280 ns/op  /  34.847 ± 0.160  ns/op  (unchanged)
Conv2BRules.testEqualsNull     avgt   10  56.768 ± 2.600 ns/op  /  34.330 ± 0.625  ns/op  + 39.5%
Conv2BRules.testNotEqualsNull  avgt   10  47.447 ± 1.193 ns/op  /  34.142 ± 0.303  ns/op  + 28.0%

The comparison is now basically the same as doing it with `cmp`, which is nice. It seems the reason is because the assembly now zeroes the register, tests against zero, and then does the `setcc`, instead of comparsion, `setcc`, then `movzbl`. So, it seems that doing the transform in macro expansion is indeed a better choice for x86, as well as reducing the overhead in the matcher. However, I'm not so sure if the benefit will be the same across other platforms as it seems like the different architectures implement `Conv2B` using different strategies. Do you have any thoughts on this approach @merykitty?

-------------

PR Comment: https://git.openjdk.org/jdk/pull/13345#issuecomment-1510684128