RFR: 8051725: Questionable if-conversion involving SETNE [v2]
Jasmine Karthikeyan
jkarthikeyan at openjdk.org
Mon Apr 17 04:37:40 UTC 2023
On Mon, 17 Apr 2023 04:32:29 GMT, Jasmine Karthikeyan <jkarthikeyan at openjdk.org> wrote:
>> Hi, I've created optimizations for the x86 lowering of `Conv2B` nodes, when followed immediately by an xor of 1. This pattern is fairly common, and can arise from both [cmov idealization](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/movenode.cpp#L241) and [diamond-phi optimization](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/cfgnode.cpp#L1571). The optimization here is using the `sete` instruction instead of always using `setne` and flipping the bit with xor afterwards. According to the Intel optimization guide (pages 3-26 and 3-27), this sequence is preferred over `cmp $0, %src` as it prevents the need to encode the constant in the assembly sequence. A similar rule exists in the PPC backend, here: https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/ppc/ppc.ad#L10462. I've attached some performance testing but I think the real world improvements will be less significant- the motivation is primarily to decrease the amount of instruc
tions that are generated, as that can help in cases where applications are I-Cache bound.
>>
>>
>> Baseline Patch Improvement
>> Benchmark Mode Cnt Score Error Units Score Error Units
>> Conv2BRules.testEquals0 avgt 10 47.566 ± 0.346 ns/op / 37.904 ± 1.856 ns/op + 22.6%
>> Conv2BRules.testNotEquals0 avgt 10 37.167 ± 0.211 ns/op / 37.352 ± 1.529 ns/op (unchanged)
>> Conv2BRules.testEquals1 avgt 10 35.059 ± 0.280 ns/op / 34.847 ± 0.160 ns/op (unchanged)
>> Conv2BRules.testEqualsNull avgt 10 56.768 ± 2.600 ns/op / 46.916 ± 0.308 ns/op + 19.0%
>> Conv2BRules.testNotEqualsNull avgt 10 47.447 ± 1.193 ns/op / 46.974 ± 0.218 ns/op (unchanged)
>>
>>
>> This change also cleans up some code relating to `Assembler::set_byte_if_not_zero`, as that function duplicates behavior with `Assembler::setne`. The 32-bit only version of that method is never called as the only other usage is in the C1 LIR assembler, which is also guarded behind an 64-bit check so I opted to remove it entirely and replace usages with `Assembler::setne`. Reviews would be greatly appreciated!
>>
>> Testing: tier1-2 on linux x64, GHA
>
> Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision:
>
> Re-work transform to happen in macro expansion
I've reworked the transformation to happen in macro expansion, and it seems the performance is actually *better* now!
Baseline Patch Improvement
Benchmark Mode Cnt Score Error Units Score Error Units
Conv2BRules.testEquals0 avgt 10 47.566 ± 0.346 ns/op / 34.130 ± 0.177 ns/op + 28.2%
Conv2BRules.testNotEquals0 avgt 10 37.167 ± 0.211 ns/op / 34.185 ± 0.258 ns/op + 8.0%
Conv2BRules.testEquals1 avgt 10 35.059 ± 0.280 ns/op / 34.847 ± 0.160 ns/op (unchanged)
Conv2BRules.testEqualsNull avgt 10 56.768 ± 2.600 ns/op / 34.330 ± 0.625 ns/op + 39.5%
Conv2BRules.testNotEqualsNull avgt 10 47.447 ± 1.193 ns/op / 34.142 ± 0.303 ns/op + 28.0%
The comparison is now basically the same as doing it with `cmp`, which is nice. It seems the reason is because the assembly now zeroes the register, tests against zero, and then does the `setcc`, instead of comparsion, `setcc`, then `movzbl`. So, it seems that doing the transform in macro expansion is indeed a better choice for x86, as well as reducing the overhead in the matcher. However, I'm not so sure if the benefit will be the same across other platforms as it seems like the different architectures implement `Conv2B` using different strategies. Do you have any thoughts on this approach @merykitty?
-------------
PR Comment: https://git.openjdk.org/jdk/pull/13345#issuecomment-1510684128
More information about the hotspot-compiler-dev
mailing list