RFR: 8312213: Remove unnecessary TEST instructions on x86 when flags reg will already be set

Wed Aug 2 15:14:51 UTC 2023

On Fri, 26 May 2023 10:32:00 GMT, Tobias Hotz <duke at openjdk.org> wrote:

> This patch adds peephole rules to remove TEST instructions that operate on the result and right after a AND, XOR or OR instruction.
> This pattern can emerge if the result of the and is compared against two values where one of the values is zero. The matcher does not have the capability to know that the instruction mentioned above also set the flag register to the same value.
> According to https://www.felixcloutier.com/x86/and, https://www.felixcloutier.com/x86/xor, https://www.felixcloutier.com/x86/or and https://www.felixcloutier.com/x86/test the flags are set to same values for TEST, AND, XOR and OR, so this should be safe.
> By adding peephole rules to remove the TEST instructions, the resulting assembly code can be shortend and a small speedup can be observed:
> Results on Intel Core i5-8250U CPU
> Before this patch:
> 
> Benchmark                                              Mode  Cnt    Score    Error  Units
> TestRemovalPeephole.benchmarkAndTestFusableInt         avgt    8  182.353 ±  1.751  ns/op
> TestRemovalPeephole.benchmarkAndTestFusableIntSingle   avgt    8    1.110 ±  0.002  ns/op
> TestRemovalPeephole.benchmarkAndTestFusableLong        avgt    8  212.836 ±  0.310  ns/op
> TestRemovalPeephole.benchmarkAndTestFusableLongSingle  avgt    8    2.072 ±  0.002  ns/op
> TestRemovalPeephole.benchmarkOrTestFusableInt          avgt    8   72.052 ±  0.215  ns/op
> TestRemovalPeephole.benchmarkOrTestFusableIntSingle    avgt    8    1.406 ±  0.002  ns/op
> TestRemovalPeephole.benchmarkOrTestFusableLong         avgt    8  113.396 ±  0.666  ns/op
> TestRemovalPeephole.benchmarkOrTestFusableLongSingle   avgt    8    1.183 ±  0.001  ns/op
> TestRemovalPeephole.benchmarkXorTestFusableInt         avgt    8   88.683 ±  2.034  ns/op
> TestRemovalPeephole.benchmarkXorTestFusableIntSingle   avgt    8    1.406 ±  0.002  ns/op
> TestRemovalPeephole.benchmarkXorTestFusableLong        avgt    8  113.271 ±  0.602  ns/op
> TestRemovalPeephole.benchmarkXorTestFusableLongSingle  avgt    8    1.183 ±  0.001  ns/op
> 
> After this patch:
> 
> Benchmark                                              Mode  Cnt    Score   Error  Units  Change
> TestRemovalPeephole.benchmarkAndTestFusableInt         avgt    8  141.615 ± 4.747  ns/op  ~29% faster
> TestRemovalPeephole.benchmarkAndTestFusableIntSingle   avgt    8    1.110 ± 0.002  ns/op  (unchanged)
> TestRemovalPeephole.benchmarkAndTestFusableLong        avgt    8  213.249 ± 1.094  ns/op  (unchanged)
> TestRemovalPeephole.benchmarkAndTestFusableLongSingle  avgt    8    2.074 ± 0.011...

First of all, thanks for the feedback.
It already handles the case when flags are set depending on the operation. Right now, there are 3 cases that can be expressed:

- The flag is set based on the result of the operation. That means OF is set if an overflow occured during the Operation, ZF is set if the result is zero, etc. This is expressed through the `KILL CR` effect in combination with the flag `Flag_sets_xxx_flag`
- The flag is zeroed by the instruction. An example for this is the OF and CF flags in case of `and`/`test`. This is expressed through the `KILL CR` effect in combination with the flag `Flag_clears_xxx_flag`
- The state of the flag in unknown/undefined. Examples here are the PF flag in case of the `andn` instruction. This is also the case for all flags for the `bsr` instruction. This is expressed through the `KILL CR` effect and no matching Flag that specifies sets or clears.

I mainly used https://www.felixcloutier.com/x86/index.html as a reference (which is based on the official intel developer manuel) to specify all these flags.
Adding support for additional states such as "Sets flag based on source operand" would take up even more space in the flags int, and as there is only room for 2 additional flags, this would not work, as flags is a juint. Also, I think the cases where we would be able to remove a TEST instruction based on that would be very rare due to 1) Not many instructions that are commonly used settings flags based on the input and 2) The instructions that set flags due to the input not setting many flags, only leaving simple zero checks as valid.

Regarding IR tests: Yeah it seems like that would make sense. I will look into that.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/14172#issuecomment-1662391908