RFR: 8312213: Remove unnecessary TEST instructions on x86 when flags reg will already be set

Mon Jul 24 13:41:53 UTC 2023

On Fri, 26 May 2023 10:32:00 GMT, Tobias Hotz <duke at openjdk.org> wrote:

> This patch adds peephole rules to remove TEST instructions that operate on the result and right after a AND, XOR or OR instruction.
> This pattern can emerge if the result of the and is compared against two values where one of the values is zero. The matcher does not have the capability to know that the instruction mentioned above also set the flag register to the same value.
> According to https://www.felixcloutier.com/x86/and, https://www.felixcloutier.com/x86/xor, https://www.felixcloutier.com/x86/or and https://www.felixcloutier.com/x86/test the flags are set to same values for TEST, AND, XOR and OR, so this should be safe.
> By adding peephole rules to remove the TEST instructions, the resulting assembly code can be shortend and a small speedup can be observed:
> Results on Intel Core i5-8250U CPU
> Before this patch:
> 
> Benchmark                                              Mode  Cnt    Score    Error  Units
> TestRemovalPeephole.benchmarkAndTestFusableInt         avgt    8  182.353 ±  1.751  ns/op
> TestRemovalPeephole.benchmarkAndTestFusableIntSingle   avgt    8    1.110 ±  0.002  ns/op
> TestRemovalPeephole.benchmarkAndTestFusableLong        avgt    8  212.836 ±  0.310  ns/op
> TestRemovalPeephole.benchmarkAndTestFusableLongSingle  avgt    8    2.072 ±  0.002  ns/op
> TestRemovalPeephole.benchmarkOrTestFusableInt          avgt    8   72.052 ±  0.215  ns/op
> TestRemovalPeephole.benchmarkOrTestFusableIntSingle    avgt    8    1.406 ±  0.002  ns/op
> TestRemovalPeephole.benchmarkOrTestFusableLong         avgt    8  113.396 ±  0.666  ns/op
> TestRemovalPeephole.benchmarkOrTestFusableLongSingle   avgt    8    1.183 ±  0.001  ns/op
> TestRemovalPeephole.benchmarkXorTestFusableInt         avgt    8   88.683 ±  2.034  ns/op
> TestRemovalPeephole.benchmarkXorTestFusableIntSingle   avgt    8    1.406 ±  0.002  ns/op
> TestRemovalPeephole.benchmarkXorTestFusableLong        avgt    8  113.271 ±  0.602  ns/op
> TestRemovalPeephole.benchmarkXorTestFusableLongSingle  avgt    8    1.183 ±  0.001  ns/op
> 
> After this patch:
> 
> Benchmark                                              Mode  Cnt    Score   Error  Units  Change
> TestRemovalPeephole.benchmarkAndTestFusableInt         avgt    8  141.615 ± 4.747  ns/op  ~29% faster
> TestRemovalPeephole.benchmarkAndTestFusableIntSingle   avgt    8    1.110 ± 0.002  ns/op  (unchanged)
> TestRemovalPeephole.benchmarkAndTestFusableLong        avgt    8  213.249 ± 1.094  ns/op  (unchanged)
> TestRemovalPeephole.benchmarkAndTestFusableLongSingle  avgt    8    2.074 ± 0.011...

> Also, can this be also applied to or, xor, not, andn, popcnt, etc? I think it can also be applied to add and sub if the output of the test only read ZF.

I agree there's potentially more that can be done here. I think one issue is that we don't know which of the flags are actually used by the instruction downstream of the `test` (e.g. `jl`, `jg`, whatever). So, I think we should match the peephole rule on the downstream instructions instead, basically any instruction that accepts a `cmpOp` operator (or a variant that).

We can track per instruction/rule which flags they set (for all instructions that `KILL cr`), and check whether the flags required by the downstream instruction match the flags set by the instruction feeding into the `test`.

src/hotspot/cpu/x86/x86_64.ad line 13796:

> 13794:   ins_pipe(ialu_reg);
> 13795: %}
> 13796: 

In `test_may_remove_5` the `new_root` function points to the constructor of one of these instructions (what's passed to `peepreplace`). Since the function pointer is unused, these are really not needed. (for `peepreplace` you can just pass `testI_reg` or `testL_reg`, it doesn't really matter).

-------------

PR Review: https://git.openjdk.org/jdk/pull/14172#pullrequestreview-1528697844
PR Review Comment: https://git.openjdk.org/jdk/pull/14172#discussion_r1262715419