RFR: 8312213: Remove unnecessary TEST instructions on x86 when flags reg will already be set

Tobias Hotz duke at openjdk.org
Mon Jul 24 13:41:54 UTC 2023


On Fri, 26 May 2023 10:32:00 GMT, Tobias Hotz <duke at openjdk.org> wrote:

> This patch adds peephole rules to remove TEST instructions that operate on the result of, and appear right after, an AND, XOR or OR instruction.
> This pattern can emerge if the result of the AND is compared against two values, where one of the values is zero. The matcher does not have the capability to know that the instructions mentioned above also set the flag register to the same value.
> According to https://www.felixcloutier.com/x86/and, https://www.felixcloutier.com/x86/xor, https://www.felixcloutier.com/x86/or and https://www.felixcloutier.com/x86/test, the flags are set to the same values for TEST, AND, XOR and OR, so this should be safe.
> By adding peephole rules to remove the TEST instructions, the resulting assembly code can be shortened and a small speedup can be observed:
> Results on Intel Core i5-8250U CPU
> Before this patch:
> 
> Benchmark                                              Mode  Cnt    Score    Error  Units
> TestRemovalPeephole.benchmarkAndTestFusableInt         avgt    8  182.353 ±  1.751  ns/op
> TestRemovalPeephole.benchmarkAndTestFusableIntSingle   avgt    8    1.110 ±  0.002  ns/op
> TestRemovalPeephole.benchmarkAndTestFusableLong        avgt    8  212.836 ±  0.310  ns/op
> TestRemovalPeephole.benchmarkAndTestFusableLongSingle  avgt    8    2.072 ±  0.002  ns/op
> TestRemovalPeephole.benchmarkOrTestFusableInt          avgt    8   72.052 ±  0.215  ns/op
> TestRemovalPeephole.benchmarkOrTestFusableIntSingle    avgt    8    1.406 ±  0.002  ns/op
> TestRemovalPeephole.benchmarkOrTestFusableLong         avgt    8  113.396 ±  0.666  ns/op
> TestRemovalPeephole.benchmarkOrTestFusableLongSingle   avgt    8    1.183 ±  0.001  ns/op
> TestRemovalPeephole.benchmarkXorTestFusableInt         avgt    8   88.683 ±  2.034  ns/op
> TestRemovalPeephole.benchmarkXorTestFusableIntSingle   avgt    8    1.406 ±  0.002  ns/op
> TestRemovalPeephole.benchmarkXorTestFusableLong        avgt    8  113.271 ±  0.602  ns/op
> TestRemovalPeephole.benchmarkXorTestFusableLongSingle  avgt    8    1.183 ±  0.001  ns/op
> 
> After this patch:
> 
> Benchmark                                              Mode  Cnt    Score   Error  Units  Change
> TestRemovalPeephole.benchmarkAndTestFusableInt         avgt    8  141.615 ± 4.747  ns/op  ~29% faster
> TestRemovalPeephole.benchmarkAndTestFusableIntSingle   avgt    8    1.110 ± 0.002  ns/op  (unchanged)
> TestRemovalPeephole.benchmarkAndTestFusableLong        avgt    8  213.249 ± 1.094  ns/op  (unchanged)
> TestRemovalPeephole.benchmarkAndTestFusableLongSingle  avgt    8    2.074 ± 0.011...

Thanks for the feedback.
I will definitely look into other instructions, such as OR, where this might be feasible.
Regarding multiple peepmatch blocks: it looks like this is not possible with the current system, but I've just updated how the peepmatch block is interpreted in the test_may_remove peepprocedure. This way we can keep the rules concise.

Thanks for the feedback!
I double-checked, and it seems that flags other than ZF are used as well. I've added some new cases to cover XOR and OR in addition to the existing AND case, as they all set the same flags as TEST does.
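To make this concrete, here is a small, purely illustrative Java sketch (not code from the PR) of the kind of pattern where the AND result is immediately compared against zero, so the TEST that would otherwise be emitted adds no information:

    class TestFusionExamples {
        // Illustrative only: the masked value is compared against zero, so a
        // separate TEST after the AND would be redundant, since AND already
        // sets ZF and SF the same way TEST would.
        static int zeroFlagCase(int value, int mask) {
            int masked = value & mask;
            // Without the peephole this can compile to: andl; testl; je ...
            // With it, the testl can be dropped:        andl; je ...
            return masked == 0 ? -1 : masked;
        }

        // A signed comparison against zero relies on SF, which AND also sets,
        // so the removal is not limited to the ZF-only case.
        static int signFlagCase(int value, int mask) {
            int masked = value & mask;
            return masked < 0 ? -masked : masked;
        }
    }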
@jaskarth Yeah, I still need a JBS entry. If you can create one, that would be nice; otherwise I'm going to use the web bug form again.

I benchmarked the new peephole rules and got the following results:
OLD:

Benchmark                                        Mode  Cnt   Score   Error  Units
TestRemovalPeephole.benchmarkAndTestFusableInt   avgt    8  71,351 ± 0,176  ns/op
TestRemovalPeephole.benchmarkAndTestFusableLong  avgt    8  66,763 ± 0,104  ns/op
TestRemovalPeephole.benchmarkOrTestFusableInt    avgt    8  57,555 ± 0,341  ns/op
TestRemovalPeephole.benchmarkOrTestFusableLong   avgt    8  72,158 ± 0,337  ns/op
TestRemovalPeephole.benchmarkXorTestFusableInt   avgt    8  61,551 ± 0,346  ns/op
TestRemovalPeephole.benchmarkXorTestFusableLong  avgt    8  70,554 ± 0,208  ns/op

NEW:

Benchmark                                        Mode  Cnt   Score   Error  Units  Change
TestRemovalPeephole.benchmarkAndTestFusableInt   avgt    8  65,061 ± 0,096  ns/op  (~10% faster)
TestRemovalPeephole.benchmarkAndTestFusableLong  avgt    8  66,855 ± 0,119  ns/op  (unchanged)
TestRemovalPeephole.benchmarkOrTestFusableInt    avgt    8  73,707 ± 0,385  ns/op  (~20% slower)
TestRemovalPeephole.benchmarkOrTestFusableLong   avgt    8  71,062 ± 1,190  ns/op  (unchanged)
TestRemovalPeephole.benchmarkXorTestFusableInt   avgt    8  56,420 ± 1,441  ns/op  (~10% faster)
TestRemovalPeephole.benchmarkXorTestFusableLong  avgt    8  69,878 ± 1,318  ns/op  (unchanged)


I am currently not quite sure why there is such a large regression in benchmarkOrTestFusableInt, but it seems to be reproducible, at least on my machine. I plan to test on another machine soon to see whether this behaviour extends to other processors as well. The resulting assembly looks sane, though, and the XOR case also behaves as expected.
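For reference, here is a minimal JMH-style sketch of the general shape of such a benchmark; it is illustrative only and not the actual TestRemovalPeephole code from the PR:

    import org.openjdk.jmh.annotations.*;
    import org.openjdk.jmh.infra.Blackhole;

    import java.util.concurrent.TimeUnit;

    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    @State(Scope.Thread)
    public class AndTestFusionSketch {

        private int[] values;

        @Setup
        public void setup() {
            values = new int[1024];
            for (int i = 0; i < values.length; i++) {
                values[i] = i * 31;
            }
        }

        @Benchmark
        public void andThenCompareAgainstZero(Blackhole bh) {
            // Each element is masked and the result is branched on against zero,
            // which is the AND-followed-by-TEST shape the peephole targets.
            for (int value : values) {
                int masked = value & 0xF0;
                bh.consume(masked == 0 ? -1 : masked);
            }
        }
    }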

I've added a few more benchmarks that are basically the same as the old ones and reran the benchmarks on my i5-13600K and my i5-8250U. Here are the results:
Results on Intel Core i5-8250U CPU
Before this patch:

Benchmark                                              Mode  Cnt    Score    Error  Units
TestRemovalPeephole.benchmarkAndTestFusableInt         avgt    8  182.353 ±  1.751  ns/op
TestRemovalPeephole.benchmarkAndTestFusableIntSingle   avgt    8    1.110 ±  0.002  ns/op
TestRemovalPeephole.benchmarkAndTestFusableLong        avgt    8  212.836 ±  0.310  ns/op
TestRemovalPeephole.benchmarkAndTestFusableLongSingle  avgt    8    2.072 ±  0.002  ns/op
TestRemovalPeephole.benchmarkOrTestFusableInt          avgt    8   72.052 ±  0.215  ns/op
TestRemovalPeephole.benchmarkOrTestFusableIntSingle    avgt    8    1.406 ±  0.002  ns/op
TestRemovalPeephole.benchmarkOrTestFusableLong         avgt    8  113.396 ±  0.666  ns/op
TestRemovalPeephole.benchmarkOrTestFusableLongSingle   avgt    8    1.183 ±  0.001  ns/op
TestRemovalPeephole.benchmarkXorTestFusableInt         avgt    8   88.683 ±  2.034  ns/op
TestRemovalPeephole.benchmarkXorTestFusableIntSingle   avgt    8    1.406 ±  0.002  ns/op
TestRemovalPeephole.benchmarkXorTestFusableLong        avgt    8  113.271 ±  0.602  ns/op
TestRemovalPeephole.benchmarkXorTestFusableLongSingle  avgt    8    1.183 ±  0.001  ns/op

After this patch:

Benchmark                                              Mode  Cnt    Score   Error  Units  Change
TestRemovalPeephole.benchmarkAndTestFusableInt         avgt    8  141.615 ± 4.747  ns/op  ~29% faster
TestRemovalPeephole.benchmarkAndTestFusableIntSingle   avgt    8    1.110 ± 0.002  ns/op  (unchanged)
TestRemovalPeephole.benchmarkAndTestFusableLong        avgt    8  213.249 ± 1.094  ns/op  (unchanged)
TestRemovalPeephole.benchmarkAndTestFusableLongSingle  avgt    8    2.074 ± 0.011  ns/op  (unchanged)
TestRemovalPeephole.benchmarkOrTestFusableInt          avgt    8   91.500 ± 0.142  ns/op  ~27% slower
TestRemovalPeephole.benchmarkOrTestFusableIntSingle    avgt    8    0.814 ± 0.002  ns/op  ~45% faster
TestRemovalPeephole.benchmarkOrTestFusableLong         avgt    8  113.691 ± 0.839  ns/op  (unchanged)
TestRemovalPeephole.benchmarkOrTestFusableLongSingle   avgt    8    1.185 ± 0.001  ns/op  (unchanged)
TestRemovalPeephole.benchmarkXorTestFusableInt         avgt    8   84.953 ± 0.766  ns/op  ~4% faster
TestRemovalPeephole.benchmarkXorTestFusableIntSingle   avgt    8    0.814 ± 0.001  ns/op  ~72% faster
TestRemovalPeephole.benchmarkXorTestFusableLong        avgt    8  113.158 ± 0.629  ns/op  (unchanged)
TestRemovalPeephole.benchmarkXorTestFusableLongSingle  avgt    8    1.184 ± 0.002  ns/op  (unchanged)


Results from my i5-13600K:
Before this patch:

Benchmark                                              Mode  Cnt   Score    Error  Units
TestRemovalPeephole.benchmarkAndTestFusableInt         avgt    8  71,368 ±  0,857  ns/op
TestRemovalPeephole.benchmarkAndTestFusableIntSingle   avgt    8   0,430 ±  0,001  ns/op
TestRemovalPeephole.benchmarkAndTestFusableLong        avgt    8  66,734 ±  0,128  ns/op
TestRemovalPeephole.benchmarkAndTestFusableLongSingle  avgt    8   0,489 ±  0,001  ns/op
TestRemovalPeephole.benchmarkOrTestFusableInt          avgt    8  57,290 ±  0,200  ns/op
TestRemovalPeephole.benchmarkOrTestFusableIntSingle    avgt    8   0,390 ±  0,001  ns/op
TestRemovalPeephole.benchmarkOrTestFusableLong         avgt    8  70,606 ±  1,595  ns/op
TestRemovalPeephole.benchmarkOrTestFusableLongSingle   avgt    8   0,390 ±  0,001  ns/op
TestRemovalPeephole.benchmarkXorTestFusableInt         avgt    8  62,514 ±  0,735  ns/op
TestRemovalPeephole.benchmarkXorTestFusableIntSingle   avgt    8   0,391 ±  0,001  ns/op
TestRemovalPeephole.benchmarkXorTestFusableLong        avgt    8  70,886 ±  0,572  ns/op
TestRemovalPeephole.benchmarkXorTestFusableLongSingle  avgt    8   0,391 ±  0,001  ns/op

After this patch:

Benchmark                                              Mode  Cnt   Score   Error  Units  Change
TestRemovalPeephole.benchmarkAndTestFusableInt         avgt    8  65,083 ± 0,246  ns/op  ~10% faster
TestRemovalPeephole.benchmarkAndTestFusableIntSingle   avgt    8   0,429 ± 0,001  ns/op  (unchanged)
TestRemovalPeephole.benchmarkAndTestFusableLong        avgt    8  66,533 ± 0,096  ns/op  (unchanged)
TestRemovalPeephole.benchmarkAndTestFusableLongSingle  avgt    8   0,488 ± 0,001  ns/op  (unchanged)
TestRemovalPeephole.benchmarkOrTestFusableInt          avgt    8  74,042 ± 0,412  ns/op  ~10% slower
TestRemovalPeephole.benchmarkOrTestFusableIntSingle    avgt    8   0,391 ± 0,002  ns/op  (unchanged)
TestRemovalPeephole.benchmarkOrTestFusableLong         avgt    8  70,850 ± 0,342  ns/op  (unchanged)
TestRemovalPeephole.benchmarkOrTestFusableLongSingle   avgt    8   0,390 ± 0,002  ns/op  (unchanged)
TestRemovalPeephole.benchmarkXorTestFusableInt         avgt    8  56,626 ± 2,274  ns/op  ~10% faster
TestRemovalPeephole.benchmarkXorTestFusableIntSingle   avgt    8   0,392 ± 0,002  ns/op  (unchanged)
TestRemovalPeephole.benchmarkXorTestFusableLong        avgt    8  69,942 ± 1,380  ns/op  (unchanged)
TestRemovalPeephole.benchmarkXorTestFusableLongSingle  avgt    8   0,391 ± 0,001  ns/op  (unchanged)


As you can see, there are still some cases where the tests run slower with this PR, but especially on older architectures, the speed gains far outweigh the cases where some performance is lost. Also, the emitted instruction sequence is still two bytes shorter per removed TEST instruction.

Thanks for the feedback!
I agree that this is still far from a perfect solution and that tracking the required flags would be ideal, but there are many rules that use the flags register, and all of these would need to be tracked. If you are interested in that, feel free to take this PR as a starting point, but I'm currently not interested in pursuing that path. Otherwise we can IMO still keep this PR in its current form as a first step towards removing redundant TEST instructions.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/14172#issuecomment-1574151203
PR Comment: https://git.openjdk.org/jdk/pull/14172#issuecomment-1608411081
PR Comment: https://git.openjdk.org/jdk/pull/14172#issuecomment-1630763334
PR Comment: https://git.openjdk.org/jdk/pull/14172#issuecomment-1635551701

