RFR: 8312213: Remove unnecessary TEST instructions on x86 when flags reg will already be set

Mon Jul 24 13:41:51 UTC 2023

This patch adds peephole rules to remove TEST instructions that operate on the result and right after a AND, XOR or OR instruction.
This pattern can emerge if the result of the and is compared against two values where one of the values is zero. The matcher does not have the capability to know that the instruction mentioned above also set the flag register to the same value.
According to https://www.felixcloutier.com/x86/and, https://www.felixcloutier.com/x86/xor, https://www.felixcloutier.com/x86/or and https://www.felixcloutier.com/x86/test the flags are set to same values for TEST, AND, XOR and OR, so this should be safe.
By adding peephole rules to remove the TEST instructions, the resulting assembly code can be shortend and a small speedup can be observed:
Results on Intel Core i5-8250U CPU
Before this patch:

Benchmark                                              Mode  Cnt    Score    Error  Units
TestRemovalPeephole.benchmarkAndTestFusableInt         avgt    8  182.353 ±  1.751  ns/op
TestRemovalPeephole.benchmarkAndTestFusableIntSingle   avgt    8    1.110 ±  0.002  ns/op
TestRemovalPeephole.benchmarkAndTestFusableLong        avgt    8  212.836 ±  0.310  ns/op
TestRemovalPeephole.benchmarkAndTestFusableLongSingle  avgt    8    2.072 ±  0.002  ns/op
TestRemovalPeephole.benchmarkOrTestFusableInt          avgt    8   72.052 ±  0.215  ns/op
TestRemovalPeephole.benchmarkOrTestFusableIntSingle    avgt    8    1.406 ±  0.002  ns/op
TestRemovalPeephole.benchmarkOrTestFusableLong         avgt    8  113.396 ±  0.666  ns/op
TestRemovalPeephole.benchmarkOrTestFusableLongSingle   avgt    8    1.183 ±  0.001  ns/op
TestRemovalPeephole.benchmarkXorTestFusableInt         avgt    8   88.683 ±  2.034  ns/op
TestRemovalPeephole.benchmarkXorTestFusableIntSingle   avgt    8    1.406 ±  0.002  ns/op
TestRemovalPeephole.benchmarkXorTestFusableLong        avgt    8  113.271 ±  0.602  ns/op
TestRemovalPeephole.benchmarkXorTestFusableLongSingle  avgt    8    1.183 ±  0.001  ns/op

After this patch:

Benchmark                                              Mode  Cnt    Score   Error  Units  Change
TestRemovalPeephole.benchmarkAndTestFusableInt         avgt    8  141.615 ± 4.747  ns/op  ~29% faster
TestRemovalPeephole.benchmarkAndTestFusableIntSingle   avgt    8    1.110 ± 0.002  ns/op  (unchanged)
TestRemovalPeephole.benchmarkAndTestFusableLong        avgt    8  213.249 ± 1.094  ns/op  (unchanged)
TestRemovalPeephole.benchmarkAndTestFusableLongSingle  avgt    8    2.074 ± 0.011  ns/op  (unchanged)
TestRemovalPeephole.benchmarkOrTestFusableInt          avgt    8   91.500 ± 0.142  ns/op  ~27% slower
TestRemovalPeephole.benchmarkOrTestFusableIntSingle    avgt    8    0.814 ± 0.002  ns/op  ~45% faster
TestRemovalPeephole.benchmarkOrTestFusableLong         avgt    8  113.691 ± 0.839  ns/op  (unchanged)
TestRemovalPeephole.benchmarkOrTestFusableLongSingle   avgt    8    1.185 ± 0.001  ns/op  (unchanged)
TestRemovalPeephole.benchmarkXorTestFusableInt         avgt    8   84.953 ± 0.766  ns/op  ~4% faster
TestRemovalPeephole.benchmarkXorTestFusableIntSingle   avgt    8    0.814 ± 0.001  ns/op  ~72% faster
TestRemovalPeephole.benchmarkXorTestFusableLong        avgt    8  113.158 ± 0.629  ns/op  (unchanged)
TestRemovalPeephole.benchmarkXorTestFusableLongSingle  avgt    8    1.184 ± 0.002  ns/op  (unchanged)

Results from my i5 13600k:
Before this patch:

Benchmark                                              Mode  Cnt   Score    Error  Units
TestRemovalPeephole.benchmarkAndTestFusableInt         avgt    8  71,368 ±  0,857  ns/op
TestRemovalPeephole.benchmarkAndTestFusableIntSingle   avgt    8   0,430 ±  0,001  ns/op
TestRemovalPeephole.benchmarkAndTestFusableLong        avgt    8  66,734 ±  0,128  ns/op
TestRemovalPeephole.benchmarkAndTestFusableLongSingle  avgt    8   0,489 ±  0,001  ns/op
TestRemovalPeephole.benchmarkOrTestFusableInt          avgt    8  57,290 ±  0,200  ns/op
TestRemovalPeephole.benchmarkOrTestFusableIntSingle    avgt    8   0,390 ±  0,001  ns/op
TestRemovalPeephole.benchmarkOrTestFusableLong         avgt    8  70,606 ±  1,595  ns/op
TestRemovalPeephole.benchmarkOrTestFusableLongSingle   avgt    8   0,390 ±  0,001  ns/op
TestRemovalPeephole.benchmarkXorTestFusableInt         avgt    8  62,514 ±  0,735  ns/op
TestRemovalPeephole.benchmarkXorTestFusableIntSingle   avgt    8   0,391 ±  0,001  ns/op
TestRemovalPeephole.benchmarkXorTestFusableLong        avgt    8  70,886 ±  0,572  ns/op
TestRemovalPeephole.benchmarkXorTestFusableLongSingle  avgt    8   0,391 ±  0,001  ns/op

After this patch:

Benchmark                                              Mode  Cnt   Score   Error  Units  Change
TestRemovalPeephole.benchmarkAndTestFusableInt         avgt    8  65,083 ± 0,246  ns/op  ~10% faster
TestRemovalPeephole.benchmarkAndTestFusableIntSingle   avgt    8   0,429 ± 0,001  ns/op  (unchanged)
TestRemovalPeephole.benchmarkAndTestFusableLong        avgt    8  66,533 ± 0,096  ns/op  (unchanged)
TestRemovalPeephole.benchmarkAndTestFusableLongSingle  avgt    8   0,488 ± 0,001  ns/op  (unchanged)
TestRemovalPeephole.benchmarkOrTestFusableInt          avgt    8  74,042 ± 0,412  ns/op  ~10% slower
TestRemovalPeephole.benchmarkOrTestFusableIntSingle    avgt    8   0,391 ± 0,002  ns/op  (unchanged)
TestRemovalPeephole.benchmarkOrTestFusableLong         avgt    8  70,850 ± 0,342  ns/op  (unchanged)
TestRemovalPeephole.benchmarkOrTestFusableLongSingle   avgt    8   0,390 ± 0,002  ns/op  (unchanged)
TestRemovalPeephole.benchmarkXorTestFusableInt         avgt    8  56,626 ± 2,274  ns/op  ~10% faster
TestRemovalPeephole.benchmarkXorTestFusableIntSingle   avgt    8   0,392 ± 0,002  ns/op  (unchanged)
TestRemovalPeephole.benchmarkXorTestFusableLong        avgt    8  69,942 ± 1,380  ns/op  (unchanged)
TestRemovalPeephole.benchmarkXorTestFusableLongSingle  avgt    8   0,391 ± 0,001  ns/op  (unchanged)

As you can see, there are some cases where this PR the tests run slower, but especially on older architectures, the speed gains outrank the cases where some performance is lost by a lot. Also, the emitted instructuon sequence is still two bytes per removed test instruction shorter.
The long variants never see any speedup whatsoever, but as other architectures might observe a speedup here as well and the resulting assembly is at least 2 bytes shorter, I think the rules for long still makes sense.

-------------

Commit messages:
 - Use a new approach by telling the peephole which rules set and clear which flags
 - Merge remote-tracking branch 'upstream/master' into testPeephole
 - Remove the old peepreplace empty block - we didn't use them
 - Add more benchmark cases
 - Add new benchmarks
 - Merge remote-tracking branch 'upstream/master' into testPeephole
 - Add new matching rules for xor/or
 - Do not use stdlib in peephole func
 - Add comment indication how the new peepprocedures work
 - Merge peephole rules into two large peephole operations
 - ... and 5 more: https://git.openjdk.org/jdk/compare/2e12a123...c434ade8

Changes: https://git.openjdk.org/jdk/pull/14172/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=14172&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8312213
  Stats: 528 lines in 12 files changed: 522 ins; 0 del; 6 mod
  Patch: https://git.openjdk.org/jdk/pull/14172.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/14172/head:pull/14172

PR: https://git.openjdk.org/jdk/pull/14172