RFR: 8312213: Remove unnecessary TEST instructions on x86 when flags reg will already be set
Tobias Hotz
duke at openjdk.org
Mon Jul 24 13:41:51 UTC 2023
This patch adds peephole rules to remove TEST instructions that operate on the result and right after a AND, XOR or OR instruction.
This pattern can emerge if the result of the and is compared against two values where one of the values is zero. The matcher does not have the capability to know that the instruction mentioned above also set the flag register to the same value.
According to https://www.felixcloutier.com/x86/and, https://www.felixcloutier.com/x86/xor, https://www.felixcloutier.com/x86/or and https://www.felixcloutier.com/x86/test the flags are set to same values for TEST, AND, XOR and OR, so this should be safe.
By adding peephole rules to remove the TEST instructions, the resulting assembly code can be shortend and a small speedup can be observed:
Results on Intel Core i5-8250U CPU
Before this patch:
Benchmark Mode Cnt Score Error Units
TestRemovalPeephole.benchmarkAndTestFusableInt avgt 8 182.353 ± 1.751 ns/op
TestRemovalPeephole.benchmarkAndTestFusableIntSingle avgt 8 1.110 ± 0.002 ns/op
TestRemovalPeephole.benchmarkAndTestFusableLong avgt 8 212.836 ± 0.310 ns/op
TestRemovalPeephole.benchmarkAndTestFusableLongSingle avgt 8 2.072 ± 0.002 ns/op
TestRemovalPeephole.benchmarkOrTestFusableInt avgt 8 72.052 ± 0.215 ns/op
TestRemovalPeephole.benchmarkOrTestFusableIntSingle avgt 8 1.406 ± 0.002 ns/op
TestRemovalPeephole.benchmarkOrTestFusableLong avgt 8 113.396 ± 0.666 ns/op
TestRemovalPeephole.benchmarkOrTestFusableLongSingle avgt 8 1.183 ± 0.001 ns/op
TestRemovalPeephole.benchmarkXorTestFusableInt avgt 8 88.683 ± 2.034 ns/op
TestRemovalPeephole.benchmarkXorTestFusableIntSingle avgt 8 1.406 ± 0.002 ns/op
TestRemovalPeephole.benchmarkXorTestFusableLong avgt 8 113.271 ± 0.602 ns/op
TestRemovalPeephole.benchmarkXorTestFusableLongSingle avgt 8 1.183 ± 0.001 ns/op
After this patch:
Benchmark Mode Cnt Score Error Units Change
TestRemovalPeephole.benchmarkAndTestFusableInt avgt 8 141.615 ± 4.747 ns/op ~29% faster
TestRemovalPeephole.benchmarkAndTestFusableIntSingle avgt 8 1.110 ± 0.002 ns/op (unchanged)
TestRemovalPeephole.benchmarkAndTestFusableLong avgt 8 213.249 ± 1.094 ns/op (unchanged)
TestRemovalPeephole.benchmarkAndTestFusableLongSingle avgt 8 2.074 ± 0.011 ns/op (unchanged)
TestRemovalPeephole.benchmarkOrTestFusableInt avgt 8 91.500 ± 0.142 ns/op ~27% slower
TestRemovalPeephole.benchmarkOrTestFusableIntSingle avgt 8 0.814 ± 0.002 ns/op ~45% faster
TestRemovalPeephole.benchmarkOrTestFusableLong avgt 8 113.691 ± 0.839 ns/op (unchanged)
TestRemovalPeephole.benchmarkOrTestFusableLongSingle avgt 8 1.185 ± 0.001 ns/op (unchanged)
TestRemovalPeephole.benchmarkXorTestFusableInt avgt 8 84.953 ± 0.766 ns/op ~4% faster
TestRemovalPeephole.benchmarkXorTestFusableIntSingle avgt 8 0.814 ± 0.001 ns/op ~72% faster
TestRemovalPeephole.benchmarkXorTestFusableLong avgt 8 113.158 ± 0.629 ns/op (unchanged)
TestRemovalPeephole.benchmarkXorTestFusableLongSingle avgt 8 1.184 ± 0.002 ns/op (unchanged)
Results from my i5 13600k:
Before this patch:
Benchmark Mode Cnt Score Error Units
TestRemovalPeephole.benchmarkAndTestFusableInt avgt 8 71,368 ± 0,857 ns/op
TestRemovalPeephole.benchmarkAndTestFusableIntSingle avgt 8 0,430 ± 0,001 ns/op
TestRemovalPeephole.benchmarkAndTestFusableLong avgt 8 66,734 ± 0,128 ns/op
TestRemovalPeephole.benchmarkAndTestFusableLongSingle avgt 8 0,489 ± 0,001 ns/op
TestRemovalPeephole.benchmarkOrTestFusableInt avgt 8 57,290 ± 0,200 ns/op
TestRemovalPeephole.benchmarkOrTestFusableIntSingle avgt 8 0,390 ± 0,001 ns/op
TestRemovalPeephole.benchmarkOrTestFusableLong avgt 8 70,606 ± 1,595 ns/op
TestRemovalPeephole.benchmarkOrTestFusableLongSingle avgt 8 0,390 ± 0,001 ns/op
TestRemovalPeephole.benchmarkXorTestFusableInt avgt 8 62,514 ± 0,735 ns/op
TestRemovalPeephole.benchmarkXorTestFusableIntSingle avgt 8 0,391 ± 0,001 ns/op
TestRemovalPeephole.benchmarkXorTestFusableLong avgt 8 70,886 ± 0,572 ns/op
TestRemovalPeephole.benchmarkXorTestFusableLongSingle avgt 8 0,391 ± 0,001 ns/op
After this patch:
Benchmark Mode Cnt Score Error Units Change
TestRemovalPeephole.benchmarkAndTestFusableInt avgt 8 65,083 ± 0,246 ns/op ~10% faster
TestRemovalPeephole.benchmarkAndTestFusableIntSingle avgt 8 0,429 ± 0,001 ns/op (unchanged)
TestRemovalPeephole.benchmarkAndTestFusableLong avgt 8 66,533 ± 0,096 ns/op (unchanged)
TestRemovalPeephole.benchmarkAndTestFusableLongSingle avgt 8 0,488 ± 0,001 ns/op (unchanged)
TestRemovalPeephole.benchmarkOrTestFusableInt avgt 8 74,042 ± 0,412 ns/op ~10% slower
TestRemovalPeephole.benchmarkOrTestFusableIntSingle avgt 8 0,391 ± 0,002 ns/op (unchanged)
TestRemovalPeephole.benchmarkOrTestFusableLong avgt 8 70,850 ± 0,342 ns/op (unchanged)
TestRemovalPeephole.benchmarkOrTestFusableLongSingle avgt 8 0,390 ± 0,002 ns/op (unchanged)
TestRemovalPeephole.benchmarkXorTestFusableInt avgt 8 56,626 ± 2,274 ns/op ~10% faster
TestRemovalPeephole.benchmarkXorTestFusableIntSingle avgt 8 0,392 ± 0,002 ns/op (unchanged)
TestRemovalPeephole.benchmarkXorTestFusableLong avgt 8 69,942 ± 1,380 ns/op (unchanged)
TestRemovalPeephole.benchmarkXorTestFusableLongSingle avgt 8 0,391 ± 0,001 ns/op (unchanged)
As you can see, there are some cases where this PR the tests run slower, but especially on older architectures, the speed gains outrank the cases where some performance is lost by a lot. Also, the emitted instructuon sequence is still two bytes per removed test instruction shorter.
The long variants never see any speedup whatsoever, but as other architectures might observe a speedup here as well and the resulting assembly is at least 2 bytes shorter, I think the rules for long still makes sense.
-------------
Commit messages:
- Use a new approach by telling the peephole which rules set and clear which flags
- Merge remote-tracking branch 'upstream/master' into testPeephole
- Remove the old peepreplace empty block - we didn't use them
- Add more benchmark cases
- Add new benchmarks
- Merge remote-tracking branch 'upstream/master' into testPeephole
- Add new matching rules for xor/or
- Do not use stdlib in peephole func
- Add comment indication how the new peepprocedures work
- Merge peephole rules into two large peephole operations
- ... and 5 more: https://git.openjdk.org/jdk/compare/2e12a123...c434ade8
Changes: https://git.openjdk.org/jdk/pull/14172/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=14172&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8312213
Stats: 528 lines in 12 files changed: 522 ins; 0 del; 6 mod
Patch: https://git.openjdk.org/jdk/pull/14172.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/14172/head:pull/14172
PR: https://git.openjdk.org/jdk/pull/14172
More information about the hotspot-compiler-dev
mailing list