RFR: 8312213: Remove unnecessary TEST instructions on x86 when flags reg will already be set
Tobias Hotz
duke at openjdk.org
Mon Jul 24 13:41:54 UTC 2023
On Fri, 26 May 2023 10:32:00 GMT, Tobias Hotz <duke at openjdk.org> wrote:
> This patch adds peephole rules that remove a TEST instruction when it operates on the result of an immediately preceding AND, XOR or OR instruction.
> This pattern can emerge when the result of the AND is compared against two values, one of which is zero. The matcher is not able to recognize that the preceding instruction has already set the flags register to the same value.
> According to https://www.felixcloutier.com/x86/and, https://www.felixcloutier.com/x86/xor, https://www.felixcloutier.com/x86/or and https://www.felixcloutier.com/x86/test, the flags are set to the same values for TEST, AND, XOR and OR, so this should be safe.
> By adding peephole rules to remove the TEST instructions, the resulting assembly code can be shortened and a small speedup can be observed:
> Results on Intel Core i5-8250U CPU
> Before this patch:
>
> Benchmark Mode Cnt Score Error Units
> TestRemovalPeephole.benchmarkAndTestFusableInt avgt 8 182.353 ± 1.751 ns/op
> TestRemovalPeephole.benchmarkAndTestFusableIntSingle avgt 8 1.110 ± 0.002 ns/op
> TestRemovalPeephole.benchmarkAndTestFusableLong avgt 8 212.836 ± 0.310 ns/op
> TestRemovalPeephole.benchmarkAndTestFusableLongSingle avgt 8 2.072 ± 0.002 ns/op
> TestRemovalPeephole.benchmarkOrTestFusableInt avgt 8 72.052 ± 0.215 ns/op
> TestRemovalPeephole.benchmarkOrTestFusableIntSingle avgt 8 1.406 ± 0.002 ns/op
> TestRemovalPeephole.benchmarkOrTestFusableLong avgt 8 113.396 ± 0.666 ns/op
> TestRemovalPeephole.benchmarkOrTestFusableLongSingle avgt 8 1.183 ± 0.001 ns/op
> TestRemovalPeephole.benchmarkXorTestFusableInt avgt 8 88.683 ± 2.034 ns/op
> TestRemovalPeephole.benchmarkXorTestFusableIntSingle avgt 8 1.406 ± 0.002 ns/op
> TestRemovalPeephole.benchmarkXorTestFusableLong avgt 8 113.271 ± 0.602 ns/op
> TestRemovalPeephole.benchmarkXorTestFusableLongSingle avgt 8 1.183 ± 0.001 ns/op
>
> After this patch:
>
> Benchmark Mode Cnt Score Error Units Change
> TestRemovalPeephole.benchmarkAndTestFusableInt avgt 8 141.615 ± 4.747 ns/op ~29% faster
> TestRemovalPeephole.benchmarkAndTestFusableIntSingle avgt 8 1.110 ± 0.002 ns/op (unchanged)
> TestRemovalPeephole.benchmarkAndTestFusableLong avgt 8 213.249 ± 1.094 ns/op (unchanged)
> TestRemovalPeephole.benchmarkAndTestFusableLongSingle avgt 8 2.074 ± 0.011...
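For context, the pattern described in the quoted description looks roughly like the following (an illustrative sketch, not taken from the actual benchmark sources; the class, method and constants are made up):

    // Before this patch, C2 would typically emit an AND to compute `masked`
    // and then a TEST of the result against itself for the `== 0` check, even
    // though the AND already set ZF/SF from the same value. The second
    // comparison is what keeps the zero check from being folded into the AND.
    class MaskedCompare {
        static int classify(int x) {
            int masked = x & 0x0F0F0F0F;  // andl  reg, 0x0F0F0F0F   (sets flags)
            if (masked == 0) {            // testl reg, reg          (redundant)
                return -1;
            }
            if (masked == 0x01010101) {   // cmpl  reg, 0x01010101
                return 1;
            }
            return 0;
        }
    }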
Thanks for the feedback.
I will definitely look into other instructions such as OR where this might be feasible.
Regarding the multiple peepmatches: It looks like this is not possible with the current system, but I've just updated how the peepmatch block is interpreted in the test_may_remove peepprocedure. This way we can keep the rules concise.
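To illustrate, the shape of such a rule is something like this (a sketch from memory; the instruct names and the exact peepmatch/peepprocedure details in x86_64.ad may differ from what the patch actually contains):

    peephole
    %{
      // Match a testI_reg that immediately follows the andI_rReg producing its
      // input; test_may_remove then decides whether the TEST can be dropped
      // because the AND already left the flags in the required state.
      peepmatch (testI_reg andI_rReg);
      peepprocedure (test_may_remove);
    %}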
Thanks for the feedback!
I double-checked, and it seems that flags other than ZF are used as well. I've added some new cases to cover XOR and OR in addition to the existing AND case, as they all set the same flags as TEST does.
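A concrete (made-up) example of a flag other than ZF being consumed: a signed "masked value is negative" check only needs SF (and OF, which both AND and TEST clear), so removing the TEST is still correct there. This is only meant to illustrate the flag semantics, not the exact sequence C2 emits:

    class MaskSign {
        static boolean maskedNegative(int x, int mask) {
            // andl sets SF from the sign bit of (x & mask) and clears OF;
            // a following testl would recompute exactly the same SF/ZF/OF,
            // so a jl/js style branch can use the AND's flags directly.
            return (x & mask) < 0;
        }
    }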
@jaskarth Yeah, I still need a JBS entry. If you can create one, that would be nice; otherwise I'm going to use the web bug-report form again.
I benchmarked the new peephole rules and got the following results:
OLD:
Benchmark Mode Cnt Score Error Units
TestRemovalPeephole.benchmarkAndTestFusableInt avgt 8 71,351 ± 0,176 ns/op
TestRemovalPeephole.benchmarkAndTestFusableLong avgt 8 66,763 ± 0,104 ns/op
TestRemovalPeephole.benchmarkOrTestFusableInt avgt 8 57,555 ± 0,341 ns/op
TestRemovalPeephole.benchmarkOrTestFusableLong avgt 8 72,158 ± 0,337 ns/op
TestRemovalPeephole.benchmarkXorTestFusableInt avgt 8 61,551 ± 0,346 ns/op
TestRemovalPeephole.benchmarkXorTestFusableLong avgt 8 70,554 ± 0,208 ns/op
NEW:
Benchmark Mode Cnt Score Error Units Change
TestRemovalPeephole.benchmarkAndTestFusableInt avgt 8 65,061 ± 0,096 ns/op (10%)
TestRemovalPeephole.benchmarkAndTestFusableLong avgt 8 66,855 ± 0,119 ns/op (unchanged)
TestRemovalPeephole.benchmarkOrTestFusableInt avgt 8 73,707 ± 0,385 ns/op (-20%)
TestRemovalPeephole.benchmarkOrTestFusableLong avgt 8 71,062 ± 1,190 ns/op (unchanged)
TestRemovalPeephole.benchmarkXorTestFusableInt avgt 8 56,420 ± 1,441 ns/op (10%)
TestRemovalPeephole.benchmarkXorTestFusableLong avgt 8 69,878 ± 1,318 ns/op (unchanged)
I am currently not quite sure why there is such a large regression in benchmarkOrTestFusableInt, but it seems to be reproducible, at least on my machine. I plan to test another machine soon to see whether this behaviour extends to other processors as well. The resulting assembly looks sane, though, and the XOR case also behaves as expected.
I've added a few more benchmarks that are basically the same as the old ones and reran the benchmarks on my i5-13600K and my i5-8250U. Here are the results:
Results on Intel Core i5-8250U CPU
Before this patch:
Benchmark Mode Cnt Score Error Units
TestRemovalPeephole.benchmarkAndTestFusableInt avgt 8 182.353 ± 1.751 ns/op
TestRemovalPeephole.benchmarkAndTestFusableIntSingle avgt 8 1.110 ± 0.002 ns/op
TestRemovalPeephole.benchmarkAndTestFusableLong avgt 8 212.836 ± 0.310 ns/op
TestRemovalPeephole.benchmarkAndTestFusableLongSingle avgt 8 2.072 ± 0.002 ns/op
TestRemovalPeephole.benchmarkOrTestFusableInt avgt 8 72.052 ± 0.215 ns/op
TestRemovalPeephole.benchmarkOrTestFusableIntSingle avgt 8 1.406 ± 0.002 ns/op
TestRemovalPeephole.benchmarkOrTestFusableLong avgt 8 113.396 ± 0.666 ns/op
TestRemovalPeephole.benchmarkOrTestFusableLongSingle avgt 8 1.183 ± 0.001 ns/op
TestRemovalPeephole.benchmarkXorTestFusableInt avgt 8 88.683 ± 2.034 ns/op
TestRemovalPeephole.benchmarkXorTestFusableIntSingle avgt 8 1.406 ± 0.002 ns/op
TestRemovalPeephole.benchmarkXorTestFusableLong avgt 8 113.271 ± 0.602 ns/op
TestRemovalPeephole.benchmarkXorTestFusableLongSingle avgt 8 1.183 ± 0.001 ns/op
After this patch:
Benchmark Mode Cnt Score Error Units Change
TestRemovalPeephole.benchmarkAndTestFusableInt avgt 8 141.615 ± 4.747 ns/op ~29% faster
TestRemovalPeephole.benchmarkAndTestFusableIntSingle avgt 8 1.110 ± 0.002 ns/op (unchanged)
TestRemovalPeephole.benchmarkAndTestFusableLong avgt 8 213.249 ± 1.094 ns/op (unchanged)
TestRemovalPeephole.benchmarkAndTestFusableLongSingle avgt 8 2.074 ± 0.011 ns/op (unchanged)
TestRemovalPeephole.benchmarkOrTestFusableInt avgt 8 91.500 ± 0.142 ns/op ~27% slower
TestRemovalPeephole.benchmarkOrTestFusableIntSingle avgt 8 0.814 ± 0.002 ns/op ~45% faster
TestRemovalPeephole.benchmarkOrTestFusableLong avgt 8 113.691 ± 0.839 ns/op (unchanged)
TestRemovalPeephole.benchmarkOrTestFusableLongSingle avgt 8 1.185 ± 0.001 ns/op (unchanged)
TestRemovalPeephole.benchmarkXorTestFusableInt avgt 8 84.953 ± 0.766 ns/op ~4% faster
TestRemovalPeephole.benchmarkXorTestFusableIntSingle avgt 8 0.814 ± 0.001 ns/op ~72% faster
TestRemovalPeephole.benchmarkXorTestFusableLong avgt 8 113.158 ± 0.629 ns/op (unchanged)
TestRemovalPeephole.benchmarkXorTestFusableLongSingle avgt 8 1.184 ± 0.002 ns/op (unchanged)
Results from my i5-13600K:
Before this patch:
Benchmark Mode Cnt Score Error Units
TestRemovalPeephole.benchmarkAndTestFusableInt avgt 8 71,368 ± 0,857 ns/op
TestRemovalPeephole.benchmarkAndTestFusableIntSingle avgt 8 0,430 ± 0,001 ns/op
TestRemovalPeephole.benchmarkAndTestFusableLong avgt 8 66,734 ± 0,128 ns/op
TestRemovalPeephole.benchmarkAndTestFusableLongSingle avgt 8 0,489 ± 0,001 ns/op
TestRemovalPeephole.benchmarkOrTestFusableInt avgt 8 57,290 ± 0,200 ns/op
TestRemovalPeephole.benchmarkOrTestFusableIntSingle avgt 8 0,390 ± 0,001 ns/op
TestRemovalPeephole.benchmarkOrTestFusableLong avgt 8 70,606 ± 1,595 ns/op
TestRemovalPeephole.benchmarkOrTestFusableLongSingle avgt 8 0,390 ± 0,001 ns/op
TestRemovalPeephole.benchmarkXorTestFusableInt avgt 8 62,514 ± 0,735 ns/op
TestRemovalPeephole.benchmarkXorTestFusableIntSingle avgt 8 0,391 ± 0,001 ns/op
TestRemovalPeephole.benchmarkXorTestFusableLong avgt 8 70,886 ± 0,572 ns/op
TestRemovalPeephole.benchmarkXorTestFusableLongSingle avgt 8 0,391 ± 0,001 ns/op
After this patch:
Benchmark Mode Cnt Score Error Units Change
TestRemovalPeephole.benchmarkAndTestFusableInt avgt 8 65,083 ± 0,246 ns/op ~10% faster
TestRemovalPeephole.benchmarkAndTestFusableIntSingle avgt 8 0,429 ± 0,001 ns/op (unchanged)
TestRemovalPeephole.benchmarkAndTestFusableLong avgt 8 66,533 ± 0,096 ns/op (unchanged)
TestRemovalPeephole.benchmarkAndTestFusableLongSingle avgt 8 0,488 ± 0,001 ns/op (unchanged)
TestRemovalPeephole.benchmarkOrTestFusableInt avgt 8 74,042 ± 0,412 ns/op ~10% slower
TestRemovalPeephole.benchmarkOrTestFusableIntSingle avgt 8 0,391 ± 0,002 ns/op (unchanged)
TestRemovalPeephole.benchmarkOrTestFusableLong avgt 8 70,850 ± 0,342 ns/op (unchanged)
TestRemovalPeephole.benchmarkOrTestFusableLongSingle avgt 8 0,390 ± 0,002 ns/op (unchanged)
TestRemovalPeephole.benchmarkXorTestFusableInt avgt 8 56,626 ± 2,274 ns/op ~10% faster
TestRemovalPeephole.benchmarkXorTestFusableIntSingle avgt 8 0,392 ± 0,002 ns/op (unchanged)
TestRemovalPeephole.benchmarkXorTestFusableLong avgt 8 69,942 ± 1,380 ns/op (unchanged)
TestRemovalPeephole.benchmarkXorTestFusableLongSingle avgt 8 0,391 ± 0,001 ns/op (unchanged)
As you can see, there are still some cases where the tests run slower with this PR, but especially on older architectures the speed gains far outweigh the cases where some performance is lost. Also, the emitted instruction sequence is two bytes shorter per removed TEST instruction.
Thanks for the feedback!
I agree that this is still far from a perfect solution and that tracking the required flags would be ideal, but there are many rules that use the flags register, and all of them would need to be tracked. If you are interested in that, feel free to take this PR as a starting point, but I'm currently not interested in pursuing that path. Otherwise we can still keep this PR in its current form, IMO, as a first step towards removing redundant TEST instructions.
-------------
PR Comment: https://git.openjdk.org/jdk/pull/14172#issuecomment-1574151203
PR Comment: https://git.openjdk.org/jdk/pull/14172#issuecomment-1608411081
PR Comment: https://git.openjdk.org/jdk/pull/14172#issuecomment-1630763334
PR Comment: https://git.openjdk.org/jdk/pull/14172#issuecomment-1635551701
More information about the hotspot-compiler-dev mailing list