RFR: 8278114: New addnode ideal optimization: converting "x + x" into "x << 1" [v3]
Zhiqiang Zang
duke at openjdk.java.net
Sat Dec 18 03:28:21 UTC 2021
On Sat, 18 Dec 2021 02:59:48 GMT, Claes Redestad <redestad at openjdk.org> wrote:
>> Tbh I don't find this transformation necessary, `x + x` is a very cheap operation and is generally easier for the optimiser to work with than `x << 1`.
>> Cheers.
>
>> Thank you both for commenting and suggestions! @merykitty @cl4es
>>
>> I moved the transformation into LShiftNode, i.e., to convert `(x + x) << 3` into `x << 4`, and I observed a 5% performance improvement (percentage of average time reduced) via the microbenchmark I write. I am not sure how much improvement we usually expect for a transformation, is this one currently a beneficial transformation? Thank you!
>
> For such a small win you might be looking at noise and should inspect the generated assembly to see that there's any real difference in the emitted code.
>
> Running your `test` microbenchmark using `-prof perfasm` on my Linux workstation the relevant code generated for the `helper` method with your patch looks like this:
>
>
> 4.40% 0x00007f2b0d6d720c: mov %esi,%eax
> 0x00007f2b0d6d720e: and $0x1fffff,%eax ;*iushr {reexecute=0 rethrow=0 return_oop=0}
> ; - org.openjdk.bench.vm.compiler.AddIdeal_XPlusX_LShiftC::helper at 8 (line 88)
>
>
> i.e., the `((x + x) << 10) >> 11` is fully transformed down into `x & 0x1fffff`, which seem like it could be a pretty optimal result for this code (zeroes out the top 11 bits).
>
> Baseline build generates this:
>
> 4.46% 0x00007f93e16d988c: add %esi,%esi
> 0x00007f93e16d988e: shl $0xa,%esi
> 0x00007f93e16d9891: mov %esi,%eax
> 0x00007f93e16d9893: shr $0xb,%eax ;*iushr {reexecute=0 rethrow=0 return_oop=0}
> ; - org.openjdk.bench.vm.compiler.AddIdeal_XPlusX_LShiftC::helper at 8 (line 88)
>
>
> Which is a pretty direct transformation of the Java code into assembly. So it seems your patch _does_ enable more optimal transformations!
>
> Only a few percent of the benchmark time seem to be spent in the relevant code, though. As written I can't establish a significant result between a baseline VM and a patched one as the variance from run to run is more than a few percent. A narrowed down benchmark that only tests a single value seem to better demonstrate the low-level effect:
>
>
> @Benchmark
> public int testInt() {
> return helper(ints_a[4711]);
> }
>
> @Benchmark
> public long testLong() {
> return helper(longs_a[4711]);
> }
>
>
> Not using the hand-rolled `sink` blackhole means we rely on JMH blackholes. Those traditionally had a bit more overhead than what you get with `sink`, but can since recently cooperate with the VM to remove overhead almost completely if you run with `-Djmh.blackhole.autoDetect=true` (soon enabled by default). I also removed `-TieredCompilation` from the `@Fork` and get slightly better and stable scores overall, though a bit longer warmup is needed.
>
> Baseline:
>
> Benchmark Mode Cnt Score Error Units
> AddIdeal_XPlusX_LShiftC.testInt avgt 15 8.925 ± 0.019 ns/op
> AddIdeal_XPlusX_LShiftC.testLong avgt 15 8.969 ± 0.107 ns/op
>
> Patch:
>
> Benchmark Mode Cnt Score Error Units
> AddIdeal_XPlusX_LShiftC.testInt avgt 15 7.650 ± 0.024 ns/op
> AddIdeal_XPlusX_LShiftC.testLong avgt 15 7.655 ± 0.087 ns/op
>
>
> A ~1.2ns improvement - or 1.15x speed-up after removing as much infrastructural overhead as possible. I think this looks in line with the observable change in the generated assembly (while we get rid of more than one instruction, they are likely pretty nicely pipelined on my Intel CPU). I think this looks like a reasonable improvement for such a straightforward change to C2.
Thank you @cl4es so much for taking time to review my pull request and deeply investigate the microbenchmark! I learnt a lot on writing better JMH microbenchmarks by reading your comment.
> > Thank you both for commenting and suggestions! @merykitty @cl4es
> > I moved the transformation into LShiftNode, i.e., to convert `(x + x) << 3` into `x << 4`, and I observed a 5% performance improvement (percentage of average time reduced) via the microbenchmark I write. I am not sure how much improvement we usually expect for a transformation, is this one currently a beneficial transformation? Thank you!
>
> For such a small win you might be looking at noise and should inspect the generated assembly to see that there's any real difference in the emitted code.
>
Great to know we can print assembly! Thanks.
> Running your `test` microbenchmark using `-prof perfasm` on my Linux workstation the relevant code generated for the `helper` method with your patch looks like this:
>
> ```
> 4.40% 0x00007f2b0d6d720c: mov %esi,%eax
> 0x00007f2b0d6d720e: and $0x1fffff,%eax ;*iushr {reexecute=0 rethrow=0 return_oop=0}
> ; - org.openjdk.bench.vm.compiler.AddIdeal_XPlusX_LShiftC::helper at 8 (line 88)
> ```
>
> i.e., the `((x + x) << 10) >> 11` is fully transformed down into `x & 0x1fffff`, which seem like it could be a pretty optimal result for this code (zeroes out the top 11 bits).
>
> Baseline build generates this:
>
> ```
> 4.46% 0x00007f93e16d988c: add %esi,%esi
> 0x00007f93e16d988e: shl $0xa,%esi
> 0x00007f93e16d9891: mov %esi,%eax
> 0x00007f93e16d9893: shr $0xb,%eax ;*iushr {reexecute=0 rethrow=0 return_oop=0}
> ; - org.openjdk.bench.vm.compiler.AddIdeal_XPlusX_LShiftC::helper at 8 (line 88)
> ```
>
> Which is a pretty direct transformation of the Java code into assembly. So it seems your patch _does_ enable more optimal transformations!
>
> Only a few percent of the benchmark time seem to be spent in the relevant code, though. As written I can't establish a significant result between a baseline VM and a patched one as the variance from run to run is more than a few percent. A narrowed down benchmark that only tests a single value seem to better demonstrate the low-level effect:
>
> ```
> @Benchmark
> public int testInt() {
> return helper(ints_a[4711]);
> }
>
> @Benchmark
> public long testLong() {
> return helper(longs_a[4711]);
> }
> ```
>
Thanks for point out this new feature. I did use `Blackhole` but found it brought too much overhead from infra so I referred to JMH examples for a sink method. I'll start using `Backhole` with `-Djmh.blackhole.autoDetect=true` from now on.
> Not using the hand-rolled `sink` blackhole means we rely on JMH blackholes. Those traditionally had a bit more overhead than what you get with `sink`, but can since recently cooperate with the VM to remove overhead almost completely if you run with `-Djmh.blackhole.autoDetect=true` (soon enabled by default). I also removed `-TieredCompilation` from the `@Fork` and get slightly better and stable scores overall, though a bit longer warmup is needed.
>
> Baseline:
>
> ```
> Benchmark Mode Cnt Score Error Units
> AddIdeal_XPlusX_LShiftC.testInt avgt 15 8.925 ± 0.019 ns/op
> AddIdeal_XPlusX_LShiftC.testLong avgt 15 8.969 ± 0.107 ns/op
> ```
>
> Patch:
>
> ```
> Benchmark Mode Cnt Score Error Units
> AddIdeal_XPlusX_LShiftC.testInt avgt 15 7.650 ± 0.024 ns/op
> AddIdeal_XPlusX_LShiftC.testLong avgt 15 7.655 ± 0.087 ns/op
> ```
>
If you get a chance, could you please share the modifications you made to the microbenchmark, if you made other changes than removing `-TieredCompilation` and testing only a single value, so I can test on my end, for more data points? Also, the examples in JMH repository seem out-of-date (e.g., the `Blackhole` autodetect feature) I think your example will be a good study material for me. Thank you.
> A ~1.2ns improvement - or 1.15x speed-up after removing as much infrastructural overhead as possible. I think this looks in line with the observable change in the generated assembly (while we get rid of more than one instruction, they are likely pretty nicely pipelined on my Intel CPU). I think this looks like a reasonable improvement for such a straightforward change to C2.
-------------
PR: https://git.openjdk.java.net/jdk/pull/6675
More information about the hotspot-compiler-dev
mailing list