RFR: 8278114: New addnode ideal optimization: converting "x + x" into "x << 1" [v3]
Claes Redestad
redestad at openjdk.java.net
Sat Dec 18 03:46:21 UTC 2021
On Sat, 18 Dec 2021 02:59:48 GMT, Claes Redestad <redestad at openjdk.org> wrote:
>> Tbh I don't find this transformation necessary; `x + x` is a very cheap operation and is generally easier for the optimiser to work with than `x << 1`.
>> Cheers.
>
>> Thank you both for commenting and suggestions! @merykitty @cl4es
>>
>> I moved the transformation into LShiftNode, i.e., to convert `(x + x) << 3` into `x << 4`, and I observed a 5% performance improvement (reduction in average time) via the microbenchmark I wrote. I am not sure how much improvement we usually expect from a transformation; is this one beneficial enough? Thank you!
>
> For such a small win you might be looking at noise and should inspect the generated assembly to see that there's any real difference in the emitted code.
>
> Running your `test` microbenchmark using `-prof perfasm` on my Linux workstation the relevant code generated for the `helper` method with your patch looks like this:
>
>
> 4.40% 0x00007f2b0d6d720c: mov %esi,%eax
> 0x00007f2b0d6d720e: and $0x1fffff,%eax ;*iushr {reexecute=0 rethrow=0 return_oop=0}
> ; - org.openjdk.bench.vm.compiler.AddIdeal_XPlusX_LShiftC::helper at 8 (line 88)
>
>
> i.e., the `((x + x) << 10) >>> 11` is fully transformed into `x & 0x1fffff`, which seems like a pretty optimal result for this code (it zeroes out the top 11 bits).
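The algebra behind that fold can be sanity-checked in plain Java (the class below is illustrative, not the benchmark's actual source): since `x + x` wraps exactly like `x << 1`, `((x + x) << 10) >>> 11` equals `(x << 11) >>> 11`, which keeps precisely the low 21 bits:

```java
public class ShiftFoldDemo {
    // Illustrative stand-in for the benchmark's helper method:
    // x + x == x << 1 (mod 2^32), so ((x + x) << 10) >>> 11
    // == (x << 11) >>> 11 == x & 0x1fffff (the low 21 bits).
    static int helper(int x) {
        return ((x + x) << 10) >>> 11;
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 4711, Integer.MIN_VALUE, Integer.MAX_VALUE};
        for (int x : samples) {
            if (helper(x) != (x & 0x1fffff)) {
                throw new AssertionError("mismatch for x = " + x);
            }
        }
        System.out.println("fold is equivalent for all sampled values");
    }
}
```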
>
> Baseline build generates this:
>
> 4.46% 0x00007f93e16d988c: add %esi,%esi
> 0x00007f93e16d988e: shl $0xa,%esi
> 0x00007f93e16d9891: mov %esi,%eax
> 0x00007f93e16d9893: shr $0xb,%eax ;*iushr {reexecute=0 rethrow=0 return_oop=0}
> ; - org.openjdk.bench.vm.compiler.AddIdeal_XPlusX_LShiftC::helper at 8 (line 88)
>
>
> Which is a pretty direct transformation of the Java code into assembly. So it seems your patch _does_ enable more optimal transformations!
>
> Only a few percent of the benchmark time seems to be spent in the relevant code, though. As written, I can't establish a significant difference between a baseline VM and a patched one, since the run-to-run variance is more than a few percent. A narrowed-down benchmark that only tests a single value demonstrates the low-level effect better:
>
>
> @Benchmark
> public int testInt() {
>     return helper(ints_a[4711]);
> }
>
> @Benchmark
> public long testLong() {
>     return helper(longs_a[4711]);
> }
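For context, a sketch of how those narrowed benchmarks might sit in the surrounding JMH class. This is an assumption, not the benchmark's actual source: the field names `ints_a`/`longs_a` come from the snippet above, while the array size, seed, and setup are invented, and the JMH annotations must be on the classpath for it to compile:

```java
// Sketch only -- the array size, seed, and setup are assumptions.
import java.util.Random;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
public class AddIdeal_XPlusX_LShiftC {
    private int[] ints_a;
    private long[] longs_a;

    @Setup
    public void setup() {
        Random r = new Random(42);
        ints_a = new int[10_000];
        longs_a = new long[10_000];
        for (int i = 0; i < ints_a.length; i++) {
            ints_a[i] = r.nextInt();
            longs_a[i] = r.nextLong();
        }
    }

    // With the patch, C2 folds ((x + x) << 10) >>> 11 into x & 0x1fffff.
    private static int helper(int x) {
        return ((x + x) << 10) >>> 11;
    }

    // Long variant: keeps the low 53 bits (x & 0x1fffffffffffffL).
    private static long helper(long x) {
        return ((x + x) << 10) >>> 11;
    }

    @Benchmark
    public int testInt() {
        return helper(ints_a[4711]);
    }

    @Benchmark
    public long testLong() {
        return helper(longs_a[4711]);
    }
}
```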
>
>
> Not using the hand-rolled `sink` blackhole means we rely on JMH blackholes. Those traditionally had a bit more overhead than what you get with `sink`, but they can now cooperate with the VM to remove that overhead almost completely if you run with `-Djmh.blackhole.autoDetect=true` (soon to be enabled by default). I also removed `-TieredCompilation` from the `@Fork` and got slightly better and more stable scores overall, though a bit more warmup is needed.
>
> Baseline:
>
> Benchmark                         Mode  Cnt  Score   Error  Units
> AddIdeal_XPlusX_LShiftC.testInt   avgt   15  8.925 ± 0.019  ns/op
> AddIdeal_XPlusX_LShiftC.testLong  avgt   15  8.969 ± 0.107  ns/op
>
> Patch:
>
> Benchmark                         Mode  Cnt  Score   Error  Units
> AddIdeal_XPlusX_LShiftC.testInt   avgt   15  7.650 ± 0.024  ns/op
> AddIdeal_XPlusX_LShiftC.testLong  avgt   15  7.655 ± 0.087  ns/op
>
>
> A ~1.3 ns improvement, or a ~1.17x speed-up, after removing as much infrastructural overhead as possible. I think this is in line with the observable change in the generated assembly (while we get rid of more than one instruction, the removed instructions likely pipeline nicely on my Intel CPU). This looks like a reasonable improvement for such a straightforward change to C2.
> Thank you @cl4es so much for taking the time to review my pull request and deeply investigate the microbenchmark! I learned a lot about writing better JMH microbenchmarks from reading your comment.
My pleasure. I was intrigued that you picked up on my suggestion to enable folding of constant shift transforms so quickly, and I wanted to see for myself that it ended up compiling down to something more compact and optimal.
> If you get a chance, could you please share the modifications you made to the microbenchmark, if there were other changes besides removing `-TieredCompilation` and testing only a single value, so I can test on my end for more data points? Also, the examples in the JMH repository seem out of date (e.g., they predate the `Blackhole` autodetect feature); I think your example will be good study material for me. Thank you.
No other code changes, just added those two simplified benchmarks and commented out the existing ones. I did tweak the command line to print in `ns` (you can change the annotation to get the same result) and picked the smallest number of iterations that still gave stable results on my machine while trying things out: `-i 15 -wi 15 -tu ns`.
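Put together, an invocation along those lines might look like this (the benchmarks.jar path is an assumption; only the flags `-i 15 -wi 15 -tu ns` and the autodetect property come from the discussion above):

```shell
# Hypothetical jar path; flags taken from the discussion above.
java -Djmh.blackhole.autoDetect=true \
     -jar benchmarks.jar 'AddIdeal_XPlusX_LShiftC.test.*' \
     -i 15 -wi 15 -tu ns
```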
-------------
PR: https://git.openjdk.java.net/jdk/pull/6675
More information about the hotspot-compiler-dev mailing list