RFR: 8278114: New addnode ideal optimization: converting "x + x" into "x << 1" [v3]
Claes Redestad
redestad at openjdk.java.net
Sat Dec 18 03:09:27 UTC 2021
On Thu, 16 Dec 2021 18:36:31 GMT, Quan Anh Mai <duke at openjdk.java.net> wrote:
>> Zhiqiang Zang has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains three commits:
>>
>> - Merge master.
>> - enrich tests and do the same transformation for "long".
>> - include a new optimization for ideal in addnode: Convert "x + x" into "x << 1", and associated tests.
>
> Tbh I don't find this transformation necessary; `x + x` is a very cheap operation and is generally easier for the optimiser to work with than `x << 1`.
> Cheers.
> Thank you both for commenting and suggestions! @merykitty @cl4es
>
> I moved the transformation into LShiftNode, i.e., to convert `(x + x) << 3` into `x << 4`, and I observed a 5% performance improvement (percentage of average time reduced) via the microbenchmark I wrote. I am not sure how much improvement we usually expect from a transformation; is this one beneficial enough? Thank you!
For such a small win you might be looking at noise and should inspect the generated assembly to see that there's any real difference in the emitted code.
Running your `test` microbenchmark using `-prof perfasm` on my Linux workstation the relevant code generated for the `helper` method with your patch looks like this:
  4.40%   0x00007f2b0d6d720c: mov    %esi,%eax
          0x00007f2b0d6d720e: and    $0x1fffff,%eax  ;*iushr {reexecute=0 rethrow=0 return_oop=0}
                                                     ; - org.openjdk.bench.vm.compiler.AddIdeal_XPlusX_LShiftC::helper at 8 (line 88)
i.e., the `((x + x) << 10) >>> 11` is fully transformed down into `x & 0x1fffff`, which seems like it could be a pretty optimal result for this code (the unsigned shift zeroes out the top 11 bits).
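The equivalence is easy to sanity-check in plain Java. A minimal sketch, with method names chosen to mirror the benchmark's `helper` (the `folded` name is illustrative, not from the patch):

```java
public class ShiftIdentity {
    // What the benchmark computes: (x + x) == x << 1, then << 10 gives
    // x << 11, and the unsigned >>> 11 discards the top 11 bits.
    static int helper(int x) {
        return ((x + x) << 10) >>> 11;
    }

    // The form C2 folds this down to with the patch applied:
    // masking off everything above bit 20 (0x1fffff == 2^21 - 1).
    static int folded(int x) {
        return x & 0x1fffff;
    }

    public static void main(String[] args) {
        // Spot-check the equivalence on edge values.
        int[] samples = {0, 1, -1, 42, Integer.MIN_VALUE, Integer.MAX_VALUE};
        for (int x : samples) {
            if (helper(x) != folded(x)) {
                throw new AssertionError("mismatch at x = " + x);
            }
        }
        System.out.println("equivalent on all samples");
    }
}
```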
Baseline build generates this:
  4.46%   0x00007f93e16d988c: add    %esi,%esi
          0x00007f93e16d988e: shl    $0xa,%esi
          0x00007f93e16d9891: mov    %esi,%eax
          0x00007f93e16d9893: shr    $0xb,%eax       ;*iushr {reexecute=0 rethrow=0 return_oop=0}
                                                     ; - org.openjdk.bench.vm.compiler.AddIdeal_XPlusX_LShiftC::helper at 8 (line 88)
This is a pretty direct transformation of the Java code into assembly. So it seems your patch _does_ enable more optimal code!
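For reference, the rewrite the patch performs in LShiftNode, folding `(x + x) << c` into `x << (c + 1)`, is an exact identity for `c < 31` under Java's wrapping two's-complement `int` arithmetic (for `c == 31` the folded shift count would wrap to 0, a corner case any such transformation must guard against). A quick spot-check, with `c = 3` as in the example above:

```java
public class LShiftIdentity {
    public static void main(String[] args) {
        // (x + x) << 3 == x << 4 holds for every int x because Java int
        // arithmetic wraps modulo 2^32, so overflow in x + x loses
        // exactly the bits the extra shift would have discarded anyway.
        int[] samples = {0, 1, -1, 0x40000000, Integer.MIN_VALUE, Integer.MAX_VALUE};
        for (int x : samples) {
            if (((x + x) << 3) != (x << 4)) {
                throw new AssertionError("mismatch at x = " + x);
            }
        }
        System.out.println("identity holds on all samples");
    }
}
```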
Only a few percent of the benchmark time seems to be spent in the relevant code, though. As written, I can't establish a significant result between a baseline VM and a patched one, since the run-to-run variance is more than a few percent. A narrowed-down benchmark that only tests a single value demonstrates the low-level effect better:
    @Benchmark
    public int testInt() {
        return helper(ints_a[4711]);
    }

    @Benchmark
    public long testLong() {
        return helper(longs_a[4711]);
    }
Not using the hand-rolled `sink` blackhole means we rely on JMH blackholes. Those have traditionally had a bit more overhead than `sink`, but they can now cooperate with the VM to remove that overhead almost completely if you run with `-Djmh.blackhole.autoDetect=true` (soon to be enabled by default). I also removed `-TieredCompilation` from the `@Fork` and got slightly better and more stable scores overall, though a bit more warmup is needed.
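For context, a hand-rolled `sink` is typically just a volatile store, sketched below (the class and field names here are illustrative, not taken from the actual benchmark harness):

```java
public class Sink {
    // A hand-rolled blackhole: a volatile store is an observable side
    // effect, so the JIT cannot treat the computed value as dead code
    // and eliminate the benchmarked work.
    static volatile int sinkInt;

    public static void sink(int v) {
        sinkInt = v;
    }

    public static void main(String[] args) {
        int x = 12345;
        sink(((x + x) << 10) >>> 11); // keeps the computation alive
        System.out.println(sinkInt);  // prints 12345 (12345 < 2^21, so the mask is a no-op)
    }
}
```

Returning the value from an `@Benchmark` method, as in the snippet above, lets JMH consume it through its own `Blackhole` instead.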
Baseline:

    Benchmark                         Mode  Cnt  Score   Error  Units
    AddIdeal_XPlusX_LShiftC.testInt   avgt   15  8.925 ± 0.019  ns/op
    AddIdeal_XPlusX_LShiftC.testLong  avgt   15  8.969 ± 0.107  ns/op

Patch:

    Benchmark                         Mode  Cnt  Score   Error  Units
    AddIdeal_XPlusX_LShiftC.testInt   avgt   15  7.650 ± 0.024  ns/op
    AddIdeal_XPlusX_LShiftC.testLong  avgt   15  7.655 ± 0.087  ns/op
A ~1.3ns improvement - or a ~1.17x speed-up - after removing as much infrastructural overhead as possible. This looks in line with the observable change in the generated assembly (while we get rid of more than one instruction, they are likely pretty nicely pipelined on my Intel CPU). All in all, a reasonable improvement for such a straightforward change to C2.
-------------
PR: https://git.openjdk.java.net/jdk/pull/6675
More information about the hotspot-compiler-dev
mailing list