RFR: 8279888: Local variable independently used by multiple loops can interfere with loop optimizations [v2]
Tobias Hartmann
thartmann at openjdk.java.net
Fri Feb 11 09:49:07 UTC 2022
On Wed, 9 Feb 2022 10:23:46 GMT, Roland Westrelin <roland at openjdk.org> wrote:
>> The bytecode of the 2 methods of the benchmark is structured
>> differently: loopsWithSharedLocal(), the slowest one, has multiple
>> backedges with a single head while loopsWithScopedLocal() has a single
>> backedge and all the paths in the loop body merge before the
>> backedge. loopsWithSharedLocal() has its head cloned which results in
>> a 2 loops loop nest.
>>
>> loopsWithSharedLocal() is slow when 2 of the backedges are most
>> commonly taken with one taken only 3 times as often as the other
>> one. So a thread executing that code only runs the inner loop for a
>> few iterations before exiting it and executing the outer loop. I think
>> what happens is that any time the inner loop is entered, some
>> predicates are executed and the overhead of the setup of loop strip
>> mining (if it's enabled) has to be paid. Also, if iteration
>> splitting/unrolling was applied, the main loop is likely never
>> executed and all time is spent in the pre/post loops where potentially
>> some range checks remain.
>>
>> The fix I propose is that ciTypeFlow, when it clone heads, not only
>> rewires the most frequent loop but also all this other frequent loops
>> that share the same head. loopsWithSharedLocal() and
>> loopsWithScopedLocal() are then fairly similar once c2 parses them.
>>
>> Without the patch I measure:
>>
>> LoopLocals.loopsWithScopedLocal mixed avgt 5 1108.874 ± 250.463 ns/op
>> LoopLocals.loopsWithSharedLocal mixed avgt 5 1575.665 ± 70.372 ns/op
>>
>> with it:
>>
>> LoopLocals.loopsWithScopedLocal mixed avgt 5 1108.180 ± 245.873 ns/op
>> LoopLocals.loopsWithSharedLocal mixed avgt 5 1234.665 ± 157.912 ns/op
>>
>> But this patch also causes a regression when running one of the
>> benchmarks added by 8278518. From:
>>
>> SharedLoopHeader.sharedHeader avgt 5 505.993 ± 44.126 ns/op
>>
>> to:
>>
>> SharedLoopHeader.sharedHeader avgt 5 724.253 ± 1.664 ns/op
>>
>> The hot method of this benchmark used to be compiled with 2 loops, the
>> inner one a counted loop. With the patch, it's now compiled with a
>> single one which can't be converted into a counted loop because the
>> loop variable is incremented by a different amount along the 2 paths
>> in the loop body. What I propose to fix this is to add a new loop
>> transformation that detects that, because of a merge point, a loop
>> can't be turned into a counted loop and transforms it into 2
>> loops. The benchmark performs better with this:
>>
>> SharedLoopHeader.sharedHeader avgt 5 567.150 ± 6.120 ns/op
>>
>> Not quite on par with the previous score but AFAICT this is due to
>> code generation not being as good (the loop head can't be aligned in
>> particular).
>>
>> In short, I propose:
>>
>> - changing ciTypeFlow so that, when it pays off, a loop with
>> multiple backedges is compiled as a single loop with a merge point in
>> the loop body
>>
>> - adding a new loop transformation so that, when it pays off, a loop
>> with a merge point in the loop body is converted into a 2 loops loop
>> nest, essentially the opposite transformation.
>
> Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision:
>
> - fix
> - Merge branch 'master' into JDK-8279888
> - whitespaces
> - fix & test
Testing now passed but I need more time to review this.
-------------
PR: https://git.openjdk.java.net/jdk/pull/7352
More information about the hotspot-compiler-dev
mailing list