RFR: 8325467: Support methods with many arguments in C2 [v12]

Tue Apr 1 16:34:24 UTC 2025

On Tue, 1 Apr 2025 14:19:10 GMT, Daniel Lundén <dlunden at openjdk.org> wrote:

>> If a method has a large number of parameters, we currently bail out from C2 compilation.
>> 
>> ### Changeset
>> 
>> Allowing C2 compilation of methods with a large number of parameters requires fundamental changes to the register mask data structure, used in many places in C2. In particular, register masks currently have a statically determined size and cannot represent arbitrary numbers of stack slots. This is needed if we want to compile methods with arbitrary numbers of parameters. Register mask operations are present in performance-sensitive parts of C2, which further complicates changes.
>> 
>> Changes:
>> - Add functionality to dynamically grow/extend register masks. I experimented with a number of design choices to achieve this. To keep the common case (normal number of method parameters) quick and also to avoid more intrusive changes to the current `RegMask` interface, I decided to leave the "base" statically allocated memory for masks unchanged and only use dynamically allocated memory in the rare cases where it is needed.
>> - Generalize the "chunk"-logic from `PhaseChaitin::Select()` to allow arbitrary-sized chunks, and also move most of the logic into register mask methods to separate concerns and to make the `PhaseChaitin::Select()` code more readable.
>> - Remove all `can_represent` checks and bailouts.
>> - Performance tuning. A particularly important change is the early-exit optimization in `RegMask::overlap`, used in the performance-sensitive method `PhaseChaitin::interfere_with_live`.
>> - Add a new test case `TestManyMethodArguments.java` and extend an old test `TestNestedSynchronize.java`.
>> 
>> ### Testing
>> 
>> - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/10178060450)
>> - `tier1` to `tier4` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64.
>> - Standard performance benchmarking. No observed conclusive overall performance degradation/improvement.
>> - Specific benchmarking of C2 compilation time. The changes increase C2 compilation time by, approximately and on average, 1% for methods that could also be compiled before this changeset (see the figure below). The reason for the degradation is further checks required in performance-sensitive code (in particular `PhaseChaitin::remove_bound_register_from_interfering_live_ranges`). I have tried optimizing in various ways, but changes I found that lead to improvement also lead to less readable code (and are, in my opinion, no...
>
> Daniel Lundén has updated the pull request incrementally with two additional commits since the last revision:
> 
>  - Formatting updates
>  - Add register mask fuzzer test

I have gone through the entire changeset now and could not find any obvious functional issue, good job Daniel!

src/hotspot/share/opto/chaitin.cpp line 1425:

> 1423:     // a physical register is found
> 1424:     if (OptoReg::is_reg(assigned)) {
> 1425:       assert(!lrg.mask().is_offset(), "sanity");

Suggestion:

      assert(!lrg.mask().is_offset(), "offset register masks can only contain stack slots");

src/hotspot/share/opto/chaitin.cpp line 1533:

> 1531:     // hesitation).
> 1532:     if (OptoReg::is_valid(reg2) &&
> 1533:         OptoReg::is_reg(reg2 - lrg.mask().offset_bits())) {

I agree that this was probably an oversight in the original code. For simplicity I suggest to replace the check with just `OptoReg::is_reg(reg2)` as you suggest, explicitly limiting the scope of the alternation heuristic to physical registers. I compared the overall effectiveness of post-allocation copy removal (as summarized by `-XX:+PrintOptoStatistics`) between this changeset and your proposed simplification and I cannot see any significant difference. I really wonder if the entire alternation heuristic really has any positive measurable effect, but that investigation belongs to another RFE.

src/hotspot/share/opto/chaitin.cpp line 1591:

> 1589:         // will be a no-op.  (Later on, if lrg runs out of possible colors in
> 1590:         // its chunk, a new chunk of color may be tried, in which case
> 1591:         // examination of neighbors is started again, at retry_next_chunk.)

Doesn't the second part of the comment (`(Later on...)`) still apply after the changes?

src/hotspot/share/opto/chaitin.cpp line 1655:

> 1653:       // Bump register mask up to next stack chunk
> 1654:       bool success = lrg->rollover();
> 1655:       if (!success) {

Was this scenario (running out of stack slots representable in `OptoRegPairs`) possible before, or was it prevented by some check removed in the changeset? Did you come across it in some compilation or is it more of a "theoretical" guard?

src/hotspot/share/opto/chaitin.cpp line 1658:

> 1656:         // We should never get here in practice. Bail out in product,
> 1657:         // assert in debug.
> 1658:         assert(false, "should not happen");

Suggestion:

        assert(false, "the next available stack slots should be within the OptoRegPair range");

src/hotspot/share/opto/chaitin.cpp line 1660:

> 1658:         assert(false, "should not happen");
> 1659:         C->record_method_not_compilable(
> 1660:             "chunk-rollover outside of OptoReg range");

Suggestion:

            "chunk-rollover outside of OptoRegPair range");

src/hotspot/share/opto/regmask.hpp line 282:

> 280:       _grow(src._rm_size, false);
> 281:       memcpy(_RM_UP_EXT, src._RM_UP_EXT,
> 282:              sizeof(uintptr_t) * (src._rm_size - _RM_SIZE));

This code is not very well covered by current tests, please consider adding some tests to `test_regmask.cpp` to exercise it.

src/hotspot/share/opto/regmask.hpp line 293:

> 291:         _hwm = _rm_max();
> 292:       }
> 293:       _set_range(src._rm_size, value, _rm_size - src._rm_size);

This code is not very well covered by current tests, please consider adding some tests to `test_regmask.cpp` to exercise it.

test/jdk/java/lang/invoke/BigArityTest.java line 32:

> 30:  *            (1) have a large number of parameters, and
> 31:  *            (2) use JSR292 methods internally (which increases the
> 32:  *                MaxNodeLimit with a factor of 3)

Just checking: these methods that cause C2 to consume an excessive amount of memory were not C2-compilable before the changeset, right?

-------------

Changes requested by rcastanedalo (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/20404#pullrequestreview-2733231312
PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2023172642
PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2023154419
PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2023156355
PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2023177582
PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2023175078
PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2023174027
PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2023183358
PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2023184495
PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2023195229