RFR: 8345485: C2 MergeLoads: merge adjacent array/native memory loads into larger load [v15]

Fri May 2 10:37:56 UTC 2025

On Sun, 27 Apr 2025 11:53:31 GMT, kuaiwei <duke at openjdk.org> wrote:

>> In this patch, I extent the merge stores optimization to merge adjacents loads. Tier1 tests are passed in my machine.
>> 
>> The benchmark result of MergeLoadBench.java
>> AMD EPYC 9T24 96-Core Processor:
>> 
>> |name   | -MergeLoads   | +MergeLoads   |delta|
>> |---|---|---|---|
>> |MergeLoadBench.getCharB        |4352.150       |4407.435       | 55.29 |
>> |MergeLoadBench.getCharBU       |4075.320       |4084.663       | 9.34 |
>> |MergeLoadBench.getCharBV       |3221.302       |3221.528       | 0.23 |
>> |MergeLoadBench.getCharC        |2235.433       |2238.796       | 3.36 |
>> |MergeLoadBench.getCharL        |4363.244       |4372.281       | 9.04 |
>> |MergeLoadBench.getCharLU       |4072.550       |4075.744       | 3.19 |
>> |MergeLoadBench.getCharLV       |2227.825       |2231.612       | 3.79 |
>> |MergeLoadBench.getIntB |11199.935      |6869.030       | -4330.90 |
>> |MergeLoadBench.getIntBU        |6853.862       |2763.923       | -4089.94 |
>> |MergeLoadBench.getIntBV        |306.953        |309.911        | 2.96 |
>> |MergeLoadBench.getIntL |10426.843      |6523.716       | -3903.13 |
>> |MergeLoadBench.getIntLU        |6740.847       |2602.701       | -4138.15 |
>> |MergeLoadBench.getIntLV        |2233.151       |2231.745       | -1.41 |
>> |MergeLoadBench.getIntRB        |11335.756      |8980.619       | -2355.14 |
>> |MergeLoadBench.getIntRBU       |7439.873       |3190.208       | -4249.66 |
>> |MergeLoadBench.getIntRL        |16323.040      |7786.842       | -8536.20 |
>> |MergeLoadBench.getIntRLU       |7457.745       |3364.140       | -4093.61 |
>> |MergeLoadBench.getIntRU        |2512.621       |2511.668       | -0.95 |
>> |MergeLoadBench.getIntU |2501.064       |2500.629       | -0.43 |
>> |MergeLoadBench.getLongB        |21175.442      |21103.660      | -71.78 |
>> |MergeLoadBench.getLongBU       |14042.046      |2512.784       | -11529.26 |
>> |MergeLoadBench.getLongBV       |606.448        |606.171        | -0.28 |
>> |MergeLoadBench.getLongL        |23142.178      |23217.785      | 75.61 |
>> |MergeLoadBench.getLongLU       |14112.972      |2237.659       | -11875.31 |
>> |MergeLoadBench.getLongLV       |2230.416       |2231.224       | 0.81 |
>> |MergeLoadBench.getLongRB       |21152.558      |21140.583      | -11.98 |
>> |MergeLoadBench.getLongRBU      |14031.178      |2520.317       | -11510.86 |
>> |MergeLoadBench.getLongRL       |23248.506      |23136.410      | -112.10 |
>> |MergeLoadBench.getLongRLU      |14125.032      |2240.481       | -11884.55 |
>> |Merg...
>
> kuaiwei has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Fix build error on mac and windows

I scanned just a part of the code quickly, and have some more comments :)

src/hotspot/share/opto/addnode.cpp line 833:

> 831:   void dump() {
> 832:     tty->print_cr("MergeLoadInfo: load: %s(%d), combine: %d, shift: %d",
> 833:                   _load->Name(), _load->_idx, _combine->_idx, _shift);

What do I get if I dump and invalid `MergeLoadInfo`?

src/hotspot/share/opto/addnode.cpp line 868:

> 866: private:
> 867:   // Detect if the embedding combine node is last one of combine operators
> 868:   bool has_no_merge_load_combine_below( ) const;

Suggestion:

  bool has_no_merge_load_combine_below() const;

src/hotspot/share/opto/addnode.cpp line 974:

> 972: // Construct merge information item from input load
> 973: MergeLoadInfo MergePrimitiveLoads::merge_load_info(LoadNode* load) const {
> 974:   const MergeLoadInfo invalid = MergeLoadInfo();

I would suggest to drop this line, and replace all uses with a `MergeLoadInfo::make_invalid()`, which just returns you such an invalid `MergeLoadInfo`.

src/hotspot/share/opto/addnode.cpp line 975:

> 973: MergeLoadInfo MergePrimitiveLoads::merge_load_info(LoadNode* load) const {
> 974:   const MergeLoadInfo invalid = MergeLoadInfo();
> 975:   const Node* check = bypass_i2l(load);

What does `check` stand for? Might `load_use` be more descriptive?

src/hotspot/share/opto/addnode.cpp line 981:

> 979:   // Check the Load node has the pattern "(Or (LShift (Load .. ) ConI) ..)" or "(Or (Load ..) ..)"
> 980:   for (DUIterator_Fast imax, iter = check->fast_outs(imax); iter < imax; iter++) {
> 981:     Node *out = check->fast_out(iter);

Do you need to loop here?
Do all cases we expect to optimize only have a single use of `check`?
Or what exactly would happen here if we had multiple uses?

Could be good if you had a regression test that triggers such a case with multiple uses here.

src/hotspot/share/opto/addnode.cpp line 990:

> 988:           shift = 0;
> 989:         } else {
> 990:           // Too much Or usages

"Or usages" are countable. "much" is for uncountable, "many" for countable ;)
Suggestion:

          // Too many Or usages

src/hotspot/share/opto/addnode.cpp line 1001:

> 999:         if (shift_oper->outcnt() != 1 ||                                               // Shift should has only one usage
> 1000:             !is_supported_combine_opcode(shift_oper->unique_out()->Opcode()) ||    // Not used by combine operator
> 1001:             !shift_oper->in(2)->is_ConI()) {                                         // Not shift by constant

Please fix the alignment of the comments :)

src/hotspot/share/opto/addnode.cpp line 1039:

> 1037:   // go up through combine operators to find load node
> 1038:   LoadNode* load = nullptr;
> 1039:   Node* oper = _combine;

Can you show some patterns, i.e. what would `load` and `oper` be after this?

src/hotspot/share/opto/addnode.cpp line 1054:

> 1052:     } else {
> 1053:       // not found
> 1054:       add_operators_to_worklist(_combine);

Why are you doing this? If an input still needs to be transformed, then it should be put onto the work list by the inputs of that operator. And not by `combine`, i.e. the use of that operator.

Plus: if those operators are now transformed, would we actually ever get back here and attempt optimizing again? Your flag is now already set with `set_merge_memops_checked`, so we would not get here again, right?

-------------

PR Review: https://git.openjdk.org/jdk/pull/24023#pullrequestreview-2811640092
PR Review Comment: https://git.openjdk.org/jdk/pull/24023#discussion_r2071398672
PR Review Comment: https://git.openjdk.org/jdk/pull/24023#discussion_r2071391711
PR Review Comment: https://git.openjdk.org/jdk/pull/24023#discussion_r2071399977
PR Review Comment: https://git.openjdk.org/jdk/pull/24023#discussion_r2071405460
PR Review Comment: https://git.openjdk.org/jdk/pull/24023#discussion_r2071406853
PR Review Comment: https://git.openjdk.org/jdk/pull/24023#discussion_r2071400902
PR Review Comment: https://git.openjdk.org/jdk/pull/24023#discussion_r2071402266
PR Review Comment: https://git.openjdk.org/jdk/pull/24023#discussion_r2071419625
PR Review Comment: https://git.openjdk.org/jdk/pull/24023#discussion_r2071423301