RFR: 8345485: C2 MergeLoads: merge adjacent array/native memory loads into larger load [v4]

Emanuel Peter epeter at openjdk.org
Tue Mar 18 09:05:17 UTC 2025


On Tue, 18 Mar 2025 07:37:12 GMT, kuaiwei <duke at openjdk.org> wrote:

>> In this patch, I extent the merge stores optimization to merge adjacents loads. Tier1 tests are passed in my machine.
>> 
>> The benchmark result of MergeLoadBench.java
>> AMD EPYC 9T24 96-Core Processor:
>> 
>> |name   | -MergeLoads   | +MergeLoads   |delta|
>> |---|---|---|---|
>> |MergeLoadBench.getCharB        |4352.150       |4407.435       | 55.29 |
>> |MergeLoadBench.getCharBU       |4075.320       |4084.663       | 9.34 |
>> |MergeLoadBench.getCharBV       |3221.302       |3221.528       | 0.23 |
>> |MergeLoadBench.getCharC        |2235.433       |2238.796       | 3.36 |
>> |MergeLoadBench.getCharL        |4363.244       |4372.281       | 9.04 |
>> |MergeLoadBench.getCharLU       |4072.550       |4075.744       | 3.19 |
>> |MergeLoadBench.getCharLV       |2227.825       |2231.612       | 3.79 |
>> |MergeLoadBench.getIntB |11199.935      |6869.030       | -4330.90 |
>> |MergeLoadBench.getIntBU        |6853.862       |2763.923       | -4089.94 |
>> |MergeLoadBench.getIntBV        |306.953        |309.911        | 2.96 |
>> |MergeLoadBench.getIntL |10426.843      |6523.716       | -3903.13 |
>> |MergeLoadBench.getIntLU        |6740.847       |2602.701       | -4138.15 |
>> |MergeLoadBench.getIntLV        |2233.151       |2231.745       | -1.41 |
>> |MergeLoadBench.getIntRB        |11335.756      |8980.619       | -2355.14 |
>> |MergeLoadBench.getIntRBU       |7439.873       |3190.208       | -4249.66 |
>> |MergeLoadBench.getIntRL        |16323.040      |7786.842       | -8536.20 |
>> |MergeLoadBench.getIntRLU       |7457.745       |3364.140       | -4093.61 |
>> |MergeLoadBench.getIntRU        |2512.621       |2511.668       | -0.95 |
>> |MergeLoadBench.getIntU |2501.064       |2500.629       | -0.43 |
>> |MergeLoadBench.getLongB        |21175.442      |21103.660      | -71.78 |
>> |MergeLoadBench.getLongBU       |14042.046      |2512.784       | -11529.26 |
>> |MergeLoadBench.getLongBV       |606.448        |606.171        | -0.28 |
>> |MergeLoadBench.getLongL        |23142.178      |23217.785      | 75.61 |
>> |MergeLoadBench.getLongLU       |14112.972      |2237.659       | -11875.31 |
>> |MergeLoadBench.getLongLV       |2230.416       |2231.224       | 0.81 |
>> |MergeLoadBench.getLongRB       |21152.558      |21140.583      | -11.98 |
>> |MergeLoadBench.getLongRBU      |14031.178      |2520.317       | -11510.86 |
>> |MergeLoadBench.getLongRL       |23248.506      |23136.410      | -112.10 |
>> |MergeLoadBench.getLongRLU      |14125.032      |2240.481       | -11884.55 |
>> |Merg...
>
> kuaiwei has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Revert extract value and add more tests

Ok, I gave it a quick first scan, and have some more questions and suggestions :)

src/hotspot/share/opto/memnode.cpp line 1853:

> 1851:  *             +---> Or2 <----+          |
> 1852:  *                    |                  |
> 1853:  *                    +-----> Or3 <------+

The code above has masking, the graph not. Can you add an explanatory comment, please ;)

src/hotspot/share/opto/memnode.cpp line 1855:

> 1853:  *                    +-----> Or3 <------+
> 1854:  *
> 1855:  * It will be transformed as a merged LoadI and replace the Or3 node

Suggestion:

 * It is transformed as a merged LoadI, which replaces the Or3 node.

src/hotspot/share/opto/memnode.cpp line 1862:

> 1860: /*
> 1861:  * LoadNode and OrNode pair which represent an item for merging,
> 1862:  * And we can get some properties like shift and last_op from it.

To me it seems that `load` and `shift` are the relevant elements here, and the `or` and `last_op` are the "side-info", only used if it is the last op. Is that correct?

src/hotspot/share/opto/memnode.cpp line 1877:

> 1875:                                                            _last_op(false) {}
> 1876:   void set_last_op(bool v)            { _last_op = v; }
> 1877:   bool last_op()                const { return _last_op; }

Is it feasible to make all fields `const`? It can often make reasoning about code easier if you know that there can be no modifications. Not sure about `_last_op`, I'll have to keep reading to find out.

src/hotspot/share/opto/memnode.cpp line 1896:

> 1894:     LowToHigh,          // Adjacent and first load access low address
> 1895:     HighToLow,          // Adjacent and first load access high address
> 1896:     NotAdjacent         // Not adjacent

What happens if an `OrNode` has its inputs swapped? This can happen if the node idx are the "wrong way around". See `commute`, comment `Otherwise, sort inputs (commutativity) to help value numbering`.

I don't know how likely this is to happen. What do you think?

src/hotspot/share/opto/memnode.cpp line 1902:

> 1900:   PhaseGVN* const      _phase;
> 1901:   LoadNode* const       _load;
> 1902:   int          _last_op_index;    // Index of the last item in merged_list

What is the `merged_list`? I could not find it.

src/hotspot/share/opto/memnode.cpp line 1976:

> 1974: // Go through ConvI2L which is unique output of the load
> 1975: Node* MergePrimitiveLoads::by_pass_i2l(const LoadNode* l) {
> 1976:   if ( l != nullptr && l->outcnt() == 1 && l->unique_out()->Opcode() == Op_ConvI2L) {

Suggestion:

  if (l != nullptr && l->outcnt() == 1 && l->unique_out()->Opcode() == Op_ConvI2L) {

src/hotspot/share/opto/memnode.cpp line 1979:

> 1977:     return l->unique_out();
> 1978:   } else {
> 1979:     return (Node*)l;

Hmm, I don't like casting away `const`... is there a way to avoid this?

src/hotspot/share/opto/memnode.cpp line 2522:

> 2520:       MergePrimitiveLoads merge(phase, this);
> 2521:       Node* merged = merge.run();
> 2522:       if (merged != nullptr) { return merged; }

I'm a little confused here... So imagine we have a `LoadB` here. How can we now return a `LoadI` instead, and replace all uses of the `LoadB` with the `LoadI`? Should we not be replacing the `OrI` instead?

-------------

PR Review: https://git.openjdk.org/jdk/pull/24023#pullrequestreview-2693479230
PR Review Comment: https://git.openjdk.org/jdk/pull/24023#discussion_r2000489282
PR Review Comment: https://git.openjdk.org/jdk/pull/24023#discussion_r2000478996
PR Review Comment: https://git.openjdk.org/jdk/pull/24023#discussion_r2000495348
PR Review Comment: https://git.openjdk.org/jdk/pull/24023#discussion_r2000484074
PR Review Comment: https://git.openjdk.org/jdk/pull/24023#discussion_r2000505715
PR Review Comment: https://git.openjdk.org/jdk/pull/24023#discussion_r2000508791
PR Review Comment: https://git.openjdk.org/jdk/pull/24023#discussion_r2000516974
PR Review Comment: https://git.openjdk.org/jdk/pull/24023#discussion_r2000512944
PR Review Comment: https://git.openjdk.org/jdk/pull/24023#discussion_r2000526992


More information about the hotspot-compiler-dev mailing list