RFR: 8345485: C2 MergeLoads: merge adjacent array/native memory loads into larger load [v15]

Fri May 2 10:04:51 UTC 2025

On Sun, 27 Apr 2025 11:53:31 GMT, kuaiwei <duke at openjdk.org> wrote:

>> In this patch, I extent the merge stores optimization to merge adjacents loads. Tier1 tests are passed in my machine.
>> 
>> The benchmark result of MergeLoadBench.java
>> AMD EPYC 9T24 96-Core Processor:
>> 
>> |name   | -MergeLoads   | +MergeLoads   |delta|
>> |---|---|---|---|
>> |MergeLoadBench.getCharB        |4352.150       |4407.435       | 55.29 |
>> |MergeLoadBench.getCharBU       |4075.320       |4084.663       | 9.34 |
>> |MergeLoadBench.getCharBV       |3221.302       |3221.528       | 0.23 |
>> |MergeLoadBench.getCharC        |2235.433       |2238.796       | 3.36 |
>> |MergeLoadBench.getCharL        |4363.244       |4372.281       | 9.04 |
>> |MergeLoadBench.getCharLU       |4072.550       |4075.744       | 3.19 |
>> |MergeLoadBench.getCharLV       |2227.825       |2231.612       | 3.79 |
>> |MergeLoadBench.getIntB |11199.935      |6869.030       | -4330.90 |
>> |MergeLoadBench.getIntBU        |6853.862       |2763.923       | -4089.94 |
>> |MergeLoadBench.getIntBV        |306.953        |309.911        | 2.96 |
>> |MergeLoadBench.getIntL |10426.843      |6523.716       | -3903.13 |
>> |MergeLoadBench.getIntLU        |6740.847       |2602.701       | -4138.15 |
>> |MergeLoadBench.getIntLV        |2233.151       |2231.745       | -1.41 |
>> |MergeLoadBench.getIntRB        |11335.756      |8980.619       | -2355.14 |
>> |MergeLoadBench.getIntRBU       |7439.873       |3190.208       | -4249.66 |
>> |MergeLoadBench.getIntRL        |16323.040      |7786.842       | -8536.20 |
>> |MergeLoadBench.getIntRLU       |7457.745       |3364.140       | -4093.61 |
>> |MergeLoadBench.getIntRU        |2512.621       |2511.668       | -0.95 |
>> |MergeLoadBench.getIntU |2501.064       |2500.629       | -0.43 |
>> |MergeLoadBench.getLongB        |21175.442      |21103.660      | -71.78 |
>> |MergeLoadBench.getLongBU       |14042.046      |2512.784       | -11529.26 |
>> |MergeLoadBench.getLongBV       |606.448        |606.171        | -0.28 |
>> |MergeLoadBench.getLongL        |23142.178      |23217.785      | 75.61 |
>> |MergeLoadBench.getLongLU       |14112.972      |2237.659       | -11875.31 |
>> |MergeLoadBench.getLongLV       |2230.416       |2231.224       | 0.81 |
>> |MergeLoadBench.getLongRB       |21152.558      |21140.583      | -11.98 |
>> |MergeLoadBench.getLongRBU      |14031.178      |2520.317       | -11510.86 |
>> |MergeLoadBench.getLongRL       |23248.506      |23136.410      | -112.10 |
>> |MergeLoadBench.getLongRLU      |14125.032      |2240.481       | -11884.55 |
>> |Merg...
>
> kuaiwei has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Fix build error on mac and windows

Continuing our conversation from here:
https://github.com/openjdk/jdk/pull/24023#discussion_r2063697381

> I added a flag, _merge_memops_checked in AddNode. It will be checked when search MergePrimitiveLoads::has_no_merge_load_combine_below() . So these cases can be handled. And tests are added.

I'm not a great fan of additional flags in nodes. It is not something I have ever seen. It also increases the size of every `AddNode`, which requires more memory. And there are a lot of nodes that inherit from `AddNode`, and many of them are not even relevant for this optimization, right?

Plus: you might have checked a node before, and marked it as `merge_memops_checked`. But then its inputs could be optimized, to a point where that node could be optimized again. Maybe we don't care too much about that, and don't expect that to happen at the time we run the MergeStores and MergeLoads phase. So maybe this aspect is ok, even if not perfect.

I would feel better if we could just do it with pattern matching only.
Why can you not just go down to the `Or` below, and then look at the other side of that `Or`, and see if that branch could be merged in or not?

src/hotspot/share/opto/addnode.cpp line 1035:

> 1033:     return nullptr;
> 1034:   }
> 1035:   _combine->set_merge_memops_checked(true);

Referenced in comment.

src/hotspot/share/opto/addnode.cpp line 1041:

> 1039:   Node* oper = _combine;
> 1040:   NOT_PRODUCT(int steps = 0;)    // prevent dead loop in bad graph
> 1041:   while (load == nullptr NOT_PRODUCT(&& steps < 30)) {

And just saw this when flying by.
What "bad graph" is this? What is the "dead loop" here?
What happens in product, since there you don't have this check?

-------------

PR Review: https://git.openjdk.org/jdk/pull/24023#pullrequestreview-2811617778
PR Review Comment: https://git.openjdk.org/jdk/pull/24023#discussion_r2071379962
PR Review Comment: https://git.openjdk.org/jdk/pull/24023#discussion_r2071377036