RFR: 8345485: C2 MergeLoads: merge adjacent array/native memory loads into larger load [v4]

Tue Jun 17 11:36:32 UTC 2025

On Tue, 17 Jun 2025 11:21:48 GMT, Kuai Wei <kwei at openjdk.org> wrote:

>> @kuaiwei Thanks for the additional detail, that was helpful!
>> 
>> To me, it seems that this test here is almost too rigorous, and hence it might be more expensive than necessary. And that is also what you were worried about, right? You were worried that some checks were done over and over again, and that is why you wanted to cache something. Correct?
>> 
>>        for (AddNode* op in rest_combine_operators) {
>>            // collect mergeable load info list from this combine operator
>>            merge_list = collect_merge_info_list(op)
>>            if (_check is in merge_list) {
>>               // we will not optimize _check, it will be merged when op is optimized
>>               return;
>>            }
>>        }
>> 
>> 
>> Can you say why you need to do all the traversals you do in `collect_merge_info_list`? Why do you need to check all all other uses of the mem-input of the load there?
>> 
>> I was wondering if it is not sufficient to just check if the structure on the other side of `op` has the correct shift and load address offset, so that it could be merged in with the merge list. Do you know what I mean?
>
> Hi @eme64 , as you guess, I think `collect_merge_info_list()` will be invoke multiple times for the same node. If a cache can be used, it can save much cost.
> 
> My understanding about your idea is checking the successor `OrNode` to find the load address and shift offset, to see if it can be adjacent to current combine operator. 
> 
> I have difficult in such a case. For example, there are 8 `LoadByte` and they can be merged as a `LoadLong`. So there are 8 groups of merge info (load, combine, shift) . If current combine operator is in group 4, and the successor combine operator is in group 6. They are not adjacent. The pattern may be rare but it is still a valid graph. From the viewpoint of group 4 , I didn't know if they can be merged or not. I need continue to check output of the successor to see if we can find the missing part. And it may locate in the input chain of current combine operator. So I need check both direction ( input and output) of current combine operator.
> 
> So my design is from the  memory node and collect all mergeable group, and get the max merged group.

@kuaiwei I see. If there are multiple groups, then things look more difficult.

@merykitty Once proposed the idea of not doing MergeStores / MergeLoads as IGVN optimizations, but rather to just have a separate and dedicated phase. At the time, I was against it, because I had already implemented `MergeStores` quite far. But now I'm starting to see it as a possibly better alternative.

That would allow you to take a global view, collect all loads (and stores), put them in a big list, and then make groups that belong together. And then see which groups could be legally replaced with a single load / store. In a way, that is a global vectorizer. And we could handle other patterns than just merging loads and stores: we could also merge copy patterns, for example. That could be much more powerful than the current approach. And it would avoid the issue with having to determine if the current node in IGVN is the best "candidate", or if we should look for another node further down.

I don't know what you think about this complete "rethink" of the approach. But I do think it would be more powerful, and also avoid having to cache results during IGVN. All the "cached" results are local to that dedicated "MergeMemopsPhase" or whatever we would call it.

What do you think?

-------------

PR Comment: https://git.openjdk.org/jdk/pull/24023#issuecomment-2980007259