RFR: 8345485: C2 MergeLoads: merge adjacent array/native memory loads into larger load [v4]

Kuai Wei kwei at openjdk.org
Thu Jun 19 08:16:33 UTC 2025


On Tue, 17 Jun 2025 11:33:36 GMT, Emanuel Peter <epeter at openjdk.org> wrote:

>> Hi @eme64 , as you guessed, I think `collect_merge_info_list()` will be invoked multiple times for the same node. If a cache can be used, it can save a lot of that cost.
>> 
>> My understanding of your idea is to check the successor `OrNode` to find the load address and shift offset, and see whether it can be adjacent to the current combine operator.
>> 
>> I have difficulty with one case. For example, there are 8 `LoadByte` nodes and they can be merged into a `LoadLong` (a sketch of this kind of pattern follows below). So there are 8 groups of merge info (load, combine, shift). If the current combine operator is in group 4 and the successor combine operator is in group 6, they are not adjacent. The pattern may be rare, but it is still a valid graph. From the viewpoint of group 4, I cannot tell whether they can be merged or not. I would need to keep checking the outputs of the successor to see if we can find the missing part, and it may be located in the input chain of the current combine operator. So I would need to check both directions (input and output) of the current combine operator.
>> 
>> So my design starts from the memory node, collects all mergeable groups, and picks the largest merged group.
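(To illustrate the case described above: a minimal Java sketch of the byte-combining pattern in question. The method name and exact shape are illustrative only, not code from this PR. Each array access becomes a `LoadByte`, and the shift/`Or` chain forms the 8 groups of (load, combine, shift).)

```java
// Illustrative sketch only, not code from the PR:
// 8 byte loads combined into one long. C2 sees 8 LoadByte nodes,
// shift nodes, and a chain of Or nodes; ideally MergeLoads would
// replace them with a single LoadLong.
static long readLongLE(byte[] a, int off) {
    return  ((long) (a[off]     & 0xFF))
          | ((long) (a[off + 1] & 0xFF) <<  8)
          | ((long) (a[off + 2] & 0xFF) << 16)
          | ((long) (a[off + 3] & 0xFF) << 24)
          | ((long) (a[off + 4] & 0xFF) << 32)
          | ((long) (a[off + 5] & 0xFF) << 40)
          | ((long) (a[off + 6] & 0xFF) << 48)
          | ((long) (a[off + 7] & 0xFF) << 56);
}
```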
>
> @kuaiwei I see. If there are multiple groups, then things look more difficult.
> 
> @merykitty once proposed the idea of not doing MergeStores / MergeLoads as IGVN optimizations, but rather having a separate, dedicated phase. At the time, I was against it, because I had already taken the `MergeStores` implementation quite far. But now I'm starting to see it as a possibly better alternative.
> 
> That would allow you to take a global view, collect all loads (and stores), put them in a big list, and then make groups that belong together. And then see which groups could be legally replaced with a single load / store. In a way, that is a global vectorizer. And we could handle other patterns than just merging loads and stores: we could also merge copy patterns, for example. That could be much more powerful than the current approach. And it would avoid the issue with having to determine if the current node in IGVN is the best "candidate", or if we should look for another node further down.
> 
> I don't know what you think about this complete "rethink" of the approach. But I do think it would be more powerful, and also avoid having to cache results during IGVN. All the "cached" results are local to that dedicated "MergeMemopsPhase" or whatever we would call it.
> 
> What do you think?

@eme64 , it sounds like a good idea. I think one benefit is that we can run the 'merge memory' optimization before auto vectorization, so it can expose more opportunities for it. I'm not clear about the copy pattern you mentioned. Can you give an example as a reference?
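For instance (just my guess, not confirmed in this thread), is it something like the following byte-wise copy, which a dedicated phase could replace with a single wider load and store?

```java
// Hypothetical example of a "copy pattern" (my reading, not from the thread):
// four adjacent byte copies that a dedicated merge phase could turn into
// one int-sized load followed by one int-sized store.
static void copy4(byte[] src, int s, byte[] dst, int d) {
    dst[d]     = src[s];
    dst[d + 1] = src[s + 1];
    dst[d + 2] = src[s + 2];
    dst[d + 3] = src[s + 3];
}
```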

-------------

PR Comment: https://git.openjdk.org/jdk/pull/24023#issuecomment-2987150398
