RFR: 8345485: C2 MergeLoads: merge adjacent array/native memory loads into larger load [v4]

Tue Jun 17 11:24:32 UTC 2025

On Mon, 16 Jun 2025 07:07:37 GMT, Emanuel Peter <epeter at openjdk.org> wrote:

>> @eme64 I added some detail to pseudo code. I hope it can explain my design. 
>> A difficult part of the optimization is that we need find the right combine operator which can be replaced with merged load. Especially it has successor combine operator. We need to find which one is better candidate.
>> For a given combine operator, the `MergePrimitiveLoad::run()` will do like this:
>> 
>> ```c++
>> void MergePrimitiveLoad::run(AddNode* _check) {
>>    // go down through the unique output of _check to collect successor combine operators
>>    GrowableArray<AddNode*> rest_combine_operators = collect_successor_combine_operators(_check)
>> 
>>    if ( !rest_combine_operator.is_empty() ) {
>>        for (AddNode* op in rest_combine_operators) {
>>            // collect mergeable load info list from this combine operator
>>            merge_list = collect_merge_info_list(op)
>>            if (_check is in merge_list) {
>>               // we will not optimize _check, it will be merged when op is optimized
>>               return;
>>            }
>>        }
>>    }
>> 
>>    // all successor combine operators are checked, we can start to optimize the given _check operator
>>    ...
>> }
>> 
>> // For a given combine operator, collect mergeable load info list. Every item
>> // in the result list is a tuple of (Load, combine, shift_value)
>> // It's similar with MergePrimitive::collect_merge_list() in previous patch, more detail can be found in code.
>> 
>> GrowableArray<MergeLoadInfo> collect_merge_info_list(AddNode* combine) {
>>    // get the load from input of combine
>>    LoadNode* load = get_load_from_input(combine);
>>    if (load == nullptr) return;
>> 
>>    MemNode* mem = memory input of load;
>> 
>>    for every output of mem {
>>       if (output->isa_load()) {
>>          // make a merged info for this load node
>>          info = merge_load_info(output);
>>       }
>>       if ( !info->is_invalid() ) {
>>          append info into result list
>>       }
>>    }
>> 
>>    // check the merged bytes and if they are adjacent.
>>    ...
>> 
>>    return result list;
>> }
>
> @kuaiwei Thanks for the additional detail, that was helpful!
> 
> To me, it seems that this test here is almost too rigorous, and hence it might be more expensive than necessary. And that is also what you were worried about, right? You were worried that some checks were done over and over again, and that is why you wanted to cache something. Correct?
> 
>        for (AddNode* op in rest_combine_operators) {
>            // collect mergeable load info list from this combine operator
>            merge_list = collect_merge_info_list(op)
>            if (_check is in merge_list) {
>               // we will not optimize _check, it will be merged when op is optimized
>               return;
>            }
>        }
> 
> 
> Can you say why you need to do all the traversals you do in `collect_merge_info_list`? Why do you need to check all all other uses of the mem-input of the load there?
> 
> I was wondering if it is not sufficient to just check if the structure on the other side of `op` has the correct shift and load address offset, so that it could be merged in with the merge list. Do you know what I mean?

Hi @eme64 , as you guess, I think `collect_merge_info_list()` will be invoke multiple times for the same node. If a cache can be used, it can save much cost.

My understanding about your idea is checking the successor `OrNode` to find the load address and shift offset, to see if it can be adjacent to current combine operator. 

I have difficult in such a case. For example, there are 8 `LoadByte` and they can be merged as a `LoadLong`. So there are 8 groups of merge info (load, combine, shift) . If current combine operator is in group 4, and the successor combine operator is in group 6. They are not adjacent. The pattern may be rare but it is still a valid graph. From the viewpoint of group 4 , I didn't know if they can be merged or not. I need continue to check output of the successor to see if we can find the missing part. And it may locate in the input chain of current combine operator. So I need check both direction ( input and output) of current combine operator.

So my design is from the  memory node and collect all mergeable group, and get the max merged group.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/24023#issuecomment-2979972756